[Openstack] Troubleshooting Swift 1.7.4 on mini servers

Nathan Trueblood nathan at truebloodllc.com
Mon Oct 29 15:31:00 UTC 2012


Yeah, I read about the 507 error.    However, when the error occurs on my I
can see with 'df' that the drive is only 1% full and is definitely not
unmounted.   I can write files to the mounted filesystem directly before,
during, and after the Swift error occurs.   So the problem must be some
kind of timeout that is causing the object server to think that something
is wrong with the disk.

I'll keep digging...

On Fri, Oct 26, 2012 at 11:21 PM, John Dickinson <me at not.mn> wrote:

> A 507 is returned by the object servers in 2 situations: 1) the drives are
> full or 2) the drives have been unmounted because of disk error.
>
> It's highly likely that you simply have full drives. Remember that the
> usable space in your cluster is 1/N where N = replica count. As an example,
> with 3 replicas and 5 nodes with a single 1TB drive each, you only have
> about 1.6TB available for data.
>
> As Pete suggested in his response, how big are your drives, and what does
> `df` tell you?
>
> --John
>
>
> On Oct 26, 2012, at 5:26 PM, Nathan Trueblood <nathan at truebloodllc.com>
> wrote:
>
> > Hey folks-
> >
> > I'm trying to figure out what's going wrong with my Swift deployment on
> a small cluster of "mini" servers.   I have a small test cluster (5 storage
> nodes, 1 proxy) of mini-servers that are ARM-based.   The proxy is a
> regular, Intel-based server with plenty of RAM.   The
> object/account/container servers are relatively small, with 2GB of RAM per
> node.
> >
> > Everything starts up fine, but now I'm trying to troubleshoot a strange
> problem.   After I successfully upload a few test files, it seems like the
> storage system stops responding and the proxy gives me a 503 error.
> >
> > Here's the test sequence I run on my proxy:
> >
> > lab at proxy01:~/bin$ ./swiftcl.sh stat
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass stat
> >    Account: AUTH_system
> > Containers: 5
> >    Objects: 4
> >      Bytes: 47804968
> > Accept-Ranges: bytes
> > X-Timestamp: 1351294912.72119
> > lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles1 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass
> upload myfiles1 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles2 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass
> upload myfiles2 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles3 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass
> upload myfiles3 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles4 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass
> upload myfiles4 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles5 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass
> upload myfiles5 /home/lab/bigfile1
> > Object PUT failed:
> http://172.16.1.111:8080/v1/AUTH_system/myfiles5/home/lab/bigfile1 503
> Service Unavailable  [first 60 chars of response] 503 Service Unavailable
> >
> > The server is currently unavailable
> > lab at proxy01:~/bin$ ./swiftcl.sh stat
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass stat
> >    Account: AUTH_system
> > Containers: 6
> >    Objects: 5
> >      Bytes: 59756210
> > Accept-Ranges: bytes
> > X-Timestamp: 1351294912.72119
> >
> > Here's the corresponding log on the Proxy:
> >
> > Oct 26 17:06:52 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/06/52
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:13 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/13
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0017
> > Oct 26 17:07:13 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/13
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:22 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/22
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:22 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/22
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:27 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/27
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:27 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/27
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:27 proxy01 proxy-server Handoff requested (1) (txn:
> tx6946419daba54efe9c2878f8a2a78f88) (client_ip: 172.16.1.111)
> > Oct 26 17:07:27 proxy01 proxy-server Handoff requested (2) (txn:
> tx6946419daba54efe9c2878f8a2a78f88) (client_ip: 172.16.1.111)
> > Oct 26 17:07:33 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/33
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:33 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/33
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:33 proxy01 proxy-server Handoff requested (1) (txn:
> tx5f9659f74cb2491f9a63cbb84f680c5c) (client_ip: 172.16.1.111)
> > Oct 26 17:07:33 proxy01 proxy-server Handoff requested (2) (txn:
> tx5f9659f74cb2491f9a63cbb84f680c5c) (client_ip: 172.16.1.111)
> > Oct 26 17:07:39 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/39
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0009
> > Oct 26 17:07:39 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/39
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0009
> > Oct 26 17:07:39 proxy01 proxy-server Handoff requested (1) (txn:
> tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> > Oct 26 17:07:39 proxy01 proxy-server Handoff requested (2) (txn:
> tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> > Oct 26 17:07:40 proxy01 proxy-server Object PUT returning 503, 1/2
> required connections (txn: tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip:
> 172.16.1.111)
> > Oct 26 17:07:41 proxy01 proxy-server Object PUT returning 503, 1/2
> required connections (txn: tx07a1f5dfaa23445a88eaa4a2ade68466) (client_ip:
> 172.16.1.111)
> > Oct 26 17:07:43 proxy01 proxy-server Object PUT returning 503, 1/2
> required connections (txn: tx938d08b706844db3886695b798bd9fad) (client_ip:
> 172.16.1.111)
> > Oct 26 17:07:47 proxy01 proxy-server Object PUT returning 503, 1/2
> required connections (txn: txa35e9f8a54924f139e13d6f3a5dc457f) (client_ip:
> 172.16.1.111)
> > Oct 26 17:07:55 proxy01 proxy-server Object PUT returning 503, 1/2
> required connections (txn: txc2f5c35b54a14b32a2d0178423bc7771) (client_ip:
> 172.16.1.111)
> > Oct 26 17:08:11 proxy01 proxy-server Object PUT returning 503, 1/2
> required connections (txn: txd8df89e76edd40f3b4c04091b3cbc857) (client_ip:
> 172.16.1.111)
> > Oct 26 17:08:17 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/08/17
> GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0011
> >
> > And here's the log from one of the Storage Nodes:
> >
> > Oct 26 17:07:20 data05 container-server 192.168.1.111 - -
> [27/Oct/2012:00:07:20 +0000] "PUT /sda6/200589/AUTH_system/myfiles1" 201 -
> "tx063cd0643ef64d169b3bc5a86a808df4" "-" "-" 0.0510
> > Oct 26 17:07:20 data05 account-server 192.168.1.204 - -
> [27/Oct/2012:00:07:20 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 -
> "tx063cd0643ef64d169b3bc5a86a808df4" "-" "-" 0.0154 ""
> > Oct 26 17:07:20 data05 object-server 192.168.1.111 - -
> [27/Oct/2012:00:07:20 +0000] "HEAD
> /sda6/36256/AUTH_system/myfiles1/home/lab/bigfile1" 404 - "-"
> "tx6f1018e5da4d4974a85d544a0874a581" "-" 0.0015
> > Oct 26 17:07:21 data05 object-server 192.168.1.111 - -
> [27/Oct/2012:00:07:21 +0000] "PUT
> /sda6/36256/AUTH_system/myfiles1/home/lab/bigfile1" 201 - "-"
> "tx59f931d3281d41159eba0e0c88c18107" "-" 1.0255
> > Oct 26 17:07:21 data05 container-server 192.168.1.202 - -
> [27/Oct/2012:00:07:21 +0000] "PUT
> /sda6/200589/AUTH_system/myfiles1/home/lab/bigfile1" 201 -
> "tx59f931d3281d41159eba0e0c88c18107" "-" "-" 0.0218
> > Oct 26 17:07:24 data05 account-replicator Beginning replication run
> > Oct 26 17:07:25 data05 account-replicator Replication run OVER
> > Oct 26 17:07:25 data05 account-replicator Attempted to replicate 1 dbs
> in 0.21688 seconds (4.61073/s)
> > Oct 26 17:07:25 data05 account-replicator Removed 0 dbs
> > Oct 26 17:07:25 data05 account-replicator 2 successes, 0 failures
> > Oct 26 17:07:25 data05 account-replicator no_change:0 ts_repl:0 diff:2
> rsync:0 diff_capped:0 hashmatch:0 empty:0
> > Oct 26 17:07:28 data05 account-server 192.168.1.204 - -
> [27/Oct/2012:00:07:28 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 -
> "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:29 data05 account-server 192.168.1.205 - -
> [27/Oct/2012:00:07:29 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 -
> "txbace9f92a41a4645beb9159a802fc12d" "-" "-" 0.0178 ""
> > Oct 26 17:07:30 data05 object-server 192.168.1.111 - -
> [27/Oct/2012:00:07:30 +0000] "PUT
> /sda6/63086/AUTH_system/myfiles2/home/lab/bigfile1" 201 - "-"
> "txa4fac7ceab2948ae9010b4ce4cb56b97" "-" 0.7135
> > Oct 26 17:07:33 data05 object-replicator Starting object replication
> pass.
> > Oct 26 17:07:33 data05 object-replicator 4/4 (100.00%) partitions
> replicated in 0.17s (23.71/sec, 0s remaining)
> > Oct 26 17:07:33 data05 object-replicator 4 suffixes checked - 50.00%
> hashed, 0.00% synced
> > Oct 26 17:07:33 data05 object-replicator Partition times: max 0.1012s,
> min 0.0164s, med 0.0273s
> > Oct 26 17:07:33 data05 object-replicator Object replication complete.
> (0.00 minutes)
> > Oct 26 17:07:34 data05 account-server 192.168.1.206 - -
> [27/Oct/2012:00:07:34 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 -
> "txa547450c156d4df9ad7e42c04dcabd8f" "-" "-" 0.0168 ""
> > Oct 26 17:07:34 data05 container-server 192.168.1.111 - -
> [27/Oct/2012:00:07:34 +0000] "PUT /sda6/118594/AUTH_system/myfiles3" 202 -
> "txa547450c156d4df9ad7e42c04dcabd8f" "-" "-" 0.0399
> > Oct 26 17:07:35 data05 container-server 192.168.1.206 - -
> [27/Oct/2012:00:07:35 +0000] "PUT
> /sda6/118594/AUTH_system/myfiles3/home/lab/bigfile1" 201 -
> "tx6946419daba54efe9c2878f8a2a78f88" "-" "-" 0.0022
> > Oct 26 17:07:35 data05 object-server 192.168.1.111 - -
> [27/Oct/2012:00:07:35 +0000] "PUT
> /sda6/35436/AUTH_system/myfiles3/home/lab/bigfile1" 201 - "-"
> "tx6946419daba54efe9c2878f8a2a78f88" "-" 0.6806
> > Oct 26 17:07:36 data05 account-server 192.168.1.205 - -
> [27/Oct/2012:00:07:36 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 -
> "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:37 data05 object-auditor Begin object audit "forever" mode
> (ZBF)
> > Oct 26 17:07:37 data05 object-auditor Object audit (ZBF) "forever" mode
> completed: 0.04s. Total quarantined: 0, Total errors: 0, Total files/sec:
> 89.46 , Total bytes/sec: 0.00, Auditing time: 0.04, Rate: 0.95
> > Oct 26 17:07:40 data05 account-server 192.168.1.204 - -
> [27/Oct/2012:00:07:40 +0000] "PUT /sda6/116021/AUTH_system/myfiles4" 201 -
> "txe095c94f8e5c4c4681a8d6a9c330bb7b" "-" "-" 0.0200 ""
> > Oct 26 17:07:40 data05 container-server 192.168.1.111 - -
> [27/Oct/2012:00:07:40 +0000] "PUT /sda6/217542/AUTH_system/myfiles4" 202 -
> "txe095c94f8e5c4c4681a8d6a9c330bb7b" "-" "-" 0.0671
> > Oct 26 17:07:40 data05 container-server 192.168.1.206 - -
> [27/Oct/2012:00:07:40 +0000] "PUT
> /sda6/217542/AUTH_system/myfiles4/home/lab/bigfile1" 201 -
> "tx5f9659f74cb2491f9a63cbb84f680c5c" "-" "-" 0.0023
> > Oct 26 17:07:40 data05 object-server 192.168.1.111 - -
> [27/Oct/2012:00:07:40 +0000] "PUT
> /sda6/49697/AUTH_system/myfiles4/home/lab/bigfile1" 201 - "-"
> "tx5f9659f74cb2491f9a63cbb84f680c5c" "-" 0.6038
> > Oct 26 17:07:44 data05 container-updater Begin container update sweep
> > Oct 26 17:07:44 data05 container-replicator Beginning replication run
> > Oct 26 17:07:44 data05 account-server 192.168.1.206 - -
> [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 -
> "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:44 data05 account-server 192.168.1.206 - -
> [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 -
> "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:44 data05 account-server 192.168.1.206 - -
> [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles4" 201 -
> "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:45 data05 account-server 192.168.1.203 - -
> [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 -
> "-" "-" "-" 0.0023 ""
> > Oct 26 17:07:45 data05 account-server 192.168.1.203 - -
> [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 -
> "-" "-" "-" 0.0020 ""
> > Oct 26 17:07:45 data05 container-replicator Replication run OVER
> > Oct 26 17:07:45 data05 account-server 192.168.1.203 - -
> [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 -
> "-" "-" "-" 0.0223 ""
> > Oct 26 17:07:45 data05 container-replicator Attempted to replicate 5 dbs
> in 0.83001 seconds (6.02399/s)
> > Oct 26 17:07:45 data05 container-replicator Removed 0 dbs
> > Oct 26 17:07:45 data05 container-replicator 10 successes, 0 failures
> > Oct 26 17:07:45 data05 container-replicator no_change:8 ts_repl:0 diff:2
> rsync:0 diff_capped:0 hashmatch:0 empty:0
> > Oct 26 17:07:45 data05 container-updater Container update sweep
> completed: 0.98s
> > Oct 26 17:07:46 data05 account-server 192.168.1.205 - -
> [27/Oct/2012:00:07:46 +0000] "PUT /sda6/116021/AUTH_system/myfiles5" 201 -
> "tx7ba0264da03c485ca3c3b4048f6a3d8c" "-" "-" 0.0098 ""
> > Oct 26 17:07:46 data05 container-server 192.168.1.111 - -
> [27/Oct/2012:00:07:46 +0000] "PUT /sda6/122255/AUTH_system/myfiles5" 202 -
> "tx7ba0264da03c485ca3c3b4048f6a3d8c" "-" "-" 0.0465
> > Oct 26 17:07:46 data05 object-server 192.168.1.111 - -
> [27/Oct/2012:00:07:46 +0000] "HEAD
> /sda6/150861/AUTH_system/myfiles5/home/lab/bigfile1" 404 - "-"
> "txf08e8b91b2894a9da099f26b5300327b" "-" 0.0011
> > Oct 26 17:07:46 data05 object-server 192.168.1.111 - -
> [27/Oct/2012:00:07:46 +0000] "PUT
> /sda6/150861/AUTH_system/myfiles5/home/lab/bigfile1" 507 - "-"
> "tx8dc917a4a8c84c40a4429b7bab0323c6" "-" 0.0031
> >
> > Right now, there is nothing special in the object-server.conf:
> >
> > [DEFAULT]
> > bind_ip = 192.168.1.206
> > workers = 1
> >
> > [pipeline:main]
> > pipeline = object-server
> >
> > [app:object-server]
> > use = egg:swift#object
> >
> > [object-replicator]
> >
> > [object-updater]
> >
> > [object-auditor]
> >
> > [object-server]
> > set log_level = DEBUG
> >
> > My general question is: how do I troubleshoot this?   I've tried turning
> on debug logging on the Object Server, but I don't get any additional clues.
> >
> > The Object-servers do give a 507 error, which might indicate a disk
> problem, but there is nothing wrong with the storage drive.   And also if
> there was a fundamental drive problem then I wouldn't be able to upload
> objects in the first place.
> >
> > My best guess at the moment is that I should adjust the configuration of
> the object servers to reflect their lower performance.   What would I
> adjust, the cache size?   I'm not seeing any log messages that would
> indicate a time-out.
> >
> > Thanks in advance for your advice.
> >
> > -N
> > _______________________________________________
> > Mailing list: https://launchpad.net/~openstack
> > Post to     : openstack at lists.launchpad.net
> > Unsubscribe : https://launchpad.net/~openstack
> > More help   : https://help.launchpad.net/ListHelp
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack/attachments/20121029/f778dc22/attachment.html>


More information about the Openstack mailing list