[Openstack] Troubleshooting Swift 1.7.4 on mini servers

Diego Parrilla diego.parrilla.santamaria at gmail.com
Sat Oct 27 11:32:54 UTC 2012


Sorry for my off topic question, it's the first time I heard of swift storage servers running an ARM processor. 

I think ARM architectures can be an interesting alternative to replace classic computing servers, but swift storage nodes did not catch my attention as a valid alternative, until now.

From my perspective, the savings in power consumption and real state in your datacenter are minimized if you load your servers with two dozen SATA disks (a typical swift node configuration) and a dual 10GbE connection, for example.

May be there are benefits I'm not aware of, but I would really love to hear about them :-)

Cheers
Diego

Enviado desde mi iPhone, perdona la brevedad

El 27/10/2012, a las 02:26, Nathan Trueblood <nathan at truebloodllc.com> escribió:

> Hey folks-
> 
> I'm trying to figure out what's going wrong with my Swift deployment on a small cluster of "mini" servers.   I have a small test cluster (5 storage nodes, 1 proxy) of mini-servers that are ARM-based.   The proxy is a regular, Intel-based server with plenty of RAM.   The object/account/container servers are relatively small, with 2GB of RAM per node.
> 
> Everything starts up fine, but now I'm trying to troubleshoot a strange problem.   After I successfully upload a few test files, it seems like the storage system stops responding and the proxy gives me a 503 error.
> 
> Here's the test sequence I run on my proxy:
> 
> lab at proxy01:~/bin$ ./swiftcl.sh stat
> swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass stat
>    Account: AUTH_system
> Containers: 5
>    Objects: 4
>      Bytes: 47804968
> Accept-Ranges: bytes
> X-Timestamp: 1351294912.72119
> lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles1 /home/lab/bigfile1 
> swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles1 /home/lab/bigfile1
> home/lab/bigfile1
> lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles2 /home/lab/bigfile1 
> swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles2 /home/lab/bigfile1
> home/lab/bigfile1
> lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles3 /home/lab/bigfile1 
> swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles3 /home/lab/bigfile1
> home/lab/bigfile1
> lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles4 /home/lab/bigfile1 
> swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles4 /home/lab/bigfile1
> home/lab/bigfile1
> lab at proxy01:~/bin$ ./swiftcl.sh upload myfiles5 /home/lab/bigfile1 
> swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles5 /home/lab/bigfile1
> Object PUT failed: http://172.16.1.111:8080/v1/AUTH_system/myfiles5/home/lab/bigfile1 503 Service Unavailable  [first 60 chars of response] 503 Service Unavailable
> 
> The server is currently unavailable
> lab at proxy01:~/bin$ ./swiftcl.sh stat
> swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass stat
>    Account: AUTH_system
> Containers: 6
>    Objects: 5
>      Bytes: 59756210
> Accept-Ranges: bytes
> X-Timestamp: 1351294912.72119
> 
> Here's the corresponding log on the Proxy:
> 
> Oct 26 17:06:52 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/06/52 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> Oct 26 17:07:13 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/13 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0017
> Oct 26 17:07:13 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/13 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> Oct 26 17:07:22 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/22 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> Oct 26 17:07:22 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/22 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> Oct 26 17:07:27 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/27 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> Oct 26 17:07:27 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/27 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> Oct 26 17:07:27 proxy01 proxy-server Handoff requested (1) (txn: tx6946419daba54efe9c2878f8a2a78f88) (client_ip: 172.16.1.111)
> Oct 26 17:07:27 proxy01 proxy-server Handoff requested (2) (txn: tx6946419daba54efe9c2878f8a2a78f88) (client_ip: 172.16.1.111)
> Oct 26 17:07:33 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/33 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> Oct 26 17:07:33 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/33 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> Oct 26 17:07:33 proxy01 proxy-server Handoff requested (1) (txn: tx5f9659f74cb2491f9a63cbb84f680c5c) (client_ip: 172.16.1.111)
> Oct 26 17:07:33 proxy01 proxy-server Handoff requested (2) (txn: tx5f9659f74cb2491f9a63cbb84f680c5c) (client_ip: 172.16.1.111)
> Oct 26 17:07:39 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/39 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0009
> Oct 26 17:07:39 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/39 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0009
> Oct 26 17:07:39 proxy01 proxy-server Handoff requested (1) (txn: tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> Oct 26 17:07:39 proxy01 proxy-server Handoff requested (2) (txn: tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> Oct 26 17:07:40 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> Oct 26 17:07:41 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: tx07a1f5dfaa23445a88eaa4a2ade68466) (client_ip: 172.16.1.111)
> Oct 26 17:07:43 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: tx938d08b706844db3886695b798bd9fad) (client_ip: 172.16.1.111)
> Oct 26 17:07:47 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: txa35e9f8a54924f139e13d6f3a5dc457f) (client_ip: 172.16.1.111)
> Oct 26 17:07:55 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: txc2f5c35b54a14b32a2d0178423bc7771) (client_ip: 172.16.1.111)
> Oct 26 17:08:11 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: txd8df89e76edd40f3b4c04091b3cbc857) (client_ip: 172.16.1.111)
> Oct 26 17:08:17 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/08/17 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0011
> 
> And here's the log from one of the Storage Nodes:
> 
> Oct 26 17:07:20 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:20 +0000] "PUT /sda6/200589/AUTH_system/myfiles1" 201 - "tx063cd0643ef64d169b3bc5a86a808df4" "-" "-" 0.0510
> Oct 26 17:07:20 data05 account-server 192.168.1.204 - - [27/Oct/2012:00:07:20 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "tx063cd0643ef64d169b3bc5a86a808df4" "-" "-" 0.0154 ""
> Oct 26 17:07:20 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:20 +0000] "HEAD /sda6/36256/AUTH_system/myfiles1/home/lab/bigfile1" 404 - "-" "tx6f1018e5da4d4974a85d544a0874a581" "-" 0.0015
> Oct 26 17:07:21 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:21 +0000] "PUT /sda6/36256/AUTH_system/myfiles1/home/lab/bigfile1" 201 - "-" "tx59f931d3281d41159eba0e0c88c18107" "-" 1.0255
> Oct 26 17:07:21 data05 container-server 192.168.1.202 - - [27/Oct/2012:00:07:21 +0000] "PUT /sda6/200589/AUTH_system/myfiles1/home/lab/bigfile1" 201 - "tx59f931d3281d41159eba0e0c88c18107" "-" "-" 0.0218
> Oct 26 17:07:24 data05 account-replicator Beginning replication run
> Oct 26 17:07:25 data05 account-replicator Replication run OVER
> Oct 26 17:07:25 data05 account-replicator Attempted to replicate 1 dbs in 0.21688 seconds (4.61073/s)
> Oct 26 17:07:25 data05 account-replicator Removed 0 dbs
> Oct 26 17:07:25 data05 account-replicator 2 successes, 0 failures
> Oct 26 17:07:25 data05 account-replicator no_change:0 ts_repl:0 diff:2 rsync:0 diff_capped:0 hashmatch:0 empty:0
> Oct 26 17:07:28 data05 account-server 192.168.1.204 - - [27/Oct/2012:00:07:28 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "-" "-" "-" 0.0021 ""
> Oct 26 17:07:29 data05 account-server 192.168.1.205 - - [27/Oct/2012:00:07:29 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 - "txbace9f92a41a4645beb9159a802fc12d" "-" "-" 0.0178 ""
> Oct 26 17:07:30 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:30 +0000] "PUT /sda6/63086/AUTH_system/myfiles2/home/lab/bigfile1" 201 - "-" "txa4fac7ceab2948ae9010b4ce4cb56b97" "-" 0.7135
> Oct 26 17:07:33 data05 object-replicator Starting object replication pass.
> Oct 26 17:07:33 data05 object-replicator 4/4 (100.00%) partitions replicated in 0.17s (23.71/sec, 0s remaining)
> Oct 26 17:07:33 data05 object-replicator 4 suffixes checked - 50.00% hashed, 0.00% synced
> Oct 26 17:07:33 data05 object-replicator Partition times: max 0.1012s, min 0.0164s, med 0.0273s
> Oct 26 17:07:33 data05 object-replicator Object replication complete. (0.00 minutes)
> Oct 26 17:07:34 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:34 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 - "txa547450c156d4df9ad7e42c04dcabd8f" "-" "-" 0.0168 ""
> Oct 26 17:07:34 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:34 +0000] "PUT /sda6/118594/AUTH_system/myfiles3" 202 - "txa547450c156d4df9ad7e42c04dcabd8f" "-" "-" 0.0399
> Oct 26 17:07:35 data05 container-server 192.168.1.206 - - [27/Oct/2012:00:07:35 +0000] "PUT /sda6/118594/AUTH_system/myfiles3/home/lab/bigfile1" 201 - "tx6946419daba54efe9c2878f8a2a78f88" "-" "-" 0.0022
> Oct 26 17:07:35 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:35 +0000] "PUT /sda6/35436/AUTH_system/myfiles3/home/lab/bigfile1" 201 - "-" "tx6946419daba54efe9c2878f8a2a78f88" "-" 0.6806
> Oct 26 17:07:36 data05 account-server 192.168.1.205 - - [27/Oct/2012:00:07:36 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 - "-" "-" "-" 0.0021 ""
> Oct 26 17:07:37 data05 object-auditor Begin object audit "forever" mode (ZBF)
> Oct 26 17:07:37 data05 object-auditor Object audit (ZBF) "forever" mode completed: 0.04s. Total quarantined: 0, Total errors: 0, Total files/sec: 89.46 , Total bytes/sec: 0.00, Auditing time: 0.04, Rate: 0.95
> Oct 26 17:07:40 data05 account-server 192.168.1.204 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/116021/AUTH_system/myfiles4" 201 - "txe095c94f8e5c4c4681a8d6a9c330bb7b" "-" "-" 0.0200 ""
> Oct 26 17:07:40 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/217542/AUTH_system/myfiles4" 202 - "txe095c94f8e5c4c4681a8d6a9c330bb7b" "-" "-" 0.0671
> Oct 26 17:07:40 data05 container-server 192.168.1.206 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/217542/AUTH_system/myfiles4/home/lab/bigfile1" 201 - "tx5f9659f74cb2491f9a63cbb84f680c5c" "-" "-" 0.0023
> Oct 26 17:07:40 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/49697/AUTH_system/myfiles4/home/lab/bigfile1" 201 - "-" "tx5f9659f74cb2491f9a63cbb84f680c5c" "-" 0.6038
> Oct 26 17:07:44 data05 container-updater Begin container update sweep
> Oct 26 17:07:44 data05 container-replicator Beginning replication run
> Oct 26 17:07:44 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "-" "-" "-" 0.0021 ""
> Oct 26 17:07:44 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 - "-" "-" "-" 0.0021 ""
> Oct 26 17:07:44 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles4" 201 - "-" "-" "-" 0.0021 ""
> Oct 26 17:07:45 data05 account-server 192.168.1.203 - - [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 - "-" "-" "-" 0.0023 ""
> Oct 26 17:07:45 data05 account-server 192.168.1.203 - - [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "-" "-" "-" 0.0020 ""
> Oct 26 17:07:45 data05 container-replicator Replication run OVER
> Oct 26 17:07:45 data05 account-server 192.168.1.203 - - [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 - "-" "-" "-" 0.0223 ""
> Oct 26 17:07:45 data05 container-replicator Attempted to replicate 5 dbs in 0.83001 seconds (6.02399/s)
> Oct 26 17:07:45 data05 container-replicator Removed 0 dbs
> Oct 26 17:07:45 data05 container-replicator 10 successes, 0 failures
> Oct 26 17:07:45 data05 container-replicator no_change:8 ts_repl:0 diff:2 rsync:0 diff_capped:0 hashmatch:0 empty:0
> Oct 26 17:07:45 data05 container-updater Container update sweep completed: 0.98s
> Oct 26 17:07:46 data05 account-server 192.168.1.205 - - [27/Oct/2012:00:07:46 +0000] "PUT /sda6/116021/AUTH_system/myfiles5" 201 - "tx7ba0264da03c485ca3c3b4048f6a3d8c" "-" "-" 0.0098 ""
> Oct 26 17:07:46 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:46 +0000] "PUT /sda6/122255/AUTH_system/myfiles5" 202 - "tx7ba0264da03c485ca3c3b4048f6a3d8c" "-" "-" 0.0465
> Oct 26 17:07:46 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:46 +0000] "HEAD /sda6/150861/AUTH_system/myfiles5/home/lab/bigfile1" 404 - "-" "txf08e8b91b2894a9da099f26b5300327b" "-" 0.0011
> Oct 26 17:07:46 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:46 +0000] "PUT /sda6/150861/AUTH_system/myfiles5/home/lab/bigfile1" 507 - "-" "tx8dc917a4a8c84c40a4429b7bab0323c6" "-" 0.0031
> 
> Right now, there is nothing special in the object-server.conf:
> 
> [DEFAULT]
> bind_ip = 192.168.1.206
> workers = 1
> 
> [pipeline:main]
> pipeline = object-server
> 
> [app:object-server]
> use = egg:swift#object
> 
> [object-replicator]
> 
> [object-updater]
> 
> [object-auditor]
> 
> [object-server]
> set log_level = DEBUG
> 
> My general question is: how do I troubleshoot this?   I've tried turning on debug logging on the Object Server, but I don't get any additional clues.
> 
> The Object-servers do give a 507 error, which might indicate a disk problem, but there is nothing wrong with the storage drive.   And also if there was a fundamental drive problem then I wouldn't be able to upload objects in the first place.
> 
> My best guess at the moment is that I should adjust the configuration of the object servers to reflect their lower performance.   What would I adjust, the cache size?   I'm not seeing any log messages that would indicate a time-out.   
> 
> Thanks in advance for your advice.
> 
> -N
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack at lists.launchpad.net
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack/attachments/20121027/9d9272ca/attachment.html>


More information about the Openstack mailing list