[Openstack] Replication error
Mike Preston
mike.preston at synety.com
Thu Sep 26 07:59:44 UTC 2013
I know it is poor form to reply to yourself, but I would appreciate it if anyone has any insight into this problem.
Mike Preston
Infrastructure Team | SYNETY
www.synety.com
direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000
From: Mike Preston [mailto:mike.preston at synety.com]
Sent: 24 September 2013 09:52
To: openstack at lists.openstack.org
Subject: Re: [Openstack] Replication error
root at storage-proxy-01:~/swift# swift-ring-builder object.builder validate
root at storage-proxy-01:~/swift# echo $?
0
I ran md5sum on the ring files on both the proxy (where we generate them) and the nodes and confirmed that they are identical.
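For reference, a minimal Python equivalent of that md5sum check, in case anyone wants to script it across the nodes (the ring path below is the usual default and an assumption on my part):

    import hashlib

    # Hash the serialized ring the same way md5sum does, then compare the
    # digest between the proxy and every storage node.
    with open('/etc/swift/object.ring.gz', 'rb') as f:   # assumed default path
        print(hashlib.md5(f.read()).hexdigest())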
root at storage-proxy-01:~/swift# swift-ring-builder object.builder
object.builder, build version 72
65536 partitions, 3 replicas, 4 zones, 32 devices, 999.99 balance
The minimum number of hours before a partition can be reassigned is 3
Devices:    id  zone       ip address  port  name   weight  partitions  balance  meta
             0     1      10.20.15.51  6000  sdb1  3000.00        7123     1.44
             1     1      10.20.15.51  6000  sdc1  3000.00        7123     1.44
             2     1      10.20.15.51  6000  sdd1  3000.00        7122     1.43
             3     1      10.20.15.51  6000  sde1  3000.00        7123     1.44
             4     1      10.20.15.51  6000  sdf1  3000.00        7122     1.43
             5     1      10.20.15.51  6000  sdg1  3000.00        7123     1.44
             6     3      10.20.15.51  6000  sdh1     0.00        1273   999.99
             7     3      10.20.15.51  6000  sdi1     0.00        1476   999.99
             8     2      10.20.15.52  6000  sdb1  3000.00        7122     1.43
             9     2      10.20.15.52  6000  sdc1  3000.00        7122     1.43
            10     2      10.20.15.52  6000  sdd1  3000.00        7122     1.43
            11     2      10.20.15.52  6000  sde1  3000.00        7122     1.43
            12     2      10.20.15.52  6000  sdf1  3000.00        7122     1.43
            13     2      10.20.15.52  6000  sdg1  3000.00        7122     1.43
            14     3      10.20.15.52  6000  sdh1     0.00        1378   999.99
            15     3      10.20.15.52  6000  sdi1     0.00         997   999.99
            16     3      10.20.15.53  6000  sas0  3000.00        6130   -12.70
            17     3      10.20.15.53  6000  sas1  3000.00        6130   -12.70
            18     3      10.20.15.53  6000  sas2  3000.00        6129   -12.71
            19     3      10.20.15.53  6000  sas3  3000.00        6130   -12.70
            20     3      10.20.15.53  6000  sas4  3000.00        6130   -12.70
            21     3      10.20.15.53  6000  sas5  3000.00        6130   -12.70
            22     3      10.20.15.53  6000  sas6  3000.00        6129   -12.71
            23     3      10.20.15.53  6000  sas7  3000.00        6129   -12.71
            24     4      10.20.15.54  6000  sas0  3000.00        7122     1.43
            25     4      10.20.15.54  6000  sas1  3000.00        7122     1.43
            26     4      10.20.15.54  6000  sas2  3000.00        7123     1.44
            27     4      10.20.15.54  6000  sas3  3000.00        7123     1.44
            28     4      10.20.15.54  6000  sas4  3000.00        7122     1.43
            29     4      10.20.15.54  6000  sas5  3000.00        7122     1.43
            30     4      10.20.15.54  6000  sas6  3000.00        7123     1.44
            31     4      10.20.15.54  6000  sas7  3000.00        7122     1.43
(We are currently migrating data between boxes due to cluster hardware replacement, which is why the zone 3 devices on the first two nodes are weighted at zero.)
File list attached (for the objects/ directory on the devices), but I see nothing out of place.
I'll run a full fsck on the drives tonight to try to rule that out.
Thanks for your help.
Mike Preston
Infrastructure Team | SYNETY
www.synety.com
direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000
From: Clay Gerrard [mailto:clay.gerrard at gmail.com]
Sent: 23 September 2013 20:34
To: Mike Preston
Cc: openstack at lists.openstack.org
Subject: Re: [Openstack] Replication error
Run `swift-ring-builder /etc/swift/object.builder validate` - it should have no errors and exit 0. Can you provide a paste of the output from `swift-ring-builder /etc/swift/object.builder` as well - it should list some general info about the ring (number of replicas, and list of devices). Rebalance the ring and make sure it's been distributed to all nodes.
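If it's easier to check from Python on the node itself, here is a minimal sketch that prints the same general info, using only the ring attributes visible in the traceback (devs and _replica2part2dev_id); the ring path is an assumed default:

    from swift.common.ring import Ring

    # Load the ring the replicator actually uses and print the basics that
    # swift-ring-builder would report: device count, replicas, partitions.
    ring = Ring('/etc/swift/object.ring.gz')              # assumed default path
    print('devices : %d' % len([d for d in ring.devs if d is not None]))
    print('replicas: %d' % len(ring._replica2part2dev_id))
    print('parts   : %d' % len(ring._replica2part2dev_id[0]))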
The particular line you're seeing pop up in the traceback seems to be looking up all of the nodes for a particular partition it found in the objects dir. I'm not seeing any local sanitization [1] around those top-level directory names, so maybe it's just some garbage that got created there outside of swift, or some filesystem corruption?
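Roughly, going by those two frames, the failing path looks like the sketch below. The directory name here is hypothetical, but any entry under objects/ whose name parses as a number at or above the ring's partition count would index past the end of the replica-to-device arrays in exactly this way:

    from swift.common.ring import Ring

    ring = Ring('/etc/swift/object.ring.gz')   # assumed default path
    partition = '123456'                       # hypothetical stray dir name under objects/

    # Mirrors get_part_nodes(int(partition)) from the traceback: each row of
    # _replica2part2dev_id has one entry per partition, so an out-of-range
    # partition number raises "IndexError: array index out of range".
    nodes = [ring.devs[r[int(partition)]] for r in ring._replica2part2dev_id]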
Can you provide the output from `ls /srv/node/objects` (or wherever you have devices configured)?
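And if it helps narrow that listing down, a rough sketch of a scan for entries that could trip that int()/index path; /srv/node is an assumption here, so adjust it to whatever your object-server devices setting points at:

    import os

    devices_path = '/srv/node'   # assumption; match your object-server devices= setting
    part_count = 65536           # set to your object ring's partition count

    # Flag anything in each device's objects/ dir that is not a plain partition
    # number in range; such entries are candidates for what the replicator
    # trips over when it turns directory names into partition numbers.
    for device in sorted(os.listdir(devices_path)):
        objects_dir = os.path.join(devices_path, device, 'objects')
        if not os.path.isdir(objects_dir):
            continue
        for name in os.listdir(objects_dir):
            if not name.isdigit() or int(name) >= part_count:
                print('suspect: %s' % os.path.join(objects_dir, name))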
-Clay
1. https://bugs.launchpad.net/swift/+bug/1229372
On Mon, Sep 23, 2013 at 2:34 AM, Mike Preston <mike.preston at synety.com> wrote:
Hi,
We are seeing a replication error on swift. The error is only seen on a single node; the other nodes appear to be working fine.
The installed version is Debian Wheezy with swift 1.4.8-2+deb7u1.
Sep 23 10:33:03 storage-node-01 object-replicator Starting object replication pass.
Sep 23 10:33:03 storage-node-01 object-replicator Exception in top-level replication loop:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 564, in replicate
    jobs = self.collect_jobs()
  File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 536, in collect_jobs
    self.object_ring.get_part_nodes(int(partition))
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 103, in get_part_nodes
    return [self.devs[r[part]] for r in self._replica2part2dev_id]
IndexError: array index out of range
Sep 23 10:33:03 storage-node-01 object-replicator Nothing replicated for 0.728466033936 seconds.
Sep 23 10:33:03 storage-node-01 object-replicator Object replication complete. (0.01 minutes)
Can anyone shed any light on this, or suggest next steps for debugging or fixing it?
Mike Preston
Infrastructure Team | SYNETY
www.synety.com
direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000