[Openstack] Replication error

Mike Preston mike.preston at synety.com
Thu Sep 26 07:59:44 UTC 2013


I know it is poor form to reply to yourself, but I would appreciate it if anyone has any insight into this problem.

Mike Preston
Infrastructure Team  |  SYNETY
www.synety.com

direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000


From: Mike Preston [mailto:mike.preston at synety.com]
Sent: 24 September 2013 09:52
To: openstack at lists.openstack.org
Subject: Re: [Openstack] Replication error

root@storage-proxy-01:~/swift# swift-ring-builder object.builder validate
root@storage-proxy-01:~/swift# echo $?
0

I ran md5sum on the ring files on both the proxy (where we generate them) and the nodes and confirmed that they are identical.
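For reference, something like the following can be run on each host to produce the same comparison (a minimal sketch; it assumes the distributed ring lives at /etc/swift/object.ring.gz, which may differ on your layout):

# Print the md5 of the object ring so it can be compared host by host.
# The path below is an assumption for a standard install; adjust as needed.
import hashlib

RING_PATH = '/etc/swift/object.ring.gz'

def ring_md5(path=RING_PATH, chunk_size=65536):
    """Return the hex md5 digest of the ring file at `path`."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

if __name__ == '__main__':
    print('%s  %s' % (ring_md5(), RING_PATH))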

root@storage-proxy-01:~/swift# swift-ring-builder object.builder
object.builder, build version 72
65536 partitions, 3 replicas, 4 zones, 32 devices, 999.99 balance
The minimum number of hours before a partition can be reassigned is 3
Devices:    id  zone      ip address  port      name weight partitions balance meta
             0     1     10.20.15.51  6000      sdb1 3000.00       7123    1.44
             1     1     10.20.15.51  6000      sdc1 3000.00       7123    1.44
             2     1     10.20.15.51  6000      sdd1 3000.00       7122    1.43
             3     1     10.20.15.51  6000      sde1 3000.00       7123    1.44
             4     1     10.20.15.51  6000      sdf1 3000.00       7122    1.43
             5     1     10.20.15.51  6000      sdg1 3000.00       7123    1.44
             6     3     10.20.15.51  6000      sdh1   0.00       1273  999.99
             7     3     10.20.15.51  6000      sdi1   0.00       1476  999.99
             8     2     10.20.15.52  6000      sdb1 3000.00       7122    1.43
             9     2     10.20.15.52  6000      sdc1 3000.00       7122    1.43
            10     2     10.20.15.52  6000      sdd1 3000.00       7122    1.43
            11     2     10.20.15.52  6000      sde1 3000.00       7122    1.43
            12     2     10.20.15.52  6000      sdf1 3000.00       7122    1.43
            13     2     10.20.15.52  6000      sdg1 3000.00       7122    1.43
            14     3     10.20.15.52  6000      sdh1   0.00       1378  999.99
            15     3     10.20.15.52  6000      sdi1   0.00        997  999.99
            16     3     10.20.15.53  6000      sas0 3000.00       6130  -12.70
            17     3     10.20.15.53  6000      sas1 3000.00       6130  -12.70
            18     3     10.20.15.53  6000      sas2 3000.00       6129  -12.71
            19     3     10.20.15.53  6000      sas3 3000.00       6130  -12.70
            20     3     10.20.15.53  6000      sas4 3000.00       6130  -12.70
            21     3     10.20.15.53  6000      sas5 3000.00       6130  -12.70
            22     3     10.20.15.53  6000      sas6 3000.00       6129  -12.71
            23     3     10.20.15.53  6000      sas7 3000.00       6129  -12.71
            24     4     10.20.15.54  6000      sas0 3000.00       7122    1.43
            25     4     10.20.15.54  6000      sas1 3000.00       7122    1.43
            26     4     10.20.15.54  6000      sas2 3000.00       7123    1.44
            27     4     10.20.15.54  6000      sas3 3000.00       7123    1.44
            28     4     10.20.15.54  6000      sas4 3000.00       7122    1.43
            29     4     10.20.15.54  6000      sas5 3000.00       7122    1.43
            30     4     10.20.15.54  6000      sas6 3000.00       7123    1.44
            31     4     10.20.15.54  6000      sas7 3000.00       7122    1.43

(We are currently migrating data between boxes due to cluster hardware replacement, which is why the zone 3 devices on the first two nodes are weighted at zero.)

File list attached (for the objects/ directory on the devices), but I see nothing out of place.

I'll run a full fsck on the drives tonight to try to rule that out.

Thanks for your help.



Mike Preston
Infrastructure Team  |  SYNETY
www.synety.com

direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000


From: Clay Gerrard [mailto:clay.gerrard at gmail.com]
Sent: 23 September 2013 20:34
To: Mike Preston
Cc: openstack at lists.openstack.org
Subject: Re: [Openstack] Replication error

Run `swift-ring-builder /etc/swift/object.builder validate` - it should report no errors and exit 0.  Can you also provide a paste of the output from `swift-ring-builder /etc/swift/object.builder`?  It should list some general info about the ring (the number of replicas and the list of devices).  Rebalance the ring and make sure it has been distributed to all nodes.

The particular line you're seeing pop up in the traceback seems to be looking up all of the nodes for a particular partition it found in the objects/ dir.  I'm not seeing any local sanitization [1] around those top-level directory names, so maybe it's just some garbage that was created there outside of Swift, or some file system corruption?
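To make the failure mode concrete, here is a toy illustration of the shape of that lookup (not Swift's actual ring data, just a sketch): each replica keeps an array mapping partition to device id, so a partition number taken from a directory name that falls outside the ring's range raises exactly the "array index out of range" error you're seeing.

# Toy version of the lookup done by ring.py's get_part_nodes().
# Indexing a replica's partition->device array with a partition number
# outside the ring's range raises "IndexError: array index out of range".
from array import array

partition_count = 8                      # toy ring; the real one has 65536
devs = [{'id': 0, 'device': 'sdb1'},
        {'id': 1, 'device': 'sdc1'},
        {'id': 2, 'device': 'sdd1'}]
replica2part2dev_id = [array('H', [p % len(devs) for p in range(partition_count)])
                       for _ in range(3)]   # one array per replica (3 replicas)

def get_part_nodes(part):
    return [devs[r[part]] for r in replica2part2dev_id]

print(get_part_nodes(5))       # fine: partition inside the ring's range
print(get_part_nodes(70000))   # IndexError: array index out of range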

Can you provide the output from `ls /srv/node/objects` (or wherever you have your devices configured)?
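If the listing is long, a rough sketch like this can scan each device's objects/ directory and flag entries whose names are not valid partitions for this ring (the /srv/node path and the 65536 partition count are taken from this thread; adjust them to your devices setting):

# Flag entries under each device's objects/ directory whose names are not
# valid partition numbers -- the kind of stray directory that would make
# collect_jobs() call get_part_nodes() with an out-of-range partition.
import os

DEVICES_ROOT = '/srv/node'   # adjust to your `devices` setting
PARTITION_COUNT = 65536      # 2 ** part_power, per the builder output above

def suspicious_partitions(devices_root=DEVICES_ROOT,
                          partition_count=PARTITION_COUNT):
    """Yield (device, entry) pairs for entries that are not valid partitions."""
    for device in sorted(os.listdir(devices_root)):
        objects_dir = os.path.join(devices_root, device, 'objects')
        if not os.path.isdir(objects_dir):
            continue
        for entry in os.listdir(objects_dir):
            try:
                partition = int(entry)
            except ValueError:
                yield device, entry          # not even a number
                continue
            if not 0 <= partition < partition_count:
                yield device, entry          # number outside the ring's range

if __name__ == '__main__':
    for device, entry in suspicious_partitions():
        print('%s: unexpected entry %r in objects/' % (device, entry))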

-Clay

1. https://bugs.launchpad.net/swift/+bug/1229372

On Mon, Sep 23, 2013 at 2:34 AM, Mike Preston <mike.preston at synety.com> wrote:
Hi,

We are seeing a replication error on Swift. The error is only seen on a single node; the other nodes appear to be working fine.
The installed version is Debian Wheezy with Swift 1.4.8-2+deb7u1.

Sep 23 10:33:03 storage-node-01 object-replicator Starting object replication pass.
Sep 23 10:33:03 storage-node-01 object-replicator Exception in top-level replication loop:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 564, in replicate
    jobs = self.collect_jobs()
  File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 536, in collect_jobs
    self.object_ring.get_part_nodes(int(partition))
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 103, in get_part_nodes
    return [self.devs[r[part]] for r in self._replica2part2dev_id]
IndexError: array index out of range
Sep 23 10:33:03 storage-node-01 object-replicator Nothing replicated for 0.728466033936 seconds.
Sep 23 10:33:03 storage-node-01 object-replicator Object replication complete. (0.01 minutes)

Can anyone shed any light on this, or suggest next steps for debugging or fixing it?



Mike Preston
Infrastructure Team  |  SYNETY
www.synety.com

direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000



