[Swift] Object replication failures on newly upgraded servers
Hi,
I'm in the process of upgrading a Swift cluster from 2.7/Mitaka to 2.23/Train. While in general it seems to be going well, I'm noticing non-zero object replication failure counts on the upgraded nodes only, e.g.:
$ curl http://localhost:6000/recon/replication/object
{
  "replication_last": 1622156911.019487,
  "replication_stats": {
    "rsync": 40580,
    "success": 4141229,
    "attempted": 2081856,
    "remove": 4083,
    "suffix_count": 14960481,
    "failure": 26550,
    "hashmatch": 4127197,
    "failure_nodes": {
      "10.11.18.67": {"obj08": 2348, "obj09": 60, "obj10": 3030, "obj02": 34, "obj03": 25, "obj01": 44, "obj06": 1498, "obj07": 28, "obj04": 69, "obj05": 36},
      "10.11.18.68": {"obj03": 6901, "obj01": 293, "obj06": 1901, "obj04": 10281, "obj10": 1},
      "10.12.18.76": {"obj10": 1}
    },
    "suffix_sync": 1785,
    "suffix_hash": 2778
  },
  "object_replication_last": 1622156911.019487,
  "replication_time": 1094.7836411476135,
  "object_replication_time": 1094.7836411476135
}
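(To pull out just the per-node failure counts from that recon output, piping it through jq works - assuming jq is available on the node:

$ curl -s http://localhost:6000/recon/replication/object | jq '.replication_stats.failure_nodes'

)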
Examining the logs (/var/log/swift/object.log and /var/log/syslog), these aren't throwing up any red flags (i.e. no failed rsyncs noted). Any suggestions on how to get more information about what went wrong? E.g. for "10.11.18.67": {"obj08": 2348}, how do I find out what those 2348 failures were?
regards
Mark
P.S. Basic sanity checking is OK - uploaded objects go where they should and can be retrieved from either 2.7 or 2.23 servers (the old- and new-version servers agree about object placement).
On Fri, 28 May 2021 16:58:10 +1200 Mark Kirkwood mark.kirkwood@catalyst.net.nz wrote:
Examining the logs (/var/log/swift/object.log and /var/log/syslog), these aren't throwing up any red flags (i.e. no failed rsyncs noted).
You should be seeing tracebacks and "Error syncing partition", "Error syncing handoff partition", or "Exception in top-level replication loop".
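A quick grep along these lines should turn them up, assuming the log locations you mentioned:

grep -E 'Error syncing (handoff )?partition|Exception in top-level replication loop' /var/log/swift/object.log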
-- Pete
On 3/06/21 6:22 pm, Pete Zaitcev wrote:
On Fri, 28 May 2021 16:58:10 +1200 Mark Kirkwood mark.kirkwood@catalyst.net.nz wrote:
Examining the logs (/var/log/swift/object.log and /var/log/syslog), these aren't throwing up any red flags (i.e. no failed rsyncs noted).
You should be seeing tracebacks and "Error syncing partition", "Error syncing handoff partition", or "Exception in top-level replication loop".
Thanks Pete!
Debugging during the upgrade was tricky, as errors were clearly being caused while each storage node was down being rebuilt. However, the upgrade process is now complete, so I'm looking at this more closely.
Picking on one storage node, I do see a reasonable number (63 in the last 5 days) of:
Jun 30 04:07:29 cat-hlz-ostor003 object-server: Error syncing with node: {'index': 2, u'replication_port': 6000, u'weight': 6.0, u'zone': 10, u'ip': u'x.x.x.x', u'region': 10, u'id': 18, u'replication_ip': u'x.x.x.x', u'meta': u'', u'device': u'obj03', u'port': 6000}: Timeout (60s)
So this looks like the source (or at least *one* source) of the issue - and it also explains why I'm not seeing any failed rsyncs (we're not getting that far).
Also seeing a small number (1 in the last 5 days) of:
Jun 30 06:40:34 cat-hlz-ostor003 object-server: Error syncing partition: LockTimeout (10s) /srv/node/obj06/objects-20/544096/.lock
So, I need to figure out why we are timing out!
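To get a feel for how widespread this is, I'm counting the timeout errors per target device with something like the following (assuming the same log location as above):

grep 'Error syncing with node' /var/log/swift/object.log | grep -o "'device': u'obj[0-9]*'" | sort | uniq -c | sort -rn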
regards
Mark
On Wed, 30 Jun 2021 13:32:54 +1200 Mark Kirkwood mark.kirkwood@catalyst.net.nz wrote:
Jun 30 06:40:34 cat-hlz-ostor003 object-server: Error syncing partition: LockTimeout (10s) /srv/node/obj06/objects-20/544096/.lock
Why do you have 20 policies? Sounds rather unusual.
So, I need to figure out why we are timing out!
Sorry, I don't have enough operator experience with this. In my case it's just not enough workers for the number of nodes, but I'm sure your setup is more complex.
-- Pete
On 30/06/21 4:11 pm, Pete Zaitcev wrote:
On Wed, 30 Jun 2021 13:32:54 +1200 Mark Kirkwood mark.kirkwood@catalyst.net.nz wrote:
Jun 30 06:40:34 cat-hlz-ostor003 object-server: Error syncing partition: LockTimeout (10s) /srv/node/obj06/objects-20/544096/.lock
Why do you have 20 policies? Sounds rather unusual.
So, I need to figure out why we are timing out!
Sorry, I don't have enough operator experience with this. In my case it's just not enough workers for the number of nodes, but I'm sure your setup is more complex.
Thanks Pete! No, we only have 4 policies - but we numbered the additional ones to match the region numbers (10, 20, 30)!
Hah! I was just looking at the object-server worker count (e.g. 16 on a node with 32 cores) and thinking 'hmmm...' when I saw your message. Will experiment with increasing that.
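For the record, the knobs I'm planning to experiment with are in /etc/swift/object-server.conf - the values below are just a first guess to try, not recommendations:

[DEFAULT]
# currently 16 on a 32-core node; try one worker per core
workers = 32

[object-replicator]
# run more replication jobs in parallel (the default is 1)
concurrency = 4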
Cheers
Mark