[Swift] Object replication failures on newly upgraded servers

Mark Kirkwood mark.kirkwood at catalyst.net.nz
Wed Jun 30 01:32:54 UTC 2021


On 3/06/21 6:22 pm, Pete Zaitcev wrote:

> On Fri, 28 May 2021 16:58:10 +1200
> Mark Kirkwood <mark.kirkwood at catalyst.net.nz> wrote:
>
>> Examining the logs (/var/log/swift/object.log and /var/log/syslog) these
>> are not throwing up any red flags (i.e. no failing rsyncs noted).
> You should be seeing tracebacks and "Error syncing partition",
> "Error syncing handoff partition", or "Exception in top-level
> replication loop".
>

Thanks Pete!

Debugging during the upgrade was tricky, as errors were clearly being
caused whenever a storage node was down for its rebuild. However, the
upgrade process is now complete, so I'm looking at this more closely.

Picking one storage node, I do see a reasonable number (63 in the last 5
days) of:

Jun 30 04:07:29 cat-hlz-ostor003 object-server: Error syncing with node: 
{'index': 2, u'replication_port': 6000, u'weight': 6.0, u'zone': 10, 
u'ip': u'x.x.x.x', u'region': 10, u'id': 18, u'replication_ip': 
u'x.x.x.x', u'meta': u'', u'device': u'obj03', u'port': 6000}: Timeout (60s)

So this looks like the source (or at least *one* source) of the issue.
It would also explain why I'm not seeing any failing rsyncs: we are not
getting that far.


I'm also seeing a small number (1 in the last 5 days) of:

Jun 30 06:40:34 cat-hlz-ostor003 object-server: Error syncing partition: 
LockTimeout (10s) /srv/node/obj06/objects-20/544096/.lock
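
For anyone wanting to gather similar counts on their own nodes, here is a
minimal sketch of how the two message types above could be tallied from a
log file. The names (tally, PATTERNS, DEVICE_RE) are made up for
illustration, and the regexes are based only on the two messages quoted in
this mail, so they may well need adjusting for other Swift versions or log
formats. It also assumes each message sits on a single line in the real
log (they are only wrapped here by the mail client).

```python
import re
from collections import Counter

# Patterns derived only from the two messages quoted in this mail;
# treat them as assumptions, not as a complete list of replicator errors.
PATTERNS = {
    "node_timeout": re.compile(r"Error syncing with node: .*: Timeout \(\d+s\)"),
    "lock_timeout": re.compile(r"Error syncing partition: LockTimeout \(\d+s\) \S+"),
}

# Pulls the remote device name out of the logged node dict (u'' prefixes optional).
DEVICE_RE = re.compile(r"u?'device': u?'([^']+)'")


def tally(lines):
    """Count errors by kind and, for node timeouts, by remote device."""
    kinds = Counter()
    devices = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                kinds[kind] += 1
                if kind == "node_timeout":
                    match = DEVICE_RE.search(line)
                    if match:
                        devices[match.group(1)] += 1
                break  # a line is counted under at most one kind
    return kinds, devices


# The two messages from this mail, joined back onto single lines.
SAMPLE = [
    "Jun 30 04:07:29 cat-hlz-ostor003 object-server: Error syncing with node: "
    "{'index': 2, u'replication_port': 6000, u'weight': 6.0, u'zone': 10, "
    "u'ip': u'x.x.x.x', u'region': 10, u'id': 18, u'replication_ip': "
    "u'x.x.x.x', u'meta': u'', u'device': u'obj03', u'port': 6000}: Timeout (60s)",
    "Jun 30 06:40:34 cat-hlz-ostor003 object-server: Error syncing partition: "
    "LockTimeout (10s) /srv/node/obj06/objects-20/544096/.lock",
]

kinds, devices = tally(SAMPLE)
print(kinds)
print(devices)
```

To run it against a real log, pass an open file instead of SAMPLE, e.g.
tally(open("/var/log/syslog", errors="replace")). Grouping the node
timeouts by remote device is just one guess at a useful breakdown; it
might show whether the timeouts cluster on particular disks or nodes.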


So, I need to figure out why we are timing out!


regards

Mark



