On 3/06/21 6:22 pm, Pete Zaitcev wrote:
> On Fri, 28 May 2021 16:58:10 +1200 Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
>> Examining the logs (/var/log/swift/object.log and /var/log/syslog) these are
>> not throwing up any red flags (i.e. no failing rsyncs noted).
>
> You should be seeing tracebacks and "Error syncing partition", "Error syncing
> handoff partition", or "Exception in top-level replication loop".
Thanks, Pete! Debugging during the upgrade was tricky, as errors were clearly being generated while each storage node was down being rebuilt. However, the upgrade is now complete, so I'm looking at this more closely.

Picking one storage node, I see a reasonable number (63 in the last 5 days) of:

Jun 30 04:07:29 cat-hlz-ostor003 object-server: Error syncing with node: {'index': 2, u'replication_port': 6000, u'weight': 6.0, u'zone': 10, u'ip': u'x.x.x.x', u'region': 10, u'id': 18, u'replication_ip': u'x.x.x.x', u'meta': u'', u'device': u'obj03', u'port': 6000}: Timeout (60s)

So this looks like the source (or at least *one* source) of the issue, and it also explains why I'm not seeing any failing rsyncs (we are not getting that far).

Also seeing a small number (1 in the last 5 days) of:

Jun 30 06:40:34 cat-hlz-ostor003 object-server: Error syncing partition: LockTimeout (10s) /srv/node/obj06/objects-20/544096/.lock

So, I need to figure out why we are timing out!

regards

Mark
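PS: in case it helps narrow this down, here is a rough Python sketch for tallying the "Error syncing with node ... Timeout" entries by remote replication_ip and device, to see whether the 60s timeouts cluster on one peer or are spread across the ring. The log path and message format are assumptions based on the lines quoted above, and the field names come from the node dict in that error line; adjust to taste.

#!/usr/bin/env python
# Rough sketch only: count "Error syncing with node ... Timeout" log entries
# per remote replication_ip/device. Log path and message format are assumed
# from the lines quoted earlier in this thread.
import ast
import re
from collections import Counter

LOG = '/var/log/swift/object.log'   # assumed location, as mentioned above
PATTERN = re.compile(r"Error syncing with node: (\{.*\}): Timeout")

counts = Counter()
with open(LOG) as f:
    for line in f:
        m = PATTERN.search(line)
        if not m:
            continue
        try:
            # the node is logged as a Python dict repr, so literal_eval works
            node = ast.literal_eval(m.group(1))
        except (ValueError, SyntaxError):
            continue
        counts[(node.get('replication_ip'), node.get('device'))] += 1

# most frequently timing-out peers first
for (ip, device), n in counts.most_common():
    print('%5d  %s %s' % (n, ip, device))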