On 3/06/21 6:22 pm, Pete Zaitcev wrote:
> On Fri, 28 May 2021 16:58:10 +1200 Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
>> Examining the logs (/var/log/swift/object.log and /var/log/syslog) these are
>> not throwing up any red flags (i.e. no failing rsyncs noted).
>
> You should be seeing tracebacks and "Error syncing partition", "Error syncing
> handoff partition", or "Exception in top-level replication loop".
Thanks, Pete! Debugging during the upgrade was tricky, as errors were clearly being generated while each storage node was down being rebuilt. However, the upgrade is now complete, so I'm looking at this more closely.

Picking one storage node, I see a reasonable number (63 in the last 5 days) of:

Jun 30 04:07:29 cat-hlz-ostor003 object-server: Error syncing with node: {'index': 2, u'replication_port': 6000, u'weight': 6.0, u'zone': 10, u'ip': u'x.x.x.x', u'region': 10, u'id': 18, u'replication_ip': u'x.x.x.x', u'meta': u'', u'device': u'obj03', u'port': 6000}: Timeout (60s)

So this looks like the source (or at least *one* source) of the issue, and it also explains why I'm not seeing any failing rsyncs (we are not getting that far).

Also seeing a small number (1 in the last 5 days) of:

Jun 30 06:40:34 cat-hlz-ostor003 object-server: Error syncing partition: LockTimeout (10s) /srv/node/obj06/objects-20/544096/.lock

So, I need to figure out why we are timing out!

regards

Mark
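PS: in case it helps narrow this down, here is a rough Python sketch for tallying the "Error syncing with node ... Timeout" entries by remote replication_ip and device, to see whether the 60s timeouts cluster on one peer or are spread across the ring. The log path and message format are assumptions based on the lines quoted above, and the field names come from the node dict in that error line; adjust to taste.

#!/usr/bin/env python
# Rough sketch only: count "Error syncing with node ... Timeout" log entries
# per remote replication_ip/device. Log path and message format are assumed
# from the lines quoted earlier in this thread.
import ast
import re
from collections import Counter

LOG = '/var/log/swift/object.log'   # assumed location, as mentioned above
PATTERN = re.compile(r"Error syncing with node: (\{.*\}): Timeout")

counts = Counter()
with open(LOG) as f:
    for line in f:
        m = PATTERN.search(line)
        if not m:
            continue
        try:
            # the node is logged as a Python dict repr, so literal_eval works
            node = ast.literal_eval(m.group(1))
        except (ValueError, SyntaxError):
            continue
        counts[(node.get('replication_ip'), node.get('device'))] += 1

# most frequently timing-out peers first
for (ip, device), n in counts.most_common():
    print('%5d  %s %s' % (n, ip, device))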