We track and prominently display the time since the last replication
cycle completed some minutes after a ring was deployed (the raw data is
available in recon data [1]) and also monitor counts of handoff
partitions per device (aggregated per node and cluster-wide) [2].

You could also try to confirm you can observe the dreaded "Lockup
detected.. killing live coros" message [3] and perhaps take some
operational action based on that...

-Clay

1. http://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring
2. Basically: look on disk, compare to ring, say which ones are handoffs,
   and sample that every so often. The "compare to ring, say which ones
   are handoffs" part looks basically like this
   https://gist.github.com/clayg/90143abc1c34e259752bf333f485a37e - the
   "look on disk" and "sample that every so often" parts don't currently
   have prescriptive implementations I can refer you to.
3. https://bugs.launchpad.net/swift/+bug/1575277

On Mon, Jan 16, 2017 at 5:11 PM, Mark Kirkwood <
mark.kirkwood at catalyst.net.nz> wrote:

> Hi,
>
> We suffered a hung object replicator recently. In the process of sorting
> that out some questions came to mind:
>
> 1/ Reliably determining if a replicator has hung (or just has nothing to
> do)
>
> 2/ Determining how far behind replication is
>
>
> Now the output of swift-recon combined with the dispersion report
> certainly *suggests* that (say in case 1) there is work to do but nothing
> is happening. However, is there a known way to determine that 'ok chaps
> the replicator has hung...'?
>
>
> Along the same lines, the next question I'm being asked is about 2/ 'How
> far behind is the replicator/how much work is left for it'? From previous
> reading of the code it looks like the replicator creates jobs (each of
> which is a partition + a set of suffixes) - so is there a way to poke the
> daemon and ask something like 'how many jobs do you have to go this run'?
>
>
> regards
>
>
> Mark
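
For anyone wanting to try the handoff count from footnote [2], here is a
minimal sketch of the "look on disk, compare to ring" step. It is not the
code from the gist and not anything shipped with Swift. The function name
count_handoffs and its parameters are made up for illustration, and it
assumes the usual <device>/objects/<partition> directory layout and that
primary nodes can be matched on the 'ip' and 'device' keys of the ring's
node dicts. The ring lookup is passed in as a callable so the sketch runs
without Swift installed.

```python
import os


def count_handoffs(local_ips, device_name, objects_dir, get_part_nodes):
    """Count primary vs. handoff partitions found on one device.

    local_ips      -- set of IPs this node uses in the ring (assumption:
                      matching on the 'ip' key is good enough here)
    device_name    -- the ring's name for this device, e.g. "sdb1"
    objects_dir    -- that device's objects directory (hypothetical
                      example: /srv/node/sdb1/objects)
    get_part_nodes -- callable mapping a partition number to the list of
                      primary node dicts for that partition
    """
    primaries = 0
    handoffs = []
    for entry in os.listdir(objects_dir):
        # Partition directories are named by partition number; skip
        # anything else (temp files, lock files, and so on).
        if not entry.isdigit():
            continue
        if not os.path.isdir(os.path.join(objects_dir, entry)):
            continue
        part = int(entry)
        is_primary = any(
            node['ip'] in local_ips and node['device'] == device_name
            for node in get_part_nodes(part)
        )
        if is_primary:
            primaries += 1
        else:
            # On disk here, but the ring says it belongs elsewhere.
            handoffs.append(part)
    return primaries, sorted(handoffs)
```

On a storage node with Swift installed, get_part_nodes could probably be
Ring('/etc/swift', ring_name='object').get_part_nodes from
swift.common.ring, but treat that wiring as an assumption to check against
your version. Sampling the handoff count per device every so often and
watching whether it drains is then a rough answer to "how far behind is
replication".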