<div dir="ltr">We track and prominently display the time since the last replication cycle completed some minutes after a ring was deployed (the raw data is available in recon data [1]), and we also monitor counts of handoff partitions per device (aggregated per node and cluster-wide) [2].<div><br></div><div>You could also try to confirm that you can observe the dreaded "Lockup detected.. killing live coros" message [3] and perhaps take some operational action based on that...</div><div><br></div><div><div>-Clay</div></div><div><br></div><div>1. <a href="http://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring">http://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring</a><br></div><div>2. Basically: look on disk, compare to the ring, and say which partitions are handoffs; sample that every so often. The "compare to ring, say which ones are handoffs" part looks basically like this: <a href="https://gist.github.com/clayg/90143abc1c34e259752bf333f485a37e">https://gist.github.com/clayg/90143abc1c34e259752bf333f485a37e</a>. The "look on disk" and "sample that every so often" parts don't currently have prescriptive implementations I can refer you to.</div><div><div>3. <a href="https://bugs.launchpad.net/swift/+bug/1575277">https://bugs.launchpad.net/swift/+bug/1575277</a> </div></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jan 16, 2017 at 5:11 PM, Mark Kirkwood <span dir="ltr">&lt;<a href="mailto:mark.kirkwood@catalyst.net.nz" target="_blank">mark.kirkwood@catalyst.net.nz</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

We suffered a hung object replicator recently. In the process of sorting that out, some questions came to mind:<br>

<br>

1/ Reliably determining if a replicator has hung (or just has nothing to do)<br>

<br>

2/ Determining how behind replication is<br>

<br>

<br>

Now the output of swift-recon combined with the dispersion report certainly *suggests* that (say in case 1) there is work to do but nothing is happening. However, is there a known way to determine that 'ok chaps the replicator has hung...'?<br>

<br>

<br>

Along the same lines, the next question I'm being asked is about 2/ 'How behind/how much work is left for the replicator'? From previous reading of the code, it looks like the replicator creates jobs (each of which is a partition + a set of suffixes), so is there a way to poke the daemon and ask something like 'how many jobs do you have to go this run'?<br>

<br>

<br>

regards<br>

<br>

<br>

Mark<br>

<br>

<br>


</blockquote></div><br></div>
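<div>The "compare to ring, say which ones are handoffs" step and the "time since the last replication cycle completed" check from the reply above can be sketched in Python. This is only a rough illustration under stated assumptions, not the linked gist and not Swift's own code: the function names, the <code>local_dev_id</code> parameter, the <code>get_part_nodes</code> callable (assumed to return the primary nodes for a partition as dicts with an <code>"id"</code> key, the way a ring lookup would), and the staleness threshold are all hypothetical. How the partitions on disk are listed and how often the check is sampled are left out, as in the reply.</div>
<pre>
```python
import time


def find_handoff_partitions(local_dev_id, partitions_on_disk, get_part_nodes):
    """Return partitions found on a device that the ring does not assign to it.

    local_dev_id:       hypothetical ring id of the device being inspected
    partitions_on_disk: partition numbers found by listing that device
    get_part_nodes:     callable(part) returning the primary nodes as dicts
                        with an "id" key (assumed shape of a ring lookup)
    """
    handoffs = []
    for part in partitions_on_disk:
        primary_ids = {node["id"] for node in get_part_nodes(part)}
        if local_dev_id not in primary_ids:
            # Present on disk, but this device is not a primary: a handoff.
            handoffs.append(part)
    return handoffs


def replication_is_stale(last_completed, max_age_seconds, now=None):
    """Return True if the last completed cycle is older than max_age_seconds.

    last_completed:  Unix timestamp of the last completed replication cycle
                     (for example, taken from recon data)
    max_age_seconds: hypothetical alerting threshold chosen by the operator
    """
    if now is None:
        now = time.time()
    return (now - last_completed) > max_age_seconds
```
</pre>
<div>One possible use is to sample <code>find_handoff_partitions</code> per device every so often, aggregate the counts per node and cluster-wide, and alert when <code>replication_is_stale</code> returns true.</div>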