<div dir="auto"><div>Thanks for the postmortem; it's always a good read to learn stuff :)<br><div class="gmail_extra"><br><div class="gmail_quote">On 28 Oct 2017 00:11, "Ben Nemec" <<a href="mailto:openstack@nemebean.com">openstack@nemebean.com</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

As you may or may not have noticed all ovb jobs on rh1 started failing sometime last night.  After some investigation today I found a few issues.<br>

<br>

First, our nova db archiving wasn't working.  This was due to the auto-increment counter issue described by melwitt in <a href="http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html" rel="noreferrer" target="_blank">http://lists.openstack.org/pip<wbr>ermail/openstack-dev/2017-Sept<wbr>ember/122903.html</a>  Deleting the problematic rows from the shadow table got us past that.<br>

<br>

On another db-related note, we seem to have turned ceilometer back on at some point in rh1.  I think that was intentional to avoid notification queues backing up, but it led to a different problem.  We had approximately 400 GB of mongodb data from ceilometer that we don't actually care about.  I cleaned that up and set a TTL in ceilometer so hopefully this won't happen again.<br></blockquote></div></div></div><div dir="auto">Is there an alarm or something we could set to get notified about this kind of stuff? Or better yet, something we could automate to avoid this? What's using mongodb nowadays?</div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Unfortunately neither of these things completely resolved the extreme slowness in the cloud that was causing every testenv to fail.  After trying a number of things that made no difference, the culprit seems to have been rabbitmq.  There was nothing obviously wrong with it according to the web interface, the queues were all short and messages seemed to be getting delivered.  However, when I ran rabbitmqctl status at the CLI it reported that the node was down.  Since something was clearly wrong I went ahead and restarted it.  After that everything seems to be back to normal.<br></blockquote></div></div></div><div dir="auto">Same question as above: could we set an alarm or automate the node recovery?</div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

I'm not sure exactly what the cause of all this was.  We did get kind of inundated with jobs yesterday after a zuul restart which I think is what probably pushed us over the edge, but that has happened before without bringing the cloud down.  It was probably a combination of some previously unnoticed issues stacking up over time and the large number of testenvs requested all at once.<br>

<br>

In any case, testenvs are creating successfully again and the jobs in the queue look good so far.  If you notice any problems please let me know though.  I'm hoping this will help with the job timeouts, but that remains to be seen.<br>

<br>

-Ben<br>

<br>


</blockquote></div><br></div></div></div>