[openstack-dev] [tripleo] rh1 outage today

Ben Nemec openstack at nemebean.com
Fri Oct 27 21:11:13 UTC 2017


Hi all,

As you may or may not have noticed, all ovb jobs on rh1 started failing 
sometime last night.  After some investigation today I found a few issues.

First, our nova db archiving wasn't working.  This was due to the 
auto-increment counter issue described by melwitt in 
http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
Deleting the problematic rows from the shadow table got us past that.
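
For anyone who runs into the same thing, the recovery was roughly the 
following (a sketch only; shadow_instances is just an example here, the 
duplicate key errors name whichever shadow_* table is affected):

   # archiving deleted rows was failing with duplicate key errors
   nova-manage db archive_deleted_rows --max_rows 1000

   # drop the shadow rows whose ids collide with rows still in the live
   # table, then re-run the archive command above
   mysql nova -e "DELETE FROM shadow_instances WHERE id IN (SELECT id FROM instances);"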

On another db-related note, we seem to have turned ceilometer back on in 
rh1 at some point.  I think that was intentional, to avoid notification 
queues backing up, but it led to a different problem: we had 
approximately 400 GB of mongodb data from ceilometer that we don't 
actually care about.  I cleaned that up and set a TTL in ceilometer so 
hopefully this won't happen again.
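
For reference, the TTL is just a config option; what I set is roughly 
equivalent to the following (the path, service name, and the three-day 
value are illustrative for an RDO-style install):

   # expire metering samples after ~3 days instead of keeping them forever
   crudini --set /etc/ceilometer/ceilometer.conf database metering_time_to_live 259200
   systemctl restart openstack-ceilometer-collector

   # one-off cleanup of the old mongodb data we don't care about
   mongo ceilometer --eval "db.dropDatabase()"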

Unfortunately, neither of these things completely resolved the extreme 
slowness in the cloud that was causing every testenv to fail.  After I 
tried a number of things that made no difference, the culprit turned out 
to be rabbitmq.  There was nothing obviously wrong with it according to 
the web interface: the queues were all short and messages seemed to be 
getting delivered.  However, when I ran rabbitmqctl status at the CLI it 
reported that the node was down.  Since something was clearly wrong, I 
went ahead and restarted it.  After that everything seems to be back to 
normal.
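
For the record, the check and restart were nothing more exotic than 
something like this (the service name assumes the stock rabbitmq-server 
packaging; a pacemaker-managed rabbit would need pcs instead):

   rabbitmqctl status                  # reported the node as down
   systemctl restart rabbitmq-server
   rabbitmqctl status                  # healthy again after the restart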

I'm not sure exactly what the cause of all this was.  We did get 
inundated with jobs yesterday after a zuul restart, which is probably 
what pushed us over the edge, but that has happened before without 
bringing the cloud down.  It was most likely a combination of previously 
unnoticed issues stacking up over time and the large number of testenvs 
requested all at once.

In any case, testenvs are being created successfully again and the jobs 
in the queue look good so far.  If you notice any problems, please let 
me know.  I'm hoping this will help with the job timeouts, but that 
remains to be seen.

-Ben


