[openstack-dev] [tripleo] rh1 outage today

Ben Nemec openstack at nemebean.com
Mon Oct 30 22:14:44 UTC 2017


It turns out this wasn't _quite_ resolved yet.  I was still seeing some 
excessively long stack creation times today, and it turned out that one 
of our compute nodes had hardware virtualization turned off.  This 
caused all of its instances to fail and need a retry.  Once I disabled 
the compute service on that node, stacks seemed to be creating in a 
normal amount of time again.
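
In case it's useful to anyone, taking a compute node out of the 
scheduling pool is a one-liner.  Roughly (the hostname and reason here 
are made up, and the equivalent openstack CLI command works too):

    nova service-disable --reason "virt disabled in BIOS" \
        compute-0.example.com nova-compute

A matching nova service-enable puts it back once the node is fixed.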

This happened because the node had some hardware issues, and apparently 
the fix was to replace the system board, so we got it back with 
everything set to defaults.  I turned virtualization back on, 
re-enabled the node, and all seems well again.

On 10/28/2017 02:07 AM, Juan Antonio Osorio wrote:
> Thanks for the postmortem; it's always a good read to learn stuff :)
> 
> On 28 Oct 2017 00:11, "Ben Nemec" <openstack at nemebean.com> wrote:
> 
>     Hi all,
> 
>     As you may or may not have noticed, all OVB jobs on rh1 started
>     failing sometime last night.  After some investigation today I found
>     a few issues.
> 
>     First, our nova db archiving wasn't working.  This was due to the
>     auto-increment counter issue described by melwitt in
>     http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
>     Deleting the problematic rows from the shadow table got us past that.
> 
>     On another db-related note, we seem to have turned ceilometer back
>     on at some point in rh1.  I think that was intentional to avoid
>     notification queues backing up, but it led to a different problem. 
>     We had approximately 400 GB of mongodb data from ceilometer that we
>     don't actually care about.  I cleaned that up and set a TTL in
>     ceilometer so hopefully this won't happen again.
> 
> Is there an alarm or something we could set to get notified about this 
> kind of stuff? Or better yet, something we could automate to avoid this? 
> What's using mongodb nowadays?

Setting a TTL should avoid this in the future.  Note that I don't think 
mongo is used by default anymore, but it was in the Mitaka version 
we're running.
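
For reference, the TTL is just a couple of config options.  On our 
Mitaka version it's roughly the following (the option names have moved 
around between releases, so double-check against your ceilometer.conf; 
the 259200 second / 3 day retention is only an example):

    # values are in seconds; pick whatever retention makes sense
    crudini --set /etc/ceilometer/ceilometer.conf database metering_time_to_live 259200
    crudini --set /etc/ceilometer/ceilometer.conf database event_time_to_live 259200

With the mongo backend the driver should build a TTL index from that 
once the ceilometer services are restarted, so old samples age out on 
their own.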

For the nova archiving thing I think we'd have to set up email 
notifications for failed cron jobs.  That would be a good RFE.
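
Something as simple as a MAILTO line in the crontab that runs the 
archive job would probably get us most of the way there.  A sketch, 
with a made-up address and schedule:

    MAILTO=rh1-admins@example.com
    # chronic (from moreutils) swallows output unless the command fails,
    # so cron only sends mail when the archive run actually breaks
    0 2 * * * chronic nova-manage db archive_deleted_rows --max_rows 10000

Without something like chronic, cron would mail every run that produces 
any output, failure or not.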

> 
> 
>     Unfortunately neither of these things completely resolved the
>     extreme slowness in the cloud that was causing every testenv to
>     fail.  After trying a number of things that made no difference, the
>     wrong with it according to the web interface: the queues were all
>     wrong with it according to the web interface, the queues were all
>     short and messages seemed to be getting delivered.  However, when I
>     ran rabbitmqctl status at the CLI it reported that the node was
>     down.  Since something was clearly wrong I went ahead and restarted
>     it.  After that everything seems to be back to normal.
> 
> Same question as above, could we set an alarm or automate the node 
> recovery?

On this one I have no idea.  As I noted, when I looked at the rabbit 
web UI everything looked fine.  This isn't like the notification queue 
problem, where one look at the queue lengths made it obvious something 
was wrong.  Messages were being delivered successfully, just very, very 
slowly.  Maybe looking at messages per second would help, but that 
would be hard to automate.  You'd have to know whether few messages 
were going through because of performance issues or because the cloud 
was just under light load.
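
If anyone does want to experiment, the management plugin exposes the 
same rates the web UI graphs, so a rough check could look something 
like this (credentials and host are placeholders):

    # overall publish/deliver rates for the broker as a whole
    curl -s -u guest:guest http://localhost:15672/api/overview \
        | python -m json.tool | grep -A 2 '_details'

    # and the check that actually caught the problem this time
    rabbitmqctl status

The hard part is still deciding what a "bad" rate looks like without a 
human interpreting the numbers.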

I guess it's also worth noting that at some point this cloud is going 
away in favor of RDO cloud.  Of course, we said that back in December 
when we discussed the OVS port exhaustion issue, and now, 11 months 
later, it still hasn't happened.  Still, that planned retirement is why 
I haven't been too inclined to pursue extensive monitoring for the 
existing cloud.

> 
> 
>     I'm not sure exactly what the cause of all this was.  We did get
>     kind of inundated with jobs yesterday after a zuul restart which I
>     think is what probably pushed us over the edge, but that has
>     happened before without bringing the cloud down.  It was probably a
>     combination of some previously unnoticed issues stacking up over
>     time and the large number of testenvs requested all at once.
> 
>     In any case, testenvs are creating successfully again and the jobs
>     in the queue look good so far.  If you notice any problems please
>     let me know though.  I'm hoping this will help with the job
>     timeouts, but that remains to be seen.
> 
>     -Ben
> 