[openstack-dev] [tripleo] rh1 outage today

Ben Nemec openstack at nemebean.com
Mon Oct 30 22:21:30 UTC 2017



On 10/30/2017 05:14 PM, Ben Nemec wrote:
> It turns out this wasn't _quite_ resolved yet.  I was still seeing some 
> excessively long stack creation times today and it turns out one of our 
> compute nodes had virtualization turned off.  This caused all of its 
> instances to fail and need a retry.  Once I disabled the compute service 
> on it stacks seemed to be creating in a normal amount of time again.
> 
> This happened because the node had some hardware issues, and apparently 
> the fix was to replace the system board so we got it back with 
> everything set to default.  I fixed this and re-enabled the node and all 
> seems well again.
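For reference, the sort of thing involved here is roughly the following; the host name is a placeholder and the exact client syntax varies a bit between releases:

    # Check whether hardware virtualization is actually enabled on the compute node
    egrep -c '(vmx|svm)' /proc/cpuinfo    # 0 means VT-x/AMD-V is off in the firmware

    # Take the node out of scheduling until it's fixed
    openstack compute service set --disable \
        --disable-reason "virt disabled after board swap" \
        compute-0.example.com nova-compute

    # Re-enable it once the firmware settings are restored
    openstack compute service set --enable compute-0.example.com nova-compute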
> 
> On 10/28/2017 02:07 AM, Juan Antonio Osorio wrote:
>> Thanks for the postmortem; it's always a good read to learn stuff :)
>>
>> On 28 Oct 2017 00:11, "Ben Nemec" <openstack at nemebean.com 
>> <mailto:openstack at nemebean.com>> wrote:
>>
>>     Hi all,
>>
>>     As you may or may not have noticed, all OVB jobs on rh1 started
>>     failing sometime last night.  After some investigation today I found
>>     a few issues.
>>
>>     First, our nova db archiving wasn't working.  This was due to the
>>     auto-increment counter issue described by melwitt in
>>     http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
>>
>>     Deleting the problematic rows from the shadow table got us past that.
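For anyone who hits the same thing, the archive run and the manual cleanup look roughly like this; the table name and ids below are illustrative, not the actual rows we touched:

    # The periodic archive job boils down to something like:
    nova-manage db archive_deleted_rows --max_rows 1000

    # When that dies with a duplicate-key error on a shadow table, the
    # colliding rows can be removed by hand, e.g.:
    mysql nova -e "DELETE FROM shadow_instances WHERE id IN (12345, 12346);"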
>>
>>     On another db-related note, we seem to have turned ceilometer back
>>     on at some point in rh1.  I think that was intentional to avoid
>>     notification queues backing up, but it led to a different problem. 
>>     We had approximately 400 GB of mongodb data from ceilometer that we
>>     don't actually care about.  I cleaned that up and set a TTL in
>>     ceilometer so hopefully this won't happen again.
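Concretely, that amounted to something along these lines on the controller (Mitaka-era option names, and the three-day retention is just an example value):

    # Expire old samples/events instead of keeping them forever
    crudini --set /etc/ceilometer/ceilometer.conf database metering_time_to_live 259200
    crudini --set /etc/ceilometer/ceilometer.conf database event_time_to_live 259200
    systemctl restart openstack-ceilometer-collector

With the mongodb backend the TTL is enforced by an expiring index, so old data should get dropped automatically from here on.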
>>
>> Is there an alarm or something we could set to get notified about this 
>> kind of stuff? Or better yet, something we could automate to avoid 
>> this? What's using mongodb nowadays?
> 
> Setting a TTL should avoid this in the future.  Note that I don't think 
> mongo is still used by default, but in our old Mitaka version it was.
> 
> For the nova archiving thing I think we'd have to set up email 
> notifications for failed cron jobs.  That would be a good RFE.

And done: https://bugs.launchpad.net/tripleo/+bug/1728737
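In the meantime, a low-tech version of that is just letting cron mail the output of a failed run somewhere that gets read; the address and script path below are made up:

    # /etc/cron.d/nova-archive (illustrative)
    MAILTO=tripleo-ci-admins@example.com
    # cron mails any output, so stay quiet on success and only print on failure
    0 2 * * * root /usr/local/bin/nova-archive.sh >/dev/null 2>&1 || echo "nova db archiving failed on $(hostname)"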

> 
>>
>>
>>     Unfortunately neither of these things completely resolved the
>>     extreme slowness in the cloud that was causing every testenv to
>>     fail.  After trying a number of things that made no difference, the
>>     culprit seems to have been rabbitmq.  There was nothing obviously
>>     wrong with it according to the web interface: the queues were all
>>     short and messages seemed to be getting delivered.  However, when I
>>     ran rabbitmqctl status at the CLI it reported that the node was
>>     down.  Since something was clearly wrong I went ahead and restarted
>>     it.  After that everything seems to be back to normal.
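For completeness, the check and the restart were roughly the following (service name as on a CentOS controller; adjust for your distro):

    # The management UI looked fine, but the CLI told a different story
    rabbitmqctl status           # this is what reported the node as down
    rabbitmqctl cluster_status   # worth a look too on a clustered broker

    # Restarting the broker cleared it up
    systemctl restart rabbitmq-server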
>>
>> Same question as above, could we set an alarm or automate the node 
>> recovery?
> 
> On this one I have no idea.  As I noted, when I looked at the rabbit web 
> ui everything looked fine.  This isn't like the notification queue 
> problem where one look at the queue lengths made it obvious something 
> was wrong.  Messages were being delivered successfully, just very, very 
> slowly.  Maybe looking at messages per second would help, but that would 
> be hard to automate.  You'd have to know whether few messages were 
> going through because of performance issues or because the cloud was 
> just under light load.
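If someone does want to poke at it, the management API does expose rates, e.g. something like this (guest/guest and localhost are placeholders, and it needs the rabbitmq_management plugin plus jq); interpreting the number is still the hard part:

    # Global publish rate as seen by the management plugin
    curl -s -u guest:guest http://localhost:15672/api/overview \
        | jq '.message_stats.publish_details.rate'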
> 
> I guess it's also worth noting that at some point this cloud is going 
> away in favor of RDO cloud.  Of course we said that back in December 
> when we discussed the OVS port exhaustion issue and now 11 months later 
> it still hasn't happened.  That's why I haven't been too inclined to 
> pursue extensive monitoring for the existing cloud though.
> 
>>
>>
>>     I'm not sure exactly what the cause of all this was.  We did get
>>     kind of inundated with jobs yesterday after a zuul restart which I
>>     think is what probably pushed us over the edge, but that has
>>     happened before without bringing the cloud down.  It was probably a
>>     combination of some previously unnoticed issues stacking up over
>>     time and the large number of testenvs requested all at once.
>>
>>     In any case, testenvs are creating successfully again and the jobs
>>     in the queue look good so far.  If you notice any problems please
>>     let me know though.  I'm hoping this will help with the job
>>     timeouts, but that remains to be seen.
>>
>>     -Ben
>>