[openstack-dev] [tripleo] rh1 outage today
Ben Nemec
openstack at nemebean.com
Mon Oct 30 22:21:30 UTC 2017
On 10/30/2017 05:14 PM, Ben Nemec wrote:
> It turns out this wasn't _quite_ resolved yet. I was still seeing some
> excessively long stack creation times today, and it turned out one of our
> compute nodes had virtualization turned off. This caused all of its
> instances to fail and need a retry. Once I disabled the compute service
> on that node, stacks were creating in a normal amount of time again.
>
> This happened because the node had some hardware issues, and apparently
> the fix was to replace the system board, so we got it back with
> everything set to defaults. I fixed the setting, re-enabled the node,
> and all seems well again.
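>
> For the record, pulling the node out of scheduling and putting it back
> was just the usual compute service disable/enable dance; a rough
> sketch, with a placeholder host name rather than the actual rh1 node:
>
>   # stop scheduling new instances onto the broken node
>   openstack compute service set --disable \
>       --disable-reason "virt disabled after board swap" \
>       compute-0.example.com nova-compute
>
>   # once the BIOS setting is fixed, put it back into rotation
>   openstack compute service set --enable compute-0.example.com nova-compute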
>
> On 10/28/2017 02:07 AM, Juan Antonio Osorio wrote:
>> Thanks for the postmortem; it's always a good read to learn stuff :)
>>
>> On 28 Oct 2017 00:11, "Ben Nemec" <openstack at nemebean.com> wrote:
>>
>> Hi all,
>>
>> As you may or may not have noticed, all ovb jobs on rh1 started
>> failing sometime last night. After some investigation today I found
>> a few issues.
>>
>> First, our nova db archiving wasn't working. This was due to the
>> auto-increment counter issue described by melwitt in
>>
>> http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
>>
>> Deleting the problematic rows from the shadow table got us past that.
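>>
>> (For anyone curious, the archiving itself is just the standard
>> nova-manage run; the row count here is arbitrary and the flag
>> spelling varies a bit between releases:
>>
>>   nova-manage db archive_deleted_rows --max_rows 10000
>>
>> The cleanup was then a one-off DELETE against the shadow table.)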
>>
>> On another db-related note, we seem to have turned ceilometer back
>> on in rh1 at some point. I think that was intentional, to avoid
>> notification queues backing up, but it led to a different problem:
>> we had approximately 400 GB of mongodb data from ceilometer that we
>> don't actually care about. I cleaned that up and set a TTL in
>> ceilometer, so hopefully this won't happen again.
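>>
>> (The TTL is just the ceilometer database option; the value here is an
>> example rather than what rh1 is actually set to:
>>
>>   # /etc/ceilometer/ceilometer.conf
>>   [database]
>>   metering_time_to_live = 604800
>>
>> A one-off run of ceilometer-expirer then purges anything already past
>> the TTL.)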
>>
>> Is there an alarm or something we could set to get notified about this
>> kind of stuff? Or better yet, something we could automate to avoid
>> this? What's using mongodb nowadays?
>
> Setting a TTL should avoid this in the future. Note that I don't think
> mongo is used by default anymore, but in our old Mitaka version it was.
>
> For the nova archiving thing I think we'd have to set up email
> notifications for failed cron jobs. That would be a good RFE.
And done: https://bugs.launchpad.net/tripleo/+bug/1728737
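
For what it's worth, the simplest version of that is probably just letting
cron mail us when the archive job fails. A rough sketch, assuming the job
runs from /etc/cron.d and with placeholder paths and addresses:

  # /etc/cron.d/nova-archive (hypothetical)
  MAILTO=rh1-admins@example.com
  # cron mails any output, so suppress normal stdout and only print
  # (and therefore get mail) when the command exits non-zero
  0 * * * * nova nova-manage db archive_deleted_rows --max_rows 10000 >/dev/null || echo "nova db archiving failed on $(hostname)"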
>
>>
>>
>> Unfortunately neither of these things completely resolved the
>> extreme slowness in the cloud that was causing every testenv to
>> fail. After I tried a number of things that made no difference, the
>> culprit turned out to be rabbitmq. There was nothing obviously
>> wrong with it according to the web interface: the queues were all
>> short and messages seemed to be getting delivered. However, when I
>> ran rabbitmqctl status at the CLI, it reported that the node was
>> down. Since something was clearly wrong, I went ahead and restarted
>> it. After that everything seems to be back to normal.
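>>
>> In other words, the web UI claimed everything was healthy but the CLI
>> did not. The check and the "fix" were nothing fancier than roughly the
>> following, assuming a systemd host and the stock service name:
>>
>>   rabbitmqctl status                # this is what reported the node down
>>   systemctl restart rabbitmq-server
>>   rabbitmqctl status                # sanity check afterwards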
>>
>> Same question as above: could we set an alarm or automate the node
>> recovery?
>
> On this one I have no idea. As I noted, when I looked at the rabbit web
> UI everything looked fine. This isn't like the notification queue
> problem, where one look at the queue lengths made it obvious something
> was wrong. Messages were being delivered successfully, just very, very
> slowly. Maybe looking at messages per second would help, but that would
> be hard to automate. You'd have to know whether few messages were going
> through because of performance issues or because the cloud was just
> under light load.
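>
> If someone did want to poke at it, the management plugin does expose
> message rates over HTTP; something along these lines, with placeholder
> credentials and host, would pull the delivery rate:
>
>   curl -s -u guest:guest http://rabbit.example.com:15672/api/overview \
>     | python -m json.tool | grep -A1 deliver_get_details
>
> But as I said, a low number there doesn't tell you whether the cloud is
> struggling or just idle.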
>
> I guess it's also worth noting that at some point this cloud is going
> away in favor of RDO cloud. Of course, we said that back in December
> when we discussed the OVS port exhaustion issue, and now, 11 months
> later, it still hasn't happened. That's why I haven't been too inclined
> to pursue extensive monitoring for the existing cloud, though.
>
>>
>>
>> I'm not sure exactly what the cause of all this was. We did get
>> kind of inundated with jobs yesterday after a zuul restart, which is
>> probably what pushed us over the edge, but that has happened
>> before without bringing the cloud down. It was probably a
>> combination of some previously unnoticed issues stacking up over
>> time and the large number of testenvs requested all at once.
>>
>> In any case, testenvs are creating successfully again and the jobs
>> in the queue look good so far. If you notice any problems, please
>> let me know. I'm hoping this will help with the job timeouts, but
>> that remains to be seen.
>>
>> -Ben