[openstack-dev] [tripleo] rh1 issues post-mortem

Ben Nemec openstack at nemebean.com
Tue Mar 28 22:01:34 UTC 2017


Final (hopefully) update:

All active compute nodes have been rebooted and things seem to be stable 
again.  Jobs are even running a little faster, so I suspect the problem 
was hurting performance too.  I've set a reminder for about two months 
from now to reboot again if we're still using this environment.

On 03/24/2017 12:48 PM, Ben Nemec wrote:
> To follow-up on this, we've continued to hit this issue on other compute
> nodes.  Not surprising, of course.  They've all been up for about the
> same period of time and have had largely even workloads.
>
> It has caused problems, though, because it's cropping up faster than I
> can respond (it takes a few hours to cycle all the instances off a
> compute node, and I need to sleep sometime :-), so I've started
> pre-emptively rebooting compute nodes to get ahead of it.  Hopefully
> I'll be able to get all of the potentially broken nodes at least
> disabled by the end of the day, so we'll have another 3 months before
> we have to worry about this again.
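[Editor's note: the drain process described above is conceptually simple: disable the service so the scheduler stops placing new instances on the node, then cycle the existing instances off.  A rough sketch of that loop in Python; `FakeCloud` and its methods are illustrative stand-ins, not a real OpenStack client.]

```python
# Sketch of draining a compute node: disable scheduling, then remove the
# remaining instances.  FakeCloud is a stand-in for a real client
# (e.g. novaclient); all names here are hypothetical.

class FakeCloud:
    def __init__(self, instances_by_host):
        self.instances = instances_by_host
        self.disabled = set()

    def disable_service(self, host):
        # The scheduler will no longer place new instances on this host.
        self.disabled.add(host)

    def list_instances(self, host):
        return list(self.instances.get(host, []))

    def delete_instance(self, host, name):
        self.instances[host].remove(name)


def drain_compute_node(cloud, host):
    """Disable a compute node, then remove its remaining instances."""
    cloud.disable_service(host)
    remaining = cloud.list_instances(host)
    for name in remaining:
        # In the real environment this was the slow part (hours per
        # node); here we just delete each instance directly.
        cloud.delete_instance(host, name)
    return len(remaining)


cloud = FakeCloud({"compute-3": ["te-broker-1", "ovb-env-2"]})
moved = drain_compute_node(cloud, "compute-3")
```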
>
> On 03/24/2017 11:47 AM, Derek Higgins wrote:
>> On 22 March 2017 at 22:36, Ben Nemec <openstack at nemebean.com> wrote:
>>> Hi all (owl?),
>>>
>>> You may have missed it in all the ci excitement the past couple of
>>> days, but
>>> we had a partial outage of rh1 last night.  It turns out the OVS port
>>> issue
>>> Derek discussed in
>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>
>>> reared its ugly head on a few of our compute nodes, which caused them
>>> to be
>>> unable to spawn new instances.  They kept getting scheduled since it
>>> looked
>>> like they were underutilized, which caused most of our testenvs to fail.
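[Editor's note: the failure mode above is worth spelling out: a naive least-loaded placement strategy keeps picking the broken node precisely because nothing can boot on it, so its instance count never grows.  A toy illustration of that shape, not the real nova scheduler:]

```python
# Toy illustration: a host that can't boot anything stays "empty", so
# naive least-loaded placement keeps choosing it.  This is not the real
# nova scheduler; hosts and counts are made up.

hosts = {"compute-1": 10, "compute-2": 12, "compute-3": 0}  # instance counts
broken = {"compute-3"}  # can't actually spawn instances

failures = 0
for _ in range(5):
    target = min(hosts, key=hosts.get)  # least-loaded host wins
    if target in broken:
        failures += 1        # spawn fails, so its count never grows ...
    else:
        hosts[target] += 1   # ... while healthy hosts keep filling up
```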
>>>
>>> I've rebooted the affected nodes, as well as a few more that looked
>>> likely to run into the same problem in the near future.  Everything has
>>> been working well since sometime this morning (when I disabled the
>>> broken compute nodes), but not many jobs are passing due to the
>>> plethora of other issues we're hitting in CI.  There have been some
>>> stable job passes, though, so I believe things are working again.
>>>
>>> As far as preventing this in the future goes, the right thing to do
>>> would probably be to move to a later release of OpenStack (either a
>>> point or major release) where this problem is hopefully fixed.
>>> However, I'm hesitant to do that for a couple of reasons.  First,
>>> "the devil you know": outside of this issue, we've gotten rh1 pretty
>>> rock solid lately.  It's been overworked, but it has been cranking
>>> away for months with no major cloud-related outages.  Second, an
>>> upgrade would be a major undertaking, probably involving some amount
>>> of downtime.  Since the long-term plan is to move everything to RDO
>>> cloud, I'm not sure that's the best use of our time at this point.
>>
>> +1 on keeping the status quo until moving to rdo-cloud.
>>
>>>
>>> Instead, my plan for the near term is to keep a closer eye on the
>>> error notifications from the services.  We previously haven't had
>>> anything consuming those, but I've dropped a little tool on the
>>> controller that will dump out error notifications so we can watch for
>>> signs of this happening again.  I suspect the signs were there long
>>> before the actual breakage happened, but nobody was looking for them.
>>> Now I will be.
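[Editor's note: the watcher tool itself isn't shown in the thread.  The rough idea, though, is simple: OpenStack services emit JSON notifications on the message bus, and the tool keeps only the ERROR-priority ones.  A sketch of that filtering; the payloads are simplified stand-ins, not real service output.]

```python
# Sketch of what an error-notification watcher does: keep only the
# ERROR-priority notifications from a stream.  The sample payloads are
# hypothetical, loosely shaped like compute service notifications.

sample_notifications = [
    {"priority": "INFO", "event_type": "compute.instance.create.end",
     "payload": {"host": "compute-1", "message": "ok"}},
    {"priority": "ERROR", "event_type": "compute.instance.create.error",
     "payload": {"host": "compute-3",
                 "message": "Failed to plug VIF: OVS port error"}},
]


def error_notifications(notifications):
    """Yield only ERROR-priority notifications, as the watcher would."""
    for note in notifications:
        if note.get("priority") == "ERROR":
            yield note


errors = list(error_notifications(sample_notifications))
for note in errors:
    print(note["event_type"], "-", note["payload"]["message"])
```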
>>>
>>> So that's where things stand with rh1.  Any comments or concerns
>>> welcome.
>>>
>>> Thanks.
>>>
>>> -Ben
>>>
>>> __________________________________________________________________________
>>>
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
>>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
