[openstack-dev] [tripleo] rh1 issues post-mortem

Ben Nemec openstack at nemebean.com
Fri Jun 2 20:42:31 UTC 2017



On 03/28/2017 05:01 PM, Ben Nemec wrote:
> Final (hopefully) update:
>
> All active compute nodes have been rebooted and things seem to be stable
> again.  Jobs are even running a little faster, so I'm thinking the issue
> was having a detrimental effect on performance too.  I've set a reminder
> for about two months from now to reboot again if we're still using this
> environment.

The reminder popped up this week, and I've rebooted all the compute 
nodes again.  It went pretty smoothly, so I doubt anyone noticed that it 
happened (except that I forgot to restart the zuul-status webapp), but 
if you run across any problems, let me know.

>
> On 03/24/2017 12:48 PM, Ben Nemec wrote:
>> To follow up on this, we've continued to hit this issue on other compute
>> nodes.  Not surprising, of course.  They've all been up for about the
>> same period of time and have had largely even workloads.
>>
>> It has caused problems, though, because it is cropping up faster than I
>> can respond (it takes a few hours to cycle all the instances off a
>> compute node, and I need to sleep sometime :-), so I've started
>> pre-emptively rebooting compute nodes to get ahead of it.  Hopefully
>> I'll be able to get all of the potentially broken nodes at least
>> disabled by the end of the day, so we'll have another 3 months before
>> we have to worry about this again.
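>>
>> (One way to do the disable-and-drain step, as a rough sketch with
>> python-novaclient; the host name, credentials, and auth URL below are
>> placeholders, not the real rh1 values:)
>>
>>   from keystoneauth1 import loading, session
>>   from novaclient import client
>>
>>   # Authenticate as an admin user (placeholder credentials).
>>   loader = loading.get_plugin_loader('password')
>>   auth = loader.load_from_options(
>>       auth_url='http://controller:5000/v3', username='admin',
>>       password='secret', project_name='admin',
>>       user_domain_name='Default', project_domain_name='Default')
>>   nova = client.Client('2', session=session.Session(auth=auth))
>>
>>   host = 'compute-0'  # placeholder name for a suspect compute node
>>
>>   # Disable the service so the scheduler stops sending new instances
>>   # to the node, then watch the remaining instances drain off over time.
>>   nova.services.disable(host, 'nova-compute')
>>   remaining = nova.servers.list(
>>       search_opts={'host': host, 'all_tenants': 1})
>>   print('%s still has %d instances' % (host, len(remaining)))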
>>
>> On 03/24/2017 11:47 AM, Derek Higgins wrote:
>>> On 22 March 2017 at 22:36, Ben Nemec <openstack at nemebean.com> wrote:
>>>> Hi all (owl?),
>>>>
>>>> You may have missed it in all the ci excitement the past couple of
>>>> days, but we had a partial outage of rh1 last night.  It turns out
>>>> the OVS port issue Derek discussed in
>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>> reared its ugly head on a few of our compute nodes, which caused them
>>>> to be unable to spawn new instances.  They kept getting scheduled
>>>> since it looked like they were underutilized, which caused most of
>>>> our testenvs to fail.
>>>>
>>>> I've rebooted the affected nodes, as well as a few more that looked
>>>> like they might run into the same problem in the near future.
>>>> Everything looks to be working well again since sometime this morning
>>>> (when I disabled the broken compute nodes), but there aren't many
>>>> jobs passing due to the plethora of other issues we're hitting in ci.
>>>> There have been some stable job passes though, so I believe things
>>>> are working again.
>>>>
>>>> As far as preventing this in the future, the right thing to do would
>>>> probably be to move to a later release of OpenStack (either point or
>>>> major) where hopefully this problem would be fixed.  However, I'm
>>>> hesitant to do that for a few reasons.  First is "the devil you
>>>> know".  Outside of this issue, we've gotten rh1 pretty rock solid
>>>> lately.  It's been overworked, but has been cranking away for months
>>>> with no major cloud-related outages.  Second is that an upgrade would
>>>> be a major process, probably involving some amount of downtime.
>>>> Since the long-term plan is to move everything to RDO cloud, I'm not
>>>> sure that's the best use of our time at this point.
>>>
>>> +1 on keeping the status quo until moving to rdo-cloud.
>>>
>>>>
>>>> Instead, my plan for the near term is to keep a closer eye on the
>>>> error notifications from the services.  We haven't previously had
>>>> anything consuming those, but I've dropped a little tool on the
>>>> controller that will dump out error notifications so we can watch for
>>>> signs of this happening again.  I suspect the signs were there long
>>>> before the actual breakage happened, but nobody was looking for them.
>>>> Now I will be.
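>>>>
>>>> (A rough sketch of what such a consumer can look like with
>>>> oslo.messaging; the transport URL below is a placeholder and the
>>>> services are assumed to publish on the default 'notifications'
>>>> topic:)
>>>>
>>>>   import oslo_messaging
>>>>   from oslo_config import cfg
>>>>
>>>>   class ErrorEndpoint(object):
>>>>       # Invoked for notifications published at the 'error' priority.
>>>>       def error(self, ctxt, publisher_id, event_type, payload,
>>>>                 metadata):
>>>>           print('%s %s: %s' % (publisher_id, event_type, payload))
>>>>
>>>>   transport = oslo_messaging.get_notification_transport(
>>>>       cfg.CONF, url='rabbit://guest:guest@controller:5672/')
>>>>   targets = [oslo_messaging.Target(topic='notifications')]
>>>>   listener = oslo_messaging.get_notification_listener(
>>>>       transport, targets, [ErrorEndpoint()], executor='threading')
>>>>   listener.start()
>>>>   listener.wait()  # block and dump error notifications as they arrive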
>>>>
>>>> So that's where things stand with rh1.  Any comments or concerns
>>>> welcome.
>>>>
>>>> Thanks.
>>>>
>>>> -Ben
>>>>