[openstack-dev] [tripleo] rh1 issues post-mortem
openstack at nemebean.com
Tue Mar 28 22:01:34 UTC 2017
Final (hopefully) update:
All active compute nodes have been rebooted and things seem to be stable
again. Jobs are even running a little faster, so I suspect this issue was
having a detrimental effect on performance too. I've set a reminder for
about two months from now to reboot again if we're still using this
environment.
On 03/24/2017 12:48 PM, Ben Nemec wrote:
> To follow-up on this, we've continued to hit this issue on other compute
> nodes. Not surprising, of course. They've all been up for about the
> same period of time and have had largely even workloads.
> It has caused problems, though, because it is cropping up faster than I
> can respond (it takes a few hours to cycle all the instances off a
> compute node, and I need to sleep sometime :-), so I've started
> pre-emptively rebooting compute nodes to get ahead of it. Hopefully
> I'll be able to get all of the potentially broken nodes at least
> disabled by the end of the day, so we'll have another 3 months before
> we have to worry about this again.
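>
> (One way to cycle the instances off a node, for the curious, is to ask
> nova to live-migrate each guest away. A rough, untested sketch of the
> idea using openstacksdk; the "rh1" clouds.yaml entry and the hostname
> are placeholders, and exact signatures vary between SDK releases:
>
>     import openstack
>
>     # Assumes an "rh1" entry in clouds.yaml with admin credentials.
>     conn = openstack.connect(cloud='rh1')
>     host = 'compute-05.rh1'  # placeholder for the node being drained
>
>     # Listing by host needs admin. With host=None nova's scheduler
>     # picks the destination; block_migration copies the disks for
>     # setups without shared storage.
>     for server in conn.compute.servers(all_projects=True, host=host):
>         conn.compute.live_migrate_server(server, host=None,
>                                          block_migration=True)
>
> Each migration takes a while, which is where the hours per node go.)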
> On 03/24/2017 11:47 AM, Derek Higgins wrote:
>> On 22 March 2017 at 22:36, Ben Nemec <openstack at nemebean.com> wrote:
>>> Hi all (owl?),
>>> You may have missed it in all the ci excitement the past couple of
>>> days, but
>>> we had a partial outage of rh1 last night. It turns out the OVS port
>>> issue Derek discussed previously reared its ugly head on a few of our
>>> compute nodes, which caused them to be unable to spawn new instances.
>>> They kept getting scheduled to since it looked like they were
>>> underutilized, which caused most of our testenvs to fail.
>>> I've rebooted the affected nodes, as well as a few more that looked like
>>> they might run into the same problem in the near future. Everything seems
>>> to be working well again since sometime this morning (when I disabled the
>>> broken compute nodes), but there aren't many jobs passing due to the
>>> plethora of other issues we're hitting in ci. There have been some job
>>> passes though, so I believe things are working again.
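>>>
>>> (Disabling a node here means disabling its nova-compute service record
>>> so the scheduler stops considering the host. As an untested
>>> openstacksdk sketch; the cloud entry and hostname are placeholders and
>>> the disable_service signature has shifted between SDK releases:
>>>
>>>     import openstack
>>>
>>>     conn = openstack.connect(cloud='rh1')  # assumes a clouds.yaml entry
>>>
>>>     # Find the node's nova-compute service and disable it so no new
>>>     # instances are scheduled there; existing guests keep running.
>>>     for svc in conn.compute.services():
>>>         if svc.host == 'compute-05.rh1' and svc.binary == 'nova-compute':
>>>             conn.compute.disable_service(svc, svc.host, svc.binary)
>>>
>>> which takes the host out of scheduling without touching what is
>>> already on it.)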
>>> As far as preventing this in the future, the right thing to do would
>>> probably be to move to a later release of OpenStack (either a point or
>>> major release) where hopefully this problem would be fixed. However, I'm hesitant
>>> to do
>>> that for a few reasons. First is "the devil you know". Outside of this
>>> issue, we've gotten rh1 pretty rock solid lately. It's been
>>> overworked, but
>>> has been cranking away for months with no major cloud-related outages.
>>> Second is that an upgrade would be a major process, probably
>>> involving some
>>> amount of downtime. Since the long-term plan is to move everything
>>> to RDO
>>> cloud I'm not sure that's the best use of our time at this point.
>> +1 on keeping the status quo until moving to rdo-cloud.
>>> Instead, my plan for the near term is to keep a closer eye on the error
>>> notifications from the services. We haven't previously had anything
>>> consuming those, but I've dropped a little tool on the controller that
>>> will dump out error notifications so we can watch for signs of this
>>> happening again. I suspect the signs were there long before the actual
>>> breakage happened, but nobody was looking for them. Now I will be.
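>>>
>>> (For illustration, a minimal version of such a tool could look like
>>> the following; a simplified oslo.messaging sketch rather than the
>>> exact script, the transport URL is a placeholder, and it assumes the
>>> services are configured to emit notifications onto the message bus:
>>>
>>>     import oslo_messaging
>>>     from oslo_config import cfg
>>>
>>>     # Placeholder URL; point this at the cloud's RabbitMQ.
>>>     transport = oslo_messaging.get_notification_transport(
>>>         cfg.CONF, url='rabbit://user:password@controller:5672/')
>>>
>>>     class ErrorDumper(object):
>>>         # The method name matches the notification priority it handles.
>>>         def error(self, ctxt, publisher_id, event_type, payload, metadata):
>>>             print('%s | %s | %s' % (publisher_id, event_type, payload))
>>>
>>>     listener = oslo_messaging.get_notification_listener(
>>>         transport,
>>>         [oslo_messaging.Target(topic='notifications')],
>>>         [ErrorDumper()],
>>>         executor='blocking')
>>>     listener.start()
>>>     listener.wait()
>>>
>>> Anything that shows up there is worth a look before it becomes an
>>> outage.)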
>>> So that's where things stand with rh1. Any comments or concerns are
>>> welcome.