[openstack-dev] [tripleo] rh1 issues post-mortem
Ben Nemec
openstack at nemebean.com
Wed Mar 22 22:36:23 UTC 2017
Hi all (owl?),
You may have missed it in all the CI excitement the past couple of days,
but we had a partial outage of rh1 last night. It turns out the OVS
port issue Derek discussed in
http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
reared its ugly head on a few of our compute nodes, which left them
unable to spawn new instances. The scheduler kept sending instances to
them because they looked underutilized, which caused most of our
testenvs to fail.
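
To make that feedback loop concrete, here is a toy model. It is not Nova's scheduler, and the host names and RAM numbers are made up for illustration. It assumes a simple "pick the host with the most free RAM" placement rule and shows why a node that cannot spawn anything keeps winning: failed spawns consume nothing, so the node always looks like the emptiest one.

```python
# Toy model only: not Nova's scheduler, just the feedback loop in miniature.
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_ram_mb: int
    can_spawn: bool  # False models a node hit by the OVS port issue


def pick_host(hosts):
    """Pick the host with the most free RAM (a spread-style placement)."""
    return max(hosts, key=lambda h: h.free_ram_mb)


def schedule(hosts, requests, ram_mb):
    """Try to place `requests` instances; return (spawned, failed) counts.

    For brevity there is no capacity check, so keep `requests` small.
    """
    spawned = failed = 0
    for _ in range(requests):
        host = pick_host(hosts)
        if host.can_spawn:
            host.free_ram_mb -= ram_mb
            spawned += 1
        else:
            # The spawn fails, so no RAM is consumed and the host still
            # looks like the least loaded one on the next request.
            failed += 1
    return spawned, failed


# Hypothetical hosts: two healthy nodes and one broken, idle-looking node.
with_broken = [
    Host("good-1", 16384, True),
    Host("good-2", 16384, True),
    Host("broken-1", 65536, False),
]
print(schedule(with_broken, requests=10, ram_mb=8192))

# Taking the broken node out of rotation lets requests land on healthy hosts.
healthy_only = [Host("good-1", 16384, True), Host("good-2", 16384, True)]
print(schedule(healthy_only, requests=4, ram_mb=8192))
```

In this sketch every request in the first run goes to the broken host and fails, which is roughly the behavior we saw until the affected nodes were disabled.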
I've rebooted the affected nodes, as well as a few more that looked like
they might run into the same problem in the near future. Everything
looks to be working well again since sometime this morning (when I
disabled the broken compute nodes), but there aren't many jobs passing
due to the plethora of other issues we're hitting in CI. There have
been some stable job passes though, so I believe things are working again.
As far as preventing this in the future, the right thing to do would
probably be to move to a later release of OpenStack (either point or
major) where hopefully this problem would be fixed. However, I'm
hesitant to do that for a few reasons. First is "the devil you know".
Outside of this issue, we've gotten rh1 pretty rock solid lately. It's
been overworked, but has been cranking away for months with no major
cloud-related outages. Second is that an upgrade would be a major
process, probably involving some amount of downtime. Since the
long-term plan is to move everything to RDO cloud I'm not sure that's
the best use of our time at this point.
Instead, my plan for the near term is to keep a closer eye on the error
notifications from the services. Until now we haven't had anything
consuming those, but I've dropped a little tool on the controller that
will dump out error notifications so we can watch for signs of this
happening again. I suspect the signs were there long before the actual
breakage happened, but nobody was looking for them. Now I will be.
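
For anyone curious what watching error notifications could look like, below is a minimal sketch. It is not the tool running on the controller, which isn't shown here. It assumes notifications arrive as already-decoded dicts carrying the usual `priority`, `event_type`, `publisher_id` and `timestamp` fields, and it leaves out the message bus consumer entirely. The function names and the sample data are hypothetical.

```python
# Sketch only: filters already-decoded notification dicts. Hooking this up
# to the message bus is deliberately left out.

ERROR_PRIORITIES = {"ERROR", "CRITICAL"}


def is_error(notification):
    """Return True if a notification looks like an error report."""
    priority = str(notification.get("priority", "")).upper()
    event_type = str(notification.get("event_type", ""))
    # Assumption: error events either carry an error priority or use an
    # event type ending in ".error".
    return priority in ERROR_PRIORITIES or event_type.endswith(".error")


def format_error(notification):
    """Render one notification as a single log-friendly line."""
    return "{} {} {}".format(
        notification.get("timestamp", "?"),
        notification.get("publisher_id", "?"),
        notification.get("event_type", "?"),
    )


def dump_errors(notifications):
    """Return formatted lines for the error notifications only."""
    return [format_error(n) for n in notifications if is_error(n)]


# Made-up sample data for illustration, not real output from rh1.
sample = [
    {
        "priority": "INFO",
        "event_type": "compute.instance.create.end",
        "publisher_id": "compute.node-1",
        "timestamp": "2017-03-22 01:00:00",
    },
    {
        "priority": "ERROR",
        "event_type": "compute.instance.create.error",
        "publisher_id": "compute.node-2",
        "timestamp": "2017-03-22 01:05:00",
    },
]

for line in dump_errors(sample):
    print(line)
```

Grouping the resulting lines by `publisher_id` would be one way to spot a single compute node that starts failing before it takes down the testenvs.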
So that's where things stand with rh1. Any comments or concerns welcome.
Thanks.
-Ben