[openstack-dev] [tripleo] rh1 issues post-mortem
derekh at redhat.com
Fri Mar 24 16:47:58 UTC 2017
On 22 March 2017 at 22:36, Ben Nemec <openstack at nemebean.com> wrote:
> Hi all (owl?),
> You may have missed it in all the CI excitement of the past couple of days, but
> we had a partial outage of rh1 last night. It turns out the OVS port issue
> Derek discussed in
> reared its ugly head on a few of our compute nodes, which caused them to be
> unable to spawn new instances. They kept getting scheduled since it looked
> like they were underutilized, which caused most of our testenvs to fail.
> I've rebooted the affected nodes, as well as a few more that looked like
> they might run into the same problem in the near future. Everything looks
> to be working well again since sometime this morning (when I disabled the
> broken compute nodes), but there aren't many jobs passing due to the
> plethora of other issues we're hitting in CI. There have been some stable
> job passes though so I believe things are working again.
> As far as preventing this in the future, the right thing to do would
> probably be to move to a later release of OpenStack (either point or major)
> where hopefully this problem would be fixed. However, I'm hesitant to do
> that for a few reasons. First is "the devil you know". Outside of this
> issue, we've gotten rh1 pretty rock solid lately. It's been overworked, but
> has been cranking away for months with no major cloud-related outages.
> Second is that an upgrade would be a major process, probably involving some
> amount of downtime. Since the long-term plan is to move everything to RDO
> cloud I'm not sure that's the best use of our time at this point.
+1 on keeping the status quo until the move to RDO cloud.
> Instead, my plan for the near term is to keep a closer eye on the error
> notifications from the services. We previously haven't had anything
> consuming those, but I've dropped a little tool on the controller that will
> dump out error notifications so we can watch for signs of this happening
> again. I suspect the signs were there long before the actual breakage
> happened, but nobody was looking for them. Now I will be.
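For anyone wanting something similar elsewhere, here is a minimal sketch of that kind of watcher. It is not the tool Ben dropped on the controller (that isn't shown in this thread), and it assumes the notifications are already available as one JSON object per line on stdin, carrying the usual "priority", "publisher_id", "event_type" and "timestamp" fields. How they get from the message bus to stdin is left out; the function names are made up for illustration.

```python
import json
import sys
from typing import Iterable, Iterator


def error_notifications(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded notifications whose priority is ERROR or CRITICAL.

    Assumes one JSON-encoded notification per input line (an assumption
    of this sketch, not something stated in the thread).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            note = json.loads(line)
        except ValueError:
            # Not valid JSON; ignore rather than abort the watcher.
            continue
        if not isinstance(note, dict):
            continue
        if str(note.get("priority", "")).upper() in ("ERROR", "CRITICAL"):
            yield note


def summarize(note: dict) -> str:
    """Return a one-line summary: timestamp, publisher, event type."""
    return "{} {} {}".format(
        note.get("timestamp", "?"),
        note.get("publisher_id", "?"),
        note.get("event_type", "?"),
    )


if __name__ == "__main__":
    # Dump a summary line for every error notification read from stdin.
    for note in error_notifications(sys.stdin):
        print(summarize(note))
```

Filtering on priority rather than on specific event types means it should also surface errors nobody thought to look for, which seems to be the point here.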
> So that's where things stand with rh1. Any comments or concerns welcome.