[openstack-dev] [tripleo] rh1 issues post-mortem

Derek Higgins derekh at redhat.com
Fri Mar 24 16:47:58 UTC 2017


On 22 March 2017 at 22:36, Ben Nemec <openstack at nemebean.com> wrote:
> Hi all (owl?),
>
> You may have missed it in all the CI excitement the past couple of days, but
> we had a partial outage of rh1 last night.  It turns out the OVS port issue
> Derek discussed in
> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
> reared its ugly head on a few of our compute nodes, which caused them to be
> unable to spawn new instances.  They kept getting scheduled since it looked
> like they were underutilized, which caused most of our testenvs to fail.
>
> I've rebooted the affected nodes, as well as a few more that looked like
> they might run into the same problem in the near future.  Everything looks
> to be working well again since sometime this morning (when I disabled the
> broken compute nodes), but there aren't many jobs passing due to the
> plethora of other issues we're hitting in CI.  There have been some stable
> job passes, though, so I believe things are working again.
>
> As far as preventing this in the future, the right thing to do would
> probably be to move to a later release of OpenStack (either point or major)
> where hopefully this problem would be fixed.  However, I'm hesitant to do
> that for a few reasons.  First is "the devil you know". Outside of this
> issue, we've gotten rh1 pretty rock solid lately.  It's been overworked, but
> has been cranking away for months with no major cloud-related outages.
> Second is that an upgrade would be a major process, probably involving some
> amount of downtime.  Since the long-term plan is to move everything to RDO
> cloud I'm not sure that's the best use of our time at this point.

+1 on keeping the status quo until the move to RDO cloud.

>
> Instead, my plan for the near term is to keep a closer eye on the error
> notifications from the services.  We previously haven't had anything
> consuming those, but I've dropped a little tool on the controller that will
> dump out error notifications so we can watch for signs of this happening
> again.  I suspect the signs were there long before the actual breakage
> happened, but nobody was looking for them.  Now I will be.
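
For anyone who wants to do something similar, here is a minimal sketch of that kind of filter. It is not the tool Ben put on the controller. It assumes the notifications have already been captured as one JSON object per line, with the usual "priority", "event_type", "publisher_id" and "timestamp" keys of an OpenStack notification; the function name and the input format are made up for illustration.

```python
import json


def error_notifications(lines):
    """Yield a one-line summary for each notification with ERROR priority.

    `lines` is any iterable of strings, one JSON-encoded notification per
    line (an assumed capture format, not something rh1 is known to use).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the capture
        note = json.loads(line)
        if str(note.get("priority", "")).upper() != "ERROR":
            continue  # only error-level notifications are of interest
        yield "{} {} {}".format(
            note.get("timestamp", "-"),
            note.get("publisher_id", "-"),
            note.get("event_type", "-"),
        )


if __name__ == "__main__":
    import sys

    # Example: python dump_errors.py < notifications.jsonl
    for summary in error_notifications(sys.stdin):
        print(summary)
```

Grouping the output by publisher_id would make it easier to spot a single compute node that starts failing repeatedly before it breaks outright.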
>
> So that's where things stand with rh1.  Any comments or concerns welcome.
>
> Thanks.
>
> -Ben


