[openstack-dev] [all] OpenStack races piling up in the gate - please stop approving patches unless they are fixing a race condition

Joe Gordon joe.gordon0 at gmail.com
Thu Jun 5 23:04:50 UTC 2014


On Thu, Jun 5, 2014 at 3:05 PM, Kyle Mestery <mestery at noironetworks.com>
wrote:

> On Thu, Jun 5, 2014 at 7:07 AM, Sean Dague <sean at dague.net> wrote:
> > You may all have noticed things are really backed up in the gate right
> > now, and you would be correct. (Top of gate is about 30 hrs, but if you
> > do the math on ingress / egress rates the gate is probably really double
> > that in transit time right now).
> >
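The ingress/egress arithmetic above is essentially Little's law: end-to-end
transit time is roughly queue depth divided by merge throughput. A minimal
sketch with assumed, illustrative numbers (not actual gate statistics from
this thread):

    # Rough gate transit-time estimate (Little's law: W = L / throughput).
    # Both inputs below are assumptions for illustration only.
    queue_depth = 96         # changes sitting in the gate queue (assumed)
    merges_per_hour = 1.6    # egress rate: changes merging per hour (assumed)
    transit_hours = queue_depth / merges_per_hour
    print(f"estimated transit time: {transit_hours:.0f} hours")  # ~60 hrs, double the 30 hr top-of-gate age
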
> > We've hit another threshold where there are so many small races in
> > the gate that they compound: a fix for one race often fails because a
> > different race kills its job. The whole situation was exacerbated by
> > the fact that, while the transition from HP cloud 1.0 -> 1.1 was
> > happening and we were under capacity, the check queue grew to 500
> > changes, with lots of them being approved.
> >
> > That backlog then flushed into the gate all at once. It also means
> > those jobs passed under a very specific timing profile, which is
> > different on the new HP cloud nodes, and the usual statistical spread
> > of jobs across RAX and HP, which shakes out different races, didn't
> > happen.
> >
> > At this point we could really use help focusing only on recheck bugs.
> > The current list of bugs is here:
> > http://status.openstack.org/elastic-recheck/
> >
> > Also, our categorization rate is only 75%, so there are probably at
> > least two critical bugs we don't even know about yet hiding in the
> > failures.
> > Helping categorize here -
> > http://status.openstack.org/elastic-recheck/data/uncategorized.html
> > would be handy.
> >
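Conceptually, categorizing a failure means matching a failed job's logs
against known error signatures; elastic-recheck does this with per-bug
Elasticsearch queries. A simplified local sketch in Python, using a
hypothetical regex signature and log path rather than the real
elastic-recheck queries:

    import re

    # Hypothetical signature for bug 1323658 (the Neutron ssh timeout).
    # Real signatures live in the elastic-recheck repo as Elasticsearch queries.
    SIGNATURES = {
        "1323658": re.compile(r"SSHTimeout: Connection to the .* via SSH timed out"),
    }

    def categorize(console_log):
        """Return bug numbers whose signature appears in a failed job's console log."""
        return [bug for bug, sig in SIGNATURES.items() if sig.search(console_log)]

    # Usage, assuming a locally downloaded console log from the failed job:
    # with open("console.log") as f:
    #     print(categorize(f.read()) or "uncategorized")
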
> > We're coordinating changes via an etherpad here -
> > https://etherpad.openstack.org/p/gatetriage-june2014
> >
> > If you want to help, jumping into #openstack-infra would be the place
> > to go.
> >
> For the Neutron "ssh timeout" issue [1], we think we know why it has
> spiked recently: this tempest change [2] may have made the situation
> worse. We'd like to propose reverting that change with the review here
> [3]; we can then resubmit it and continue debugging. In the meantime,
> the revert should help relieve the pressure caused by the recent surge
> in this bug.
>
> Does this sound like a workable plan to get things moving again?
>


As we discussed on IRC, yes, and thank you for hunting this one down.



>
> Thanks,
> Kyle
>
> [1] https://bugs.launchpad.net/bugs/1323658
> [2] https://review.openstack.org/#/c/90427/
> [3] https://review.openstack.org/#/c/97245/
>
> >         -Sean
> >
> > --
> > Sean Dague
> > http://dague.net
> >
> >
>

