[openstack-dev] [all] OpenStack races piling up in the gate - please stop approving patches unless they are fixing a race condition

Kevin Benton blak111 at gmail.com
Fri Jun 6 02:50:30 UTC 2014


Oh cool. I didn't realize it was deliberately limited already. I had
assumed it was just hitting the resource limits for that queue.

So it looks like it's around 20 now. However, I would argue that shortening
it further would help get patches through the gate.

For the sake of discussion, let's assume there is an 80% chance of success
in one test run on a patch. So a given patch's overall probability of success
is 0.8^n, where n is the number of times it has to run.

For the 1st patch in the queue, n is just 1.
For the 2nd patch, n is 1 plus the number of reruns caused by a failure of patch 1.
For the 3rd patch, n is 1 plus the number of reruns caused by failures of patches 1 or 2.
For the 4th patch, n is 1 plus the number of reruns caused by failures of patches 1, 2, or 3.
...

Unfortunately, my conditional probability skills are too shaky to trust any
equation I come up with to represent the above scenario, so I wrote a gate
failure simulator [1].

At a queue size of 20 and an 80% success rate, the patch in position 20
only has a ~44% chance of getting merged.
With a queue size of 4, however, the patch in position 4 has a ~71% chance
of getting merged.
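
(For reference, the core of such a simulation might look roughly like the
sketch below. It assumes every test run passes independently with the given
probability and that each failure ahead of a patch costs it one extra run;
the actual paste [1] may model the queue resets differently, so its numbers
won't necessarily match this sketch exactly.)

    import random

    def simulate_gate(queue_size=20, success_rate=0.8, trials=20000):
        """Monte Carlo estimate of each queue position's chance of merging.

        Simplified model: every test run passes independently with
        probability `success_rate`, a patch merges only if all of its runs
        pass, and each failure ahead of a patch in the queue forces it to
        run one extra time.
        """
        merged = [0] * queue_size
        for _ in range(trials):
            failures_ahead = 0
            for position in range(queue_size):
                # One run of its own, plus one rerun per failure ahead of it.
                runs = 1 + failures_ahead
                if all(random.random() < success_rate for _ in range(runs)):
                    merged[position] += 1
                else:
                    failures_ahead += 1
        return [float(count) / trials for count in merged]

    if __name__ == "__main__":
        for position, chance in enumerate(simulate_gate(), start=1):
            print("position %2d: %3.0f%% chance of merging" % (position, chance * 100))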

You can try the simulator out yourself with various numbers. Maybe the odds
of success are much better than 80% in one run and my point is moot, but I
have several patches waiting to be merged that haven't made it through
after ~3 tries each.


Cheers,
Kevin Benton

1. http://paste.openstack.org/show/83039/


On Thu, Jun 5, 2014 at 4:04 PM, Joe Gordon <joe.gordon0 at gmail.com> wrote:

>
>
>
> On Thu, Jun 5, 2014 at 3:29 PM, Kevin Benton <blak111 at gmail.com> wrote:
>
>> Is it possible to make the depth of patches running tests in the gate
>> very shallow during this high-probability-of-failure period? E.g., allow
>> only the top 4 to run tests and put the rest in the 'queued' state.
>> Otherwise, the already elevated probability of a patch failing is
>> exacerbated by the fact that it gets retested every time a patch ahead of
>> it in the queue fails.
>>
> Such a good idea that we already do it.
>
> http://status.openstack.org/zuul/
>
> The grey circles refer to patches that are in the queued state. But this
> only keeps us from hitting resource starvation; it doesn't help us get
> patches through the gate. We haven't been landing many patches this week
> [0].
>
> [0] https://github.com/openstack/openstack/graphs/commit-activity
>
>
>> --
>> Kevin Benton
>>
>>
>> On Thu, Jun 5, 2014 at 5:07 AM, Sean Dague <sean at dague.net> wrote:
>>
>>> You may all have noticed things are really backed up in the gate right
>>> now, and you would be correct. (Top of gate is about 30 hrs, but if you
>>> do the math on ingress / egress rates, the real transit time through the
>>> gate is probably double that right now.)
>>>
>>> We've hit another threshold where there are so many really small races
>>> in the gate that they are compounding, to the point where a fix for one
>>> race often fails because another race kills its job. This whole situation
>>> was exacerbated by the fact that while the transition from HP cloud 1.0 ->
>>> 1.1 was happening and we were under capacity, the check queue grew to
>>> 500 with lots of stuff being approved.
>>>
>>> That flush hit the gate all at once. It also means that those jobs
>>> passed under a very specific timing situation, which is different on the
>>> new HP cloud nodes. And the normal statistical distribution of some jobs
>>> on RAX and some on HP, which shakes out different races, didn't happen.
>>>
>>> At this point we could really use help focusing only on recheck
>>> bugs. The current list of bugs is here:
>>> http://status.openstack.org/elastic-recheck/
>>>
>>> Also, our categorization rate is only 75%, so there are probably at least
>>> 2 critical bugs hiding in the failures that we don't even know about yet.
>>> Helping categorize here -
>>> http://status.openstack.org/elastic-recheck/data/uncategorized.html
>>> would be handy.
>>>
>>> We're coordinating changes via an etherpad here -
>>> https://etherpad.openstack.org/p/gatetriage-june2014
>>>
>>> If you want to help, jumping in #openstack-infra would be the place to
>>> go.
>>>
>>>         -Sean
>>>
>>> --
>>> Sean Dague
>>> http://dague.net
>>>
>>>
>>
>>
>> --
>> Kevin Benton
>>
>


-- 
Kevin Benton