[openstack-dev] Unwedging the gate

Sean Dague sean at dague.net
Tue Nov 26 13:56:28 UTC 2013

On 11/25/2013 09:14 PM, James E. Blair wrote:
> Joe Gordon <joe.gordon0 at gmail.com> writes:
>> On Sun, Nov 24, 2013 at 10:48 PM, Robert Collins
>> <robertc at robertcollins.net> wrote:
>>> On 25 November 2013 19:25, Joe Gordon <joe.gordon0 at gmail.com> wrote:
>>>> On Sun, Nov 24, 2013 at 9:58 PM, Robert Collins <
>>> robertc at robertcollins.net>
>>>> wrote:
>>>>> I have a proposal - I think we should mark all recheck bugs critical,
>>>>> and the respective project PTLs should actively shop around amongst
>>>>> their contributors to get them fixed before other work: we should
>>>>> drive the known set of nondeterministic issues down to 0 and keep it
>>>>> there.
>>>> Yes! In fact we are already working towards that. See
>>> http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html
>>> Indeed I saw that thread - I think I'm proposing something slightly
>>> different, or perhaps 'gate blocking' needs clearing up. Which is -
>>> that once we have sufficient evidence to believe there is a
>>> nondeterministic bug in trunk, whether or not the gate is obviously
>>> suffering, we should consider it critical immediately. I don't think
>>> we need 24h action on such bugs at that stage - gate blocking zomg
>>> issues obviously do though!
>> I see what you're saying. That sounds like a good idea, all gate bugs are
>> critical, but only zomg gate is bad gets 24h action.
> This is fundamentally the same idea -- we're talking about degrees.  And
> I'm afraid that the difference in degree between a "gate bug" and a
> "zomg gate bug" has more to do with the number of changes in the gate
> queue than the bug itself.
> So yeah, my proposal is that nondeterministic bugs that show up in the
> gate should be marked critical, and the expectation is that PTLs should
> help get people assigned to them.
> Nondeterministic bugs that show up in the gate with no one working on
> them are just waiting for a big queue or another nondeterministic bug to
> come along and halt everything.

Or a subtle timing change in a cloud, or one more test which reorders
the order testr runs in, or... lots of things.

Honestly, since elastic recheck came onto the scene at the end of Havana
we've gained a lot of insight into these races: they are really lots of
little races, not a few big ones, and the odds stack up against us with
a long gate. Take, purely as an example, a 40 deep queue with a reset
happening every 10 changes. Every reset re-tests everything still
queued, so we actually end up with 40 + 30 + 20 + 10 = 100 change events
in the gate, or 2.5x the linear path, which makes this thing go
geometric pretty quickly.
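To make the 40 + 30 + 20 + 10 arithmetic concrete, here is a minimal
sketch in Python. It assumes a deliberately simplified model, not the
actual gate scheduler: resets arrive at a fixed interval, and each reset
re-tests every change still in the queue. The function name and its
parameters are made up for illustration.

```python
def gate_test_events(queue_depth, reset_interval):
    """Count change test runs under a simplified fixed-interval reset model.

    Assumption: after every `reset_interval` merged changes the gate resets
    and every change still queued has to be tested again.
    """
    events = 0
    remaining = queue_depth
    while remaining > 0:
        events += remaining          # everything still queued gets (re)tested
        remaining -= reset_interval  # this many changes land before the next reset
    return events


depth, interval = 40, 10
total = gate_test_events(depth, interval)
print(total, total / depth)  # 100 2.5
```

With no resets at all (an interval as long as the queue) the same
function returns 40, which is the linear path the 2.5x is measured
against.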

So I think we need to treat any race as potentially the one that's going
to kill us, because experience has shown we never really know which one
will put us over the edge. And the cyclic nature of these gate halts
demonstrates that a lot of people don't look at these issues until they
actually prevent them from merging code.

There is also another issue: we've turned off some tests in tempest
because they find races at a regular enough rate that they, all by
themselves, do a pretty good job of wedging the gate, and there were
times when we needed to just get things under control (because otherwise
they prevented fixes for some of the other races from getting in).

We really need to declare some times when we're going to flip these back
on intentionally, with people signed up to go after the fallout.
The math on the race conditions shows that we can't actually keep them
from landing, but once they do, and we see them, if we can amplify their
likelihood we can keep them from coming back.
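A rough sketch of that math, in Python, under assumptions of my own:
each test run trips the race independently with a fixed probability,
a change needs a small number of passing runs before it merges, and
"amplifying" is modelled as simply repeating the racy test many times.
The function names and the 1% figure are illustrative, not measured.

```python
def chance_of_landing(fail_rate, runs_before_merge):
    """Probability a change carrying a race passes every run it needs to merge."""
    return (1.0 - fail_rate) ** runs_before_merge


def chance_of_catching(fail_rate, repeats):
    """Probability the race shows up at least once in `repeats` independent runs."""
    return 1.0 - (1.0 - fail_rate) ** repeats


# Hypothetical race that fails 1% of runs, with two passing runs needed to merge.
print(chance_of_landing(0.01, 2))     # close to 0.98: it almost always lands
print(chance_of_catching(0.01, 100))  # roughly 0.63 once the test is run 100 times
```

The point of the sketch is only the shape of the numbers: a rare race
sails through a couple of runs, but becomes hard to miss once the odds
of hitting it are deliberately pushed up.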

It's important to remember that all these races we find with the gate
can and will be tripped over by real people on their OpenStack
deployments. Not all people, not all deployments, but I'm sure these
issues are actually seen out there. Also, these will give us a 2-day-long
merge queue on Icehouse-3. Not might, will. So if people want the
ability to get their code merged during feature freeze, the time to act
is now.


Sean Dague
