[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

Thierry Carrez thierry at openstack.org
Mon Jun 16 08:33:35 UTC 2014

Robert Collins wrote:
> [...]
> C - If we can't make it harder to get races in, perhaps we can make it
> easier to get races out. We have pretty solid emergent statistics from
> every gate job that is run as check. What if we set a policy that when a
> gate queue gets a race:
>  - put a zuul stop all merges and checks on all involved branches
> (prevent further damage, free capacity for validation)
>  - figure out when it surfaced
>  - determine it's not an external event
>  - revert all involved branches back to the point where they looked
> good, as one large operation
>    - run that through jenkins N (e.g. 458) times in parallel.
>    - on success land it
>  - go through all the merges that have been reverted and either
> twiddle them to be back in review with a new patchset against the
> revert to restore their content, or alternatively generate new reviews
> if gerrit would make that too hard.

One of the issues here is that "gate queue gets a race" is not a binary
state. There are always rare issues; you just can't find all the bugs
that happen 0.00001% of the time. You add more such issues, and at some
point they either add up to an unacceptable level, or some other
environmental change suddenly increases the odds of some old rare
issue occurring (think: new test cluster with slightly different
performance characteristics being thrown into our test resources). There
is no single incident that you need to find and fix, and during which
you can clearly escalate to DEFCON 1. You can't even assume that a "gate
situation" was created in the set of commits around when it surfaced.
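
To illustrate how rare issues "add up", here is a rough sketch of the
arithmetic. It assumes each bug fires independently of the others, and
the per-bug rate (0.1%) and the number of jobs per change (20) are
made-up figures chosen for illustration, not measurements from the
actual gate. The function names are hypothetical too.

```python
from math import prod


def job_failure_probability(bug_rates):
    """Chance that at least one rare bug fires in a single job run.

    Assumes each bug triggers independently of the others.
    """
    return 1.0 - prod(1.0 - p for p in bug_rates)


def change_failure_probability(bug_rates, jobs_per_change):
    """Chance that at least one of the jobs run for a change hits a rare bug."""
    p_job_ok = 1.0 - job_failure_probability(bug_rates)
    return 1.0 - p_job_ok ** jobs_per_change


# Hypothetical figures: bugs that each fire in 0.1% of job runs,
# and 20 jobs per change.
for bug_count in (1, 10, 50, 100):
    rates = [0.001] * bug_count
    p = change_failure_probability(rates, jobs_per_change=20)
    print(f"{bug_count:3d} rare bugs -> {p:.1%} of changes fail spuriously")
```

Under these assumptions the spurious failure rate rises smoothly as bugs
accumulate, so there is no obvious single commit at which the gate
"gets a race".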

So IMHO it's a continuous process: keep looking into rare issues all
the time, to keep them below the level where they become a problem.
You can't just have a specific process that kicks in when "the gate
queue gets a race".

Thierry Carrez (ttx)

