[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

Sean Dague sean at dague.net
Mon Jun 16 10:33:04 UTC 2014


On 06/16/2014 04:33 AM, Thierry Carrez wrote:
> Robert Collins wrote:
>> [...]
>> C - If we can't make it harder to get races in, perhaps we can make it
>> easier to get races out. We have pretty solid emergent statistics from
>> every gate job that is run as check. What if we set a policy that when a
>> gate queue gets a race:
>>  - have zuul stop all merges and checks on all involved branches
>> (prevent further damage, free capacity for validation)
>>  - figure out when it surfaced
>>  - determine it's not an external event
>>  - revert all involved branches back to the point where they looked
>> good, as one large operation
>>    - run that through jenkins N (e.g. 458) times in parallel.
>>    - on success land it
>>  - go through all the merges that have been reverted and either
>> twiddle them to be back in review with a new patchset against the
>> revert to restore their content, or alternatively generate new reviews
>> if gerrit would make that too hard.
> 
> One of the issues here is that "gate queue gets a race" is not a binary
> state. There are always rare issues, you just can't find all the bugs
> that happen 0.00001% of the time. You add more such issues, and at some
> point they either add up to an unacceptable level, or some other
> environmental situation suddenly increases the odds of some old rare
> issue happening (think: new test cluster with slightly different
> performance characteristics being thrown into our test resources). There
> is no single incident you need to find and fix, and during which you can
> clearly escalate to DEFCON 1. You can't even assume that a "gate
> situation" was created in the set of commits around when it surfaced.
> 
> So IMHO it's a continuous process: keep looking into rare issues all
> the time, to maintain them under the level where they become a problem.
> You can't just have a specific process that kicks in when "the gate
> queue gets a race".

Definitely agree. I also think part of the issue is we get emergent
behavior once we tip past some cumulative failure rate. Much of that
emergent behavior we are coming to understand over time. We've made
corrections like clean check and the sliding gate window to dampen it.
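
As a rough illustration of that tipping point (a sketch, not a model of
the real gate): assume each race fails a run independently with a small
fixed probability, and that a gate queue of a given depth only merges
cleanly if every change in it passes. The function names, the 0.5% rate
and the depth of 20 are all made up for the example.

```python
def run_pass_probability(race_rates):
    """Probability that a single test run hits none of the given races.

    Assumes the races are independent, which real gate failures may not be.
    """
    p = 1.0
    for rate in race_rates:
        p *= 1.0 - rate
    return p


def no_reset_probability(run_pass, queue_depth):
    """Probability that queue_depth changes in a row all pass (no reset)."""
    return run_pass ** queue_depth


if __name__ == "__main__":
    # Hypothetical numbers: every race fires in 0.5% of runs, queue depth 20.
    for n_races in (5, 20, 40):
        per_run = run_pass_probability([0.005] * n_races)
        no_reset = no_reset_probability(per_run, 20)
        print(n_races, round(per_run, 3), round(no_reset, 3))
```

Under these assumptions the per-run pass rate degrades slowly as races
accumulate, while the chance of a deep queue getting through without a
reset falls much faster.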

It's also that a new issue tends to take 12 hours to see and figure out
if it's a ZOMG issue, and 3-5 days to see if it's any lower level of
severity. And given that we merge 50-100 patches a day, across 40
projects, across branches, the rollback would be... 'interesting'.
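
Back-of-the-envelope arithmetic for the size of such a rollback, using
only the figures above (the helper name is made up, and it assumes every
patch merged during the detection window would have to be reverted):

```python
def rollback_size(merges_per_day, days_to_notice):
    """Return (low, high) patch counts merged before a race is noticed.

    Both arguments are (low, high) ranges. Assumes every patch merged in
    the detection window is part of the revert.
    """
    low = merges_per_day[0] * days_to_notice[0]
    high = merges_per_day[1] * days_to_notice[1]
    return low, high


if __name__ == "__main__":
    # ZOMG issue noticed in about 12 hours (0.5 days).
    print(rollback_size((50, 100), (0.5, 0.5)))
    # Lower-severity issue noticed in 3-5 days.
    print(rollback_size((50, 100), (3, 5)))
```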

	-Sean.

-- 
Sean Dague
http://dague.net
