<p dir="ltr"><br>

On 16 Jun 2014 20:33, "Thierry Carrez" &lt;<a href="mailto:thierry@openstack.org">thierry@openstack.org</a>&gt; wrote:<br>

><br>

> Robert Collins wrote:<br>

> > [...]<br>

> > C - If we can't make it harder to get races in, perhaps we can make it<br>

> > easier to get races out. We have pretty solid emergent statistics from<br>

> > every gate job that is run as check. What if we set a policy that when a<br>

> > gate queue gets a race:<br>

> >  - put a zuul stop all merges and checks on all involved branches<br>

> > (prevent further damage, free capacity for validation)<br>

> >  - figure out when it surfaced<br>

> >  - determine it's not an external event<br>

> >  - revert all involved branches back to the point where they looked<br>

> > good, as one large operation<br>

> >    - run that through jenkins N (e.g. 458) times in parallel.<br>

> >    - on success land it<br>

> >  - go through all the merges that have been reverted and either<br>

> > twiddle them to be back in review with a new patchset against the<br>

> > revert to restore their content, or alternatively generate new reviews<br>

> > if gerrit would make that too hard.<br>

><br>

> One of the issues here is that "gate queue gets a race" is not a binary<br>

> state. There are always rare issues, you just can't find all the bugs<br>

> that happen 0.00001% of the time. You add more such issues, and at some<br>

> point they either add up to an unacceptable level, or some other<br>

> environmental situation suddenly increases the odds of some old rare<br>

> issue to happen (think: new test cluster with slightly different<br>

> performance characteristics being thrown into our test resources). There<br>

> is no single incident you need to find and fix, and during which you can<br>

> clearly escalate to defCon 1. You can't even assume that a "gate<br>

> situation" was created in the set of commits around when it surfaced.<br>

><br>

> So IMHO it's a continuous process: keep looking into rare issues all<br>

> the time, to maintain them under the level where they become a problem.<br>

> You can't just have a specific process that kicks in when "the gate<br>

> queue gets a race".</p>

<p dir="ltr">You seem to be drawing different conclusions here, but we both share the same model of the emergent behaviour. Nowhere in my mail did I suggest ignoring issues until we hit DEFCON 1. I suggested that what we are doing is not working, and put forward a model to explain why it's not working, one which to me seems to fit the evidence. Finally, I suggested a few different things which might help.</p>


<p dir="ltr">As for the specific scenario you raise that might not fit: adding a test cluster is a change to our test config, and certainly something we could revert. That's the benefit of configuration as code.</p>
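
<p dir="ltr">To put rough numbers on both halves of this (rare issues adding up, and running a candidate revert N times in parallel), here is a minimal sketch. It assumes failures are independent between issues and between runs, which real gate jobs only approximate, and the function names and failure rates are hypothetical, not taken from our gate statistics.</p>

```python
import math


def combined_failure_rate(rates):
    """Chance that at least one of several independent rare issues
    fails a single run. `rates` holds per-run failure probabilities."""
    all_pass = 1.0
    for p in rates:
        all_pass *= 1.0 - p
    return 1.0 - all_pass


def detection_probability(p, n):
    """Chance that a race with per-run failure probability p shows up
    at least once in n independent runs."""
    return 1.0 - (1.0 - p) ** n


def runs_needed(p, confidence):
    """Smallest n for which detection_probability(p, n) >= confidence."""
    if not (0.0 < p < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    # Hypothetical figures: 100 issues that each fail 0.1% of runs.
    print(combined_failure_rate([0.001] * 100))
    # Hypothetical race that fails 1% of runs, checked with 458 runs.
    print(detection_probability(0.01, 458))
    print(runs_needed(0.01, 0.95))
```

<p dir="ltr">Under those assumptions, a clean result from N parallel runs only bounds the races that fail often enough to be likely to appear in N tries. The very rare ones pass either way, which is why the choice of N matters.</p>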