<p dir="ltr"><br>

On 16 Jun 2014 22:33, "Sean Dague" <<a href="mailto:sean@dague.net">sean@dague.net</a>> wrote:<br>

><br>

> On 06/16/2014 04:33 AM, Thierry Carrez wrote:<br>

> > Robert Collins wrote:<br>

> >> [...]<br>

> >> C - If we can't make it harder to get races in, perhaps we can make it<br>

> >> easier to get races out. We have pretty solid emergent statistics from<br>

> >> every gate job that is run as check. What if we set a policy that when a<br>

> >> gate queue gets a race:<br>

> >>  - put a zuul stop all merges and checks on all involved branches<br>

> >> (prevent further damage, free capacity for validation)<br>

> >>  - figure out when it surfaced<br>

> >>  - determine it's not an external event<br>

> >>  - revert all involved branches back to the point where they looked<br>

> >> good, as one large operation<br>

> >>    - run that through jenkins N (e.g. 458) times in parallel.<br>

> >>    - on success land it<br>

> >>  - go through all the merges that have been reverted and either<br>

> >> twiddle them to be back in review with a new patchset against the<br>

> >> revert to restore their content, or alternatively generate new reviews<br>

> >> if gerrit would make that too hard.<br>

> ><br>

> > One of the issues here is that "gate queue gets a race" is not a binary<br>

> > state. There are always rare issues, you just can't find all the bugs<br>

> > that happen 0.00001% of the time. You add more such issues, and at some<br>

> > point they either add up to an unacceptable level, or some other<br>

> > environmental situation suddenly increases the odds of some old rare<br>

> > issue to happen (think: new test cluster with slightly different<br>

> > performance characteristics being thrown into our test resources). There<br>

> > is no single incident you need to find and fix, and during which you can<br>

> > clearly escalate to defCon 1. You can't even assume that a "gate<br>

> > situation" was created in the set of commits around when it surfaced.<br>

> ><br>

> > So IMHO it's a continuous process : keep looking into rare issues all<br>

> > the time, to maintain them under the level where they become a problem.<br>

> > You can't just have a specific process that kicks in when "the gate<br>

> > queue gets a race".<br>

><br>

> Definitely agree. I also think part of the issue is we get emergent<br>

> behavior once we tip past some cumulative failure rate. Much of that<br>

> emergent behavior we are coming to understand over time. We've done<br>

> corrections like clean check and sliding gate window to impact them.<br>

><br>

> It's also that a new issue tends to take 12 hrs to see and figure out if<br>

> it's a ZOMG issue, and 3 - 5 days to see if it's any lower level of<br>

> severity. And given that we merge 50 - 100 patches a day, across 40<br>

> projects, across branches, the rollback would be .... 'interesting'.</p>

<p dir="ltr">So ZOMG issues show up within about 50 runs, and lower-severity issues somewhere between 150 and 500 test runs. That fits my model pretty well for the ballpark failure rate and margin I was using. That is, it sounds like the model isn't too far out from reality.</p>
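<p dir="ltr">As a rough sketch of that arithmetic (assuming, purely for illustration, one gate run per merged patch and independent failures; the function names here are hypothetical and not part of any gate tooling), Sean's merge rates and time windows translate into run counts like this, and a simple geometric model gives the number of runs needed to see a failure of a given rate at least once:</p>

```python
import math


def runs_in_window(patches_per_day: float, days: float) -> float:
    """Approximate gate runs in a time window, assuming one run per merged patch."""
    return patches_per_day * days


def runs_to_observe(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that a failure with the given per-run rate is seen
    at least once with the given confidence: 1 - (1 - p)**n >= confidence.
    Assumes runs fail independently, which real gate runs may not."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# 12 hours at 50 to 100 merges a day: 25 to 50 runs
zomg_window = (runs_in_window(50, 0.5), runs_in_window(100, 0.5))

# 3 to 5 days at 50 to 100 merges a day: 150 to 500 runs
lower_window = (runs_in_window(50, 3), runs_in_window(100, 5))

if __name__ == "__main__":
    print("ZOMG window (runs):", zomg_window)
    print("Lower-severity window (runs):", lower_window)
    for rate in (0.05, 0.01):
        print(f"rate {rate}: about {runs_to_observe(rate)} runs for 95% confidence")
```

<p dir="ltr">Under those assumptions, a race that fails a few percent of runs becomes visible inside the 12 hour window, while one nearer 1% needs a few hundred runs, which lines up with the 3 to 5 day figure.</p>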


<p dir="ltr">Yes, a revert would be hard... But what do you think of the model? Is it wrong? It implies several different points where we can try to fix things, and I would love to know what folk think of the other possibilities I've raised, or to hear some new ones.</p>


<p dir="ltr">-Rob<br></p>