[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

Robert Collins robertc at robertcollins.net
Mon Jun 16 18:04:44 UTC 2014


On 16 Jun 2014 22:33, "Sean Dague" <sean at dague.net> wrote:
>
> On 06/16/2014 04:33 AM, Thierry Carrez wrote:
> > Robert Collins wrote:
> >> [...]
> >> C - If we can't make it harder to get races in, perhaps we can make it
> >> easier to get races out. We have pretty solid emergent statistics from
> >> every gate job that is run as check. What if we set a policy that when
> >> a gate queue gets a race:
> >>  - have zuul stop all merges and checks on all involved branches
> >> (prevent further damage, free capacity for validation)
> >>  - figure out when it surfaced
> >>  - determine it's not an external event
> >>  - revert all involved branches back to the point where they looked
> >> good, as one large operation
> >>    - run that through jenkins N (e.g. 458) times in parallel.
> >>    - on success land it
> >>  - go through all the merges that have been reverted and either
> >> twiddle them to be back in review with a new patchset against the
> >> revert to restore their content, or alternatively generate new reviews
> >> if gerrit would make that too hard.
> >
> > One of the issues here is that "gate queue gets a race" is not a binary
> > state. There are always rare issues, you just can't find all the bugs
> > that happen 0.00001% of the time. You add more such issues, and at some
> > point they either add up to an unacceptable level, or some other
> > environmental situation suddenly increases the odds of some old rare
> > issue to happen (think: new test cluster with slightly different
> > performance characteristics being thrown into our test resources). There
> > is no single incident you need to find and fix, and during which you can
> > clearly escalate to defCon 1. You can't even assume that a "gate
> > situation" was created in the set of commits around when it surfaced.
> >
> > So IMHO it's a continuous process : keep looking into rare issues all
> > the time, to maintain them under the level where they become a problem.
> > You can't just have a specific process that kicks in when "the gate
> > queue gets a race".
>
> Definitely agree. I also think part of the issue is we get emergent
> behavior once we tip past some cumulative failure rate. Much of that
> emergent behavior we are coming to understand over time. We've done
> corrections like clean check and sliding gate window to impact them.
>
> It's also that a new issue tends to take 12 hrs to see and figure out if
> it's a ZOMG issue, and 3 - 5 days to see if it's any lower level of
> severity. And given that we merge 50 - 100 patches a day, across 40
> projects, across branches, the rollback would be .... 'interesting'.
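The "adding up" effect Thierry describes is simple arithmetic: independent rare issues compound into the overall per-run failure rate. As a back-of-the-envelope illustration (the rates below are made up for the example, not measured from our gate), assuming each issue fires independently per run:

```python
def gate_pass_rate(per_bug_rates):
    """Overall probability a single gate run passes, given a list of
    independent per-run failure rates, one per latent rare issue."""
    rate = 1.0
    for p in per_bug_rates:
        rate *= (1.0 - p)
    return rate

# Ten "rare" 0.5% issues already push the combined per-run
# failure rate to roughly 5% -- no single one of them is the
# incident, which is Thierry's point.
combined_failure = 1.0 - gate_pass_rate([0.005] * 10)
print(round(combined_failure, 3))  # -> 0.049
```

So there is no threshold crossed by one bad commit; the level just drifts up as individually-acceptable issues accumulate.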

So ZOMG issues surface within ~50 runs, and lower-severity issues between
150 and 500 test runs. That fits my model pretty well for the ballpark
failure rate and margin I was using. That is, it sounds like the model
isn't too far out from reality.
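For concreteness, here's the rough arithmetic relating a bug's per-run failure rate to how many runs it takes to surface (assuming independent runs; the specific rates are illustrative, not measured):

```python
import math

def detect_prob(p, n):
    """Probability that a bug with per-run failure rate p shows at
    least one failure within n independent test runs."""
    return 1.0 - (1.0 - p) ** n

def runs_to_detect(p, confidence):
    """Number of runs needed to see the bug at least once with the
    given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

# A severe race (say 5% per run) is very likely visible inside
# ~50 runs -- roughly the 12-hour "ZOMG" window above.
print(round(detect_prob(0.05, 50), 3))   # -> 0.923

# A rarer one (0.5% per run) needs a few hundred runs to spot
# with 90% confidence -- the 3-5 day window.
print(runs_to_detect(0.005, 0.9))        # -> 460
```

Which is why detection latency, not fix latency, dominates: the rarer the race, the longer the statistics take to accumulate.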

Yes, revert would be hard... But what do you think of the model? Is it
wrong? It implies several different points where we can try to fix things,
and I would love to know what folks think of the other possibilities I've
raised, or to hear some new ones.

-Rob
