[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

Robert Collins robertc at robertcollins.net
Mon Jun 16 17:54:49 UTC 2014


On 16 Jun 2014 20:33, "Thierry Carrez" <thierry at openstack.org> wrote:
>
> Robert Collins wrote:
> > [...]
> > C - If we can't make it harder to get races in, perhaps we can make it
> > easier to get races out. We have pretty solid emergent statistics from
> > every gate job that is run as check. What if we set a policy that when a
> > gate queue gets a race:
> >  - put a zuul stop all merges and checks on all involved branches
> > (prevent further damage, free capacity for validation)
> >  - figure out when it surfaced
> >  - determine it's not an external event
> >  - revert all involved branches back to the point where they looked
> > good, as one large operation
> >    - run that through jenkins N (e.g. 458) times in parallel.
> >    - on success land it
> >  - go through all the merges that have been reverted and either
> > twiddle them to be back in review with a new patchset against the
> > revert to restore their content, or alternatively generate new reviews
> > if gerrit would make that too hard.
>
> One of the issues here is that "gate queue gets a race" is not a binary
> state. There are always rare issues, you just can't find all the bugs
> that happen 0.00001% of the time. You add more such issues, and at some
> point they either add up to an unacceptable level, or some other
> environmental situation suddenly increases the odds of some old rare
> issue to happen (think: new test cluster with slightly different
> performance characteristics being thrown into our test resources). There
> is no single incident you need to find and fix, and during which you can
> clearly escalate to DEFCON 1. You can't even assume that a "gate
> situation" was created in the set of commits around when it surfaced.
>
> So IMHO it's a continuous process : keep looking into rare issues all
> the time, to maintain them under the level where they become a problem.
> You can't just have a specific process that kicks in when "the gate
> queue gets a race".

You seem to be drawing different conclusions here, but we both share the
same model of the emergent behaviour. Nowhere in my mail did I suggest
ignoring issues until we hit DEFCON 1. I suggested that what we are doing
is not working, and put forward a model to explain why it's not working,
one which to me seems to fit the evidence. Finally, I suggested a few
different things which might help.

As for the specific scenario you raise as one that might not fit: adding a
test cluster is a change to our test config, and certainly something we
could revert. That's the benefit of configuration as code.
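
To put rough numbers on both halves of this discussion, here is a minimal
sketch. It assumes every race fails independently with a fixed per-run
probability, which is a simplification (real gate failures are often
correlated with load and environment). The function names are hypothetical
and are not part of zuul or jenkins. The first function shows how many rare
issues add up to a noticeable failure rate; the second estimates how many
parallel runs of a reverted tree would be needed to see a race of a given
rate at least once.

```python
import math


def combined_failure_rate(rates):
    """Probability that at least one of several independent races
    fires during a single gate run."""
    p_all_pass = 1.0
    for p in rates:
        p_all_pass *= 1.0 - p
    return 1.0 - p_all_pass


def runs_needed(p, confidence=0.95):
    """Smallest N such that a race with per-run probability p shows up
    at least once in N independent runs with the given confidence."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p) ** N <= 1 - confidence for N.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    # 100 separate issues that each fail 0.1% of the time: roughly 9.5%.
    print(combined_failure_rate([0.001] * 100))
    # Runs needed to catch a 1% race with 95% confidence: 299.
    print(runs_needed(0.01, 0.95))
```

Under these assumptions, a clean result from N runs only bounds the
failure rate; it does not prove that the reverted tree is race free.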