[openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

Joe Gordon joe.gordon0 at gmail.com
Mon Aug 19 02:09:28 UTC 2013


On Sat, Aug 17, 2013 at 9:46 AM, Robert Collins
<robertc at robertcollins.net> wrote:

> On 17 August 2013 23:49, Salvatore Orlando <sorlando at nicira.com> wrote:
> > I tend to agree that when the gate for a project is broken, nothing should
> > be merged for that project until the gate jobs are green again.
> > In the case of Neutron, making the job non-voting only caused more bugs to
> > slip through, and that meant more work for the developers themselves, and
> > more headaches for developers of other projects relying on it.
>
>
>
> > When dealing with intermittent failures, like the bug which probably started
> > the issues we've been witnessing in the past 3 weeks, I think it might be a
> > sensible idea to make the job non-voting only for projects which surely
> > can't be the cause of the gate failure; or perhaps skip the offending test
> > only.
> >
> > This means, however, asymmetrical gating, and from Monty's post it seems
> > there's something quite wrong with it. However, due to my lack of expertise
> > on the subject, I am unable to see the issue with it.
> >
> > Salvatore
>
> The asymmetry we should fear is when project A can land something
> which will break project B. In this case the proposal is to
> say 'B is broken already, permit A to land things without remorse
> until B is unbroken'.
>
> The problem is, if A makes the breakage of B worse, B ends up in
> catchup mode, which is most unfun.
>
> Concretely, take heat for A and neutron for B. Tempest d-g jobs start
> failing in neutron, so they are made skips. Now heat could make
> neutron tests in tempest worse, and we won't know - or if we do know,
> they'll still land.
>
> Previous discussion here has endorsed 'revert problematic commits,
> it's not blame on the developer, just do it', so I'm not going to
> mention that.
>
> What I will suggest we do is start running some number - let's say 20 -
> of midnight state jobs, all identical. Ignoring datetime-sensitive
> tests, which are fortunately rare, this should identify tests that
> fail 5% of the time, independent of incoming commits. We can use this
> to generate a baseline reference for which tests fail intermittently
> in trunk, and when something breaks intermittently outside of that
> set, we can be pretty *sure* it's in the last day's commits.
>

+1, although we already have a manual, vaguely similar version of this
(http://status.openstack.org/rechecks/)
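
To put rough numbers on the "20 identical jobs" idea, here is a minimal
sketch of the arithmetic. It assumes every run fails independently with the
same probability, which real gate jobs only approximate. The function names
and the example test names are made up for illustration; nothing here is
existing infra code.

```python
import math
from typing import Set


def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance that a test failing at `failure_rate` fails at least once
    in `runs` identical, independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of independent runs that shows at least one failure
    with probability >= `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


def new_intermittent_failures(baseline: Set[str], observed: Set[str]) -> Set[str]:
    """Tests that failed in the latest batch but are not in the baseline of
    known-intermittent tests, i.e. suspects for the last day's commits."""
    return observed - baseline


if __name__ == "__main__":
    # A 5% flaky test shows up in a batch of 20 runs roughly 64% of the time.
    print(round(detection_probability(0.05, 20), 2))
    # Runs needed to see that same test fail at least once with 95% confidence.
    print(runs_needed(0.05, 0.95))
    # Hypothetical test names, purely for illustration.
    known = {"test_a", "test_b"}
    tonight = {"test_b", "test_c"}
    print(new_intermittent_failures(known, tonight))
```

So one night of 20 runs would catch a 5% failure about two times in three;
accumulating a few nights of results should make the baseline fairly solid.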


>
> Secondly, in principle it should be straightforward to do this for
> any point in time, so when a new problem shows its head, we can start
> up a bisection programmatically - independent of the dev analysis - to
> find where it was introduced. If we have resources we could even do
> N-section rather than bisection.


+1
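
For what it's worth, a bisection over an intermittent failure could look
roughly like the sketch below. It is only an illustration under assumptions:
`run_job` is a hypothetical callable that runs the job against one commit and
returns True on success, the commit list is ordered oldest to newest, the
first commit is known good and the last one known bad.

```python
from typing import Callable, Sequence


def is_bad(commit: str, run_job: Callable[[str], bool], runs: int) -> bool:
    """Treat a commit as bad if any of `runs` identical jobs fails on it.
    any() stops at the first failure, so bad commits are cheap to confirm."""
    return any(not run_job(commit) for _ in range(runs))


def bisect_first_bad(
    commits: Sequence[str],
    run_job: Callable[[str], bool],
    runs: int = 20,
) -> str:
    """Return the first commit where the failure shows up.

    Assumes commits[0] is good and commits[-1] is bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid], run_job, runs):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

The catch with intermittent failures is that a "good" verdict is only
probabilistic: if all the runs happen to pass on a commit that is actually
bad, the search will point at a later commit. More runs per step reduce that
risk. N-section would test several interior commits per round in parallel
instead of just the midpoint.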


>


> Killing all intermittent issues in test suites is /hard/, so I think we
> need to have a belt-and-braces approach and engineer a rapid response
> system to spikes in intermittent failures, in addition to working on
> the failures themselves.


> -Rob
> --
> Robert Collins <rbtcollins at hp.com>
> Distinguished Technologist
> HP Converged Cloud
>