[openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

Robert Collins robertc at robertcollins.net
Sat Aug 17 13:46:48 UTC 2013

On 17 August 2013 23:49, Salvatore Orlando <sorlando at nicira.com> wrote:
> I tend to agree that when the gate for a project is broken, nothing should
> be merged for that project until the gate jobs are green again.
> In the case of Neutron, making the job non voting only caused more bugs to
> slip through, and that meant more work for the developers themselves, and
> more headaches for developers of other projects relying on it.

> When dealing with intermittent failures, like the bug which probably started
> the issues we've been witnessing in the past 3 weeks, I think it might be a
> sensible idea to make the job non-voting only for projects which surely
> can't be the cause of the gate failure; or perhaps skip the offending test
> only.
> This means, however, asymmetrical gating, and from Monty's post it seems
> there's something quite wrong with it. However, due to my lack of expertise
> on the subject, I am unable to see the issue with it.
> Salvatore

The asymmetry we should fear is when project A can land something
which will break project B. In this case the proposal is to
say 'B is broken already, permit A to land things without remorse
until B is unbroken'.

The problem is, if A makes the breakage of B worse, B ends up in
catchup mode, which is most unfun.

Concretely, take heat for A and neutron for B. Tempest d-g jobs start
failing in neutron, so they are made skips. Now heat could make
neutron tests in tempest worse, and we won't know - or if we do know,
they'll still land.

Previous discussion here has endorsed 'revert problematic commits,
it's not blame on the developer, just do it', so I'm not going to
mention that.

What I will suggest we do is start running some number - let's say 20 -
of midnight state jobs, all identical. Ignoring datetime-sensitive
tests, which are fortunately rare, this should identify tests that
fail about 5% of the time or more, independent of incoming commits. We can use this
to generate a baseline reference for which tests fail intermittently
in trunk, and when something breaks intermittently outside of that
set, we can be pretty *sure* it's in the last day's commits.
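
As a rough illustration of the baseline idea, here is a minimal Python
sketch. The function names and the shape of the input (one set of failed
test names per identical run) are assumptions made for this example, not
part of any existing gate tooling, and the probability helper assumes each
run fails independently at a fixed rate.

```python
from typing import Iterable, Set


def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance that a test failing independently at `failure_rate`
    fails at least once across `runs` identical runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def intermittent_baseline(run_failures: Iterable[Set[str]]) -> Set[str]:
    """Tests that failed in some, but not all, runs of identical code.

    `run_failures` holds one set of failed test names per run
    (hypothetical input format).
    """
    runs = list(run_failures)
    if not runs:
        return set()
    ever_failed = set().union(*runs)
    always_failed = set.intersection(*runs)
    # A test that fails every time is plainly broken, not intermittent.
    return ever_failed - always_failed


def new_suspects(failed_today: Set[str], baseline: Set[str]) -> Set[str]:
    """Failures outside the known-intermittent baseline."""
    return failed_today - baseline
```

Under the independence assumption, a test that fails 5% of the time shows
up at least once in 20 runs with probability of roughly 64%, so 20 runs
catch such tests more often than not, but not reliably; around 59 runs
would be needed for a 95% chance.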

Secondly, in principle it should be straightforward to do this for
any point in time, so when a new problem shows its head, we can start
up a bisection programmatically - independent of the dev analysis - to
find where it was introduced. If we have resources we could even do
N-section rather than bisection.
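
To make the bisection / N-section idea concrete, here is a hedged Python
sketch. It assumes commits are ordered oldest first, that the last commit
is known bad, and that once the problem is introduced it stays present.
`is_bad` is a hypothetical callback; for an intermittent failure it would
have to run the job many times, and a "good" answer is then only probable,
not certain. The probes inside one round are independent of each other, so
with enough resources they could run in parallel; here they run
sequentially for simplicity.

```python
from typing import Callable, Sequence


def n_section(commits: Sequence[str],
              is_bad: Callable[[str], bool],
              n: int = 2) -> str:
    """Return the first bad commit, splitting the range into `n` parts
    per round (n=2 is ordinary bisection)."""
    if not commits or n < 2:
        raise ValueError("need at least one commit and n >= 2")
    lo, hi = 0, len(commits) - 1  # first bad commit lies in [lo, hi]
    while lo < hi:
        span = hi - lo
        # Up to n-1 distinct probe points, all in [lo, hi - 1].
        probes = sorted({lo + (span * i) // n for i in range(1, n)})
        results = [is_bad(commits[p]) for p in probes]
        for probe, bad in zip(probes, results):
            if bad:
                hi = probe      # first bad commit is at or before this probe
                break
            lo = probe + 1      # everything up to this probe is good
    return commits[lo]
```

For example, with ten commits `c0` .. `c9` where the problem first appears
in `c6`, both `n=2` and `n=4` converge on `c6`; the larger `n` needs fewer
rounds at the cost of more jobs per round.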

Killing all intermittent issues in test suites is /hard/, so I think we
need to have a belt-and-braces approach and engineer a rapid response
system to spikes in intermittent failures, in addition to working on
the failures themselves.

Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud
