[openstack-dev] [gate] The gate: a failure analysis

Sean Dague sean at dague.net
Tue Jul 22 20:53:45 UTC 2014


On 07/22/2014 11:51 AM, Jay Pipes wrote:
> On 07/22/2014 10:48 AM, Chris Friesen wrote:
>> On 07/21/2014 12:03 PM, Clint Byrum wrote:
>>> Thanks Matthew for the analysis.
>>>
>>> I think you missed something though.
>>>
>>> Right now the frustration is that unrelated intermittent bugs stop your
>>> presumably good change from getting in.
>>>
>>> Without gating, the result would be that even more bugs, many of them
>>> not intermittent at all, would get in. Right now, the one random developer
>>> who has to hunt down the rechecks and do them is inconvenienced. But
>>> without a gate, _every single_ developer will be inconvenienced until
>>> the fix is merged.
>>
>> The problem I see with this is that it's fundamentally not a fair system.
>>
>> If someone is trying to fix a bug in the libvirt driver, it's wrong to
>> expect them to try to debug issues with neutron being unstable.  They
>> likely don't have the skillset to do it, and we shouldn't expect them to
>> do so.  It's a waste of developer time.
> 
> Who is expecting the developer to debug issues with Neutron? It may be a
> waste of developer time to constantly recheck certain bugs (or no bug),
> but nobody is saying to the contributor of a libvirt fix "Hey, this
> unrelated Neutron bug is causing a failure, so go fix it."
> 
> The point of the gate is specifically to provide the sort of rigidity
> that unfortunately manifests itself in discomfort from developers.
> Perhaps you don't have the history of when we had no strict gate, and it
> was a frequent source of frustration that code would sail through to
> master that would routinely break master and branches of other OpenStack
> projects. I, for one, don't want to revisit the bad old days. As much of
> a pain as it is, the gate failures are a thorn in the side of folks
> precisely to push folks to fix the valid bugs that they highlight. What
> we need, like Sean said, is more folks fixing bugs and fewer folks
> working on features and vendor drivers.
> 
> Perhaps we, as a community, should make the bug triaging and fixing days
> a much more common thing? Maybe make Thursdays or Fridays dedicated bug
> days? How about monetary bug bounties being paid out by the OpenStack
> Foundation, with a payout scale based on the bug severity and
> importance? How about having dedicated bug-squashing teams that focus on
> a particular area of the code, that share their status reports at weekly
> meetings and on the ML?

Somewhat relevant here is a discussion we had last week in Darmstadt at
the Infra / QA Sprint; it even has a pretty picture (#notverypretty) -
https://dague.net/2014/07/22/openstack-failures/

I think fairness is one of those things that's hard to figure out here.
While it might not seem fair to a developer that they can't land their
patch, let's consider the alternative, where we turned off all the
testing (or limited it to only things we were 100% sure would not false
negative). In that environment the review teams would have to be far
more careful about what they approved, as there would be no backstop. So
I'd expect the review queue to grow by many integer multiples, and the
land time for patches to actually increase.

The alternative to the current state of "man it's annoying that my patch
gets killed by bugs sometimes" isn't "yay I'm landing all the codes!",
it's probably "hmmm, how do I get anyone to look at my code, it's been
up for review for 6 months." Especially for newer developers without a
track record who haven't built up trust.

This is basically what you see in Linux. We could always evolve the
community in that direction, but I'm not sure it's what people actually
want. But in Linux if you show up as a new person the chance of anyone
reviewing your code is effectively 0%.

Every systemic change we've ever made to the gating system has had 2nd
and 3rd order effects, some we predicted, and some we didn't. Aren't
emergent systems fun? :)

For instance, when we implemented clean check, which demonstrably
decreased the gate queue length during rush times, many people felt
like the system was punishing them because their code had to make more
round trips through the system. But so does everyone else's, which means
some really dubious behavior by some of the core teams, approving code
that hadn't been tested recently, is now blocked. That behavior was one
of the contributing factors to the January backup. So while it means that
if you hit a bug your patch spends longer in the system, it also means
that if you don't, it is less likely to be stuck behind a ton of other
failing code.

	-Sean

-- 
Sean Dague
http://dague.net


