[openstack-dev] [gate] The gate: a failure analysis

Sean Dague sean at dague.net
Mon Jul 21 21:41:33 UTC 2014


On 07/21/2014 04:39 PM, David Kranz wrote:
> On 07/21/2014 04:13 PM, Jay Pipes wrote:
>> On 07/21/2014 02:03 PM, Clint Byrum wrote:
>>> Thanks Matthew for the analysis.
>>>
>>> I think you missed something though.
>>>
>>> Right now the frustration is that unrelated intermittent bugs stop your
>>> presumably good change from getting in.
>>>
>>> Without gating, the result would be that even more bugs, many of them
>>> not intermittent at all, would get in. Right now, the one random developer
>>> who has to hunt down the rechecks and do them is inconvenienced. But
>>> without a gate, _every single_ developer will be inconvenienced until
>>> the fix is merged.
>>>
>>> The false negative rate is _way_ too high. Nobody would disagree there.
>>> However, adding more false negatives and allowing more people to ignore
>>> the ones we already have, seems like it would have the opposite effect:
>>> Now instead of annoying the people who hit the random intermittent bugs,
>>> we'll be annoying _everybody_ as they hit the non-intermittent ones.
>>
>> +10
>>
> Right, but perhaps there is a middle ground. We must not allow changes
> in that can't pass through the gate, but we can separate the problems
> of constant rechecks using too many resources, and of constant rechecks
> causing developer pain. If failures were deterministic we would skip the
> failing tests until they were fixed. Unfortunately many of the common
> failures can blow up any test, or even the whole process. Following on
> what Sam said, what if we automatically reran jobs that failed in a
> known way, and disallowed "recheck/reverify no bug"? Developers would
> then have to track down what bug caused a failure or file a new one. But
> they would have to do so much less frequently, and as more common
> failures were catalogued it would become less and less frequent.

Elastic Recheck was never meant for this purpose. It doesn't tell you
all the bugs that were in your job; it just tells you possibly one bug
that might have caused something to go wrong. There is no guarantee
there weren't other bugs in there as well. Consider it a fail-open solution.
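
To illustrate the fail-open point, here is a minimal sketch. It is not
Elastic Recheck's actual code: the bug ids, log patterns and function
names are all made up, and stopping at the first match is an assumption
used only to show the shape of the problem.

```python
import re

# Hypothetical known-failure signatures (bug id -> log pattern).
# Neither the ids nor the patterns come from a real bug tracker.
KNOWN_SIGNATURES = {
    "bug-A": re.compile(r"Timed out waiting for .* to become ACTIVE"),
    "bug-B": re.compile(r"SSHTimeout"),
}


def first_known_bug(log_text):
    """Return the first known bug whose signature matches, or None."""
    for bug_id, pattern in KNOWN_SIGNATURES.items():
        if pattern.search(log_text):
            return bug_id
    return None


def all_known_bugs(log_text):
    """Return every known bug whose signature matches the log."""
    return [b for b, p in KNOWN_SIGNATURES.items() if p.search(log_text)]


# A failed job whose log contains two known failures and one failure
# that nobody has written a signature for yet.
log = "\n".join([
    "Timed out waiting for server to become ACTIVE",
    "SSHTimeout: connection to 10.0.0.2 failed",
    "KeyError: 'unclassified failure'",
])

print(first_known_bug(log))  # one candidate bug is reported
print(all_known_bugs(log))   # the KeyError is still invisible here
```

A match only says that one known signature showed up in the log. An
automatic rerun keyed on that match would also wave through the
unclassified failure sitting next to it.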

> Some might (reasonably) argue that this would be a bad thing because it
> would reduce the incentive for people to fix bugs if there were less
> pain being inflicted. But given how hard it is to track down these race
> bugs, and that we as a community have no way to force time to be spent
> on them, and that it does not appear that these bugs are causing real
> systems to fall down (only our gating process), perhaps something
> different should be considered?

I really beg to differ on that point. The Infra team will tell you how
terribly unreliable our cloud providers can be at times, hitting many of
the same issues that we expose in elastic recheck.

Lightly loaded / basically static environments will hit some of these
issues at a far lower rate. They are still out there, though, probably
largely ignored thanks to massive retry loops around our stuff.

Allocating a compute server that you can ssh to a dozen times in a test
run shouldn't be considered a moon shot level of function. That's kind
of table stakes for IaaS. :)
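
As rough arithmetic on why retry loops hide this while a test run does
not, here is a small sketch. The 5% per-call failure rate is a made-up
number, not a measurement, and the calls are assumed to fail
independently of each other.

```python
def masked_failure_rate(p_fail, attempts):
    """Chance that every one of `attempts` tries fails, so the caller sees an error."""
    return p_fail ** attempts


def run_failure_rate(p_fail, operations):
    """Chance that at least one of `operations` un-retried calls fails."""
    return 1 - (1 - p_fail) ** operations


P_FAIL = 0.05  # hypothetical per-call failure rate, for illustration only

print(f"caller retrying up to 3 times sees an error: "
      f"{masked_failure_rate(P_FAIL, 3):.6f}")
print(f"test run doing 12 ssh calls, no retries, fails: "
      f"{run_failure_rate(P_FAIL, 12):.3f}")
```

Under those assumptions, the same underlying flakiness that a retrying
client almost never notices fails a large share of test runs.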

And yes, it's hard to debug, but seriously, if the development community
can't figure out why OpenStack doesn't work, can anyone?

	-Sean

-- 
Sean Dague
http://dague.net
