[openstack-dev] [gate] The gate: a failure analysis

David Kranz dkranz at redhat.com
Mon Jul 21 20:39:49 UTC 2014


On 07/21/2014 04:13 PM, Jay Pipes wrote:
> On 07/21/2014 02:03 PM, Clint Byrum wrote:
>> Thanks Matthew for the analysis.
>>
>> I think you missed something though.
>>
>> Right now the frustration is that unrelated intermittent bugs stop your
>> presumably good change from getting in.
>>
>> Without gating, the result would be that even more bugs, many of them
>> not intermittent at all, would get in. Right now, the one random developer
>> who has to hunt down the rechecks and do them is inconvenienced. But
>> without a gate, _every single_ developer will be inconvenienced until
>> the fix is merged.
>>
>> The false negative rate is _way_ too high. Nobody would disagree there.
>> However, adding more false negatives and allowing more people to ignore
>> the ones we already have seems like it would have the opposite effect:
>> Now instead of annoying the people who hit the random intermittent bugs,
>> we'll be annoying _everybody_ as they hit the non-intermittent ones.
>
> +10
>
Right, but perhaps there is a middle ground. We must not let in changes
that can't pass through the gate, but we can separate two problems:
constant rechecks using too many resources, and constant rechecks
causing developer pain. If failures were deterministic, we would skip the
failing tests until they were fixed. Unfortunately, many of the common
failures can blow up any test, or even the whole process. Following on
from what Sam said, what if we automatically reran jobs that failed in a
known way, and disallowed "recheck/reverify no bug"? Developers would
then have to track down which bug caused a failure or file a new one. But
they would have to do so much less often, and as more common failures
were catalogued it would become rarer still.
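
To make the idea concrete, here is a rough sketch of the policy, not a
description of how the gate tooling works today. Everything in it is an
assumption made up for illustration: the log signatures, the placeholder
bug ids, the rerun cap, and the function names. It only shows the two
rules: a failure matching a catalogued signature is rerun automatically,
and a manual recheck is accepted only if it names a bug.

```python
import re
from typing import Optional, Tuple

# Hypothetical catalogue mapping a log signature (regex) to a bug id.
# Both the patterns and the ids are invented for this sketch.
KNOWN_FAILURES = {
    r"Timed out waiting for server .* to become ACTIVE": "bug-0001",
    r"Connection to .* via SSH timed out": "bug-0002",
}

# Assumed cap so a persistent failure cannot rerun forever.
MAX_AUTO_RERUNS = 3

# Accepts "recheck bug 123" / "reverify bug 123"; also parses "no bug"
# so that it can be rejected explicitly.
RECHECK_RE = re.compile(
    r"^(recheck|reverify)\s+(no bug|bug\s+(\d+))\s*$", re.IGNORECASE
)


def match_known_failure(log_text: str) -> Optional[str]:
    """Return the bug id of the first catalogued signature found, if any."""
    for pattern, bug_id in KNOWN_FAILURES.items():
        if re.search(pattern, log_text):
            return bug_id
    return None


def on_job_failure(log_text: str, reruns_so_far: int = 0) -> Tuple[str, Optional[str]]:
    """Decide what to do with a failed job.

    Known failure under the cap: rerun without bothering the developer.
    Anything else: the developer has to find or file a bug.
    """
    bug_id = match_known_failure(log_text)
    if bug_id is not None and reruns_so_far < MAX_AUTO_RERUNS:
        return ("rerun", bug_id)
    return ("needs-triage", bug_id)


def accept_recheck_comment(comment: str) -> bool:
    """Allow a manual recheck only when the comment names a bug number."""
    m = RECHECK_RE.match(comment.strip())
    return m is not None and m.group(3) is not None
```

With this, on_job_failure() on a log containing a catalogued signature
returns ("rerun", <bug id>), while accept_recheck_comment("recheck no bug")
is rejected and accept_recheck_comment("recheck bug 123456") is allowed.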

Some might (reasonably) argue that this would be a bad thing, because
inflicting less pain would reduce the incentive for people to fix bugs.
But these race bugs are hard to track down, we as a community have no
way to force time to be spent on them, and they do not appear to be
causing real systems to fall down (only our gating process). Given all
that, perhaps something different should be considered?

  -David

> Best,
> -jay
>
>> Excerpts from Matthew Booth's message of 2014-07-21 03:38:07 -0700:
>>> On Friday evening I had a dependent series of 5 changes all with
>>> approval waiting to be merged. These were all refactor changes in the
>>> VMware driver. The changes were:
>>>
>>> * VMware: DatastorePath join() and __eq__()
>>> https://review.openstack.org/#/c/103949/
>>>
>>> * VMware: use datastore classes get_allowed_datastores/_sub_folder
>>> https://review.openstack.org/#/c/103950/
>>>
>>> * VMware: use datastore classes in file_move/delete/exists, mkdir
>>> https://review.openstack.org/#/c/103951/
>>>
>>> * VMware: Trivial indentation cleanups in vmops
>>> https://review.openstack.org/#/c/104149/
>>>
>>> * VMware: Convert vmops to use instance as an object
>>> https://review.openstack.org/#/c/104144/
>>>
>>> The last change merged this morning.
>>>
>>> In order to merge these changes, over the weekend I manually submitted:
>>>
>>> * 35 rechecks due to false negatives, an average of 7 per change
>>> * 19 resubmissions after a change passed, but its dependency did not
>>>
>>> Other interesting numbers:
>>>
>>> * 16 unique bugs
>>> * An 87% false negative rate
>>> * 0 bugs found in the change under test
>>>
>>> Because we don't fail fast, that is an average of at least 7.3 hours in
>>> the gate. Much more in fact, because some runs fail on the second pass,
>>> not the first. Because we don't resubmit automatically, that is only if
>>> a developer is actively monitoring the process continuously, and
>>> resubmits immediately on failure. In practise this is much longer,
>>> because sometimes we have to sleep.
>>>
>>> All of the above numbers are counted from the change receiving an
>>> approval +2 until final merging. There were far more failures than this
>>> during the approval process.
>>>
>>> Why do we test individual changes in the gate? The purpose is to find
>>> errors *in the change under test*. By the above numbers, it has failed
>>> to achieve this at least 16 times previously.
>>>
>>> Probability of finding a bug in the change under test: Small
>>> Cost of testing:                                       High
>>> Opportunity cost of slowing development:               High
>>>
>>> and for comparison:
>>>
>>> Cost of reverting rare false positives:                Small
>>>
>>> The current process expends a lot of resources, and does not achieve its
>>> goal of finding bugs *in the changes under test*. In addition to using a
>>> lot of technical resources, it also prevents good change from making its
>>> way into the project and, not unimportantly, saps the will to live of
>>> its victims. The cost of the process is overwhelmingly greater than its
>>> benefits. The gate process as it stands is a significant net negative to
>>> the project.
>>>
>>> Does this mean that it is worthless to run these tests? Absolutely not!
>>> These tests are vital to highlight a severe quality deficiency in
>>> OpenStack. Not addressing this is, imho, an existential risk to the
>>> project. However, the current approach is to pick contributors from the
>>> community at random and hold them personally responsible for project
>>> bugs selected at random. Not only has this approach failed, it is
>>> impractical, unreasonable, and poisonous to the community at large. It
>>> is also unrelated to the purpose of gate testing, which is to find bugs
>>> *in the changes under test*.
>>>
>>> I would like to make the radical proposal that we stop gating on CI
>>> failures. We will continue to run them on every change, but only after
>>> the change has been successfully merged.
>>>
>>> Benefits:
>>> * Without rechecks, the gate will use 8 times fewer resources.
>>> * Log analysis is still available to indicate the emergence of races.
>>> * Fixes can be merged quicker.
>>> * Vastly less developer time spent monitoring gate failures.
>>>
>>> Costs:
>>> * A rare class of merge bug will make it into master.
>>>
>>> Note that the benefits above will also offset the cost of resolving this
>>> rare class of merge bug.
>>>
>>> Of course, we still have the problem of finding resources to monitor and
>>> fix CI failures. An additional benefit of not gating on CI will be that
>>> we can no longer pretend that picking developers for project-affecting
>>> bugs by lottery is likely to achieve results. As a project we need to
>>> understand the importance of CI failures. We need a proper negotiation
>>> with contributors to staff a team dedicated to the problem. We can then
>>> use the review process to ensure that the right people have an incentive
>>> to prioritise bug fixes.
>>>
>>> Matt
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>



