[openstack-dev] [gate] The gate: a failure analysis

Jay Pipes jaypipes at gmail.com
Mon Jul 21 20:13:14 UTC 2014


On 07/21/2014 02:03 PM, Clint Byrum wrote:
> Thanks Matthew for the analysis.
>
> I think you missed something though.
>
> Right now the frustration is that unrelated intermittent bugs stop your
> presumably good change from getting in.
>
> Without gating, the result would be that even more bugs, many of them not
> intermittent at all, would get in. Right now, the one random developer
> who has to hunt down the rechecks and do them is inconvenienced. But
> without a gate, _every single_ developer will be inconvenienced until
> the fix is merged.
>
> The false negative rate is _way_ too high. Nobody would disagree there.
> However, adding more false negatives and allowing more people to ignore
> the ones we already have, seems like it would have the opposite effect:
> Now instead of annoying the people who hit the random intermittent bugs,
> we'll be annoying _everybody_ as they hit the non-intermittent ones.

+10

Best,
-jay

> Excerpts from Matthew Booth's message of 2014-07-21 03:38:07 -0700:
>> On Friday evening I had a dependent series of 5 changes all with
>> approval waiting to be merged. These were all refactor changes in the
>> VMware driver. The changes were:
>>
>> * VMware: DatastorePath join() and __eq__()
>> https://review.openstack.org/#/c/103949/
>>
>> * VMware: use datastore classes get_allowed_datastores/_sub_folder
>> https://review.openstack.org/#/c/103950/
>>
>> * VMware: use datastore classes in file_move/delete/exists, mkdir
>> https://review.openstack.org/#/c/103951/
>>
>> * VMware: Trivial indentation cleanups in vmops
>> https://review.openstack.org/#/c/104149/
>>
>> * VMware: Convert vmops to use instance as an object
>> https://review.openstack.org/#/c/104144/
>>
>> The last change merged this morning.
>>
>> In order to merge these changes, over the weekend I manually submitted:
>>
>> * 35 rechecks due to false negatives, an average of 7 per change
>> * 19 resubmissions after a change passed, but its dependency did not
>>
>> Other interesting numbers:
>>
>> * 16 unique bugs
>> * An 87% false negative rate
>> * 0 bugs found in the change under test
>>
>> Because we don't fail fast, that is an average of at least 7.3 hours in
>> the gate. Much more in fact, because some runs fail on the second pass,
>> not the first. Because we don't resubmit automatically, that is only if
>> a developer is actively monitoring the process continuously, and
>> resubmits immediately on failure. In practice this is much longer,
>> because sometimes we have to sleep.
>>
>> All of the above numbers are counted from the change receiving an
>> approval +2 until final merging. There were far more failures than this
>> during the approval process.
>>
>> Why do we test individual changes in the gate? The purpose is to find
>> errors *in the change under test*. By the above numbers, it has failed
>> to achieve this at least 16 times previously.
>>
>> Probability of finding a bug in the change under test: Small
>> Cost of testing:                                       High
>> Opportunity cost of slowing development:               High
>>
>> and for comparison:
>>
>> Cost of reverting rare false positives:                Small
>>
>> The current process expends a lot of resources, and does not achieve its
>> goal of finding bugs *in the changes under test*. In addition to using a
>> lot of technical resources, it also prevents good changes from making their
>> way into the project and, not unimportantly, saps the will to live of
>> its victims. The cost of the process is overwhelmingly greater than its
>> benefits. The gate process as it stands is a significant net negative to
>> the project.
>>
>> Does this mean that it is worthless to run these tests? Absolutely not!
>> These tests are vital to highlight a severe quality deficiency in
>> OpenStack. Not addressing this is, imho, an existential risk to the
>> project. However, the current approach is to pick contributors from the
>> community at random and hold them personally responsible for project
>> bugs selected at random. Not only has this approach failed, it is
>> impractical, unreasonable, and poisonous to the community at large. It
>> is also unrelated to the purpose of gate testing, which is to find bugs
>> *in the changes under test*.
>>
>> I would like to make the radical proposal that we stop gating on CI
>> failures. We will continue to run them on every change, but only after
>> the change has been successfully merged.
>>
>> Benefits:
>> * Without rechecks, the gate will use 8 times fewer resources.
>> * Log analysis is still available to indicate the emergence of races.
>> * Fixes can be merged more quickly.
>> * Vastly less developer time spent monitoring gate failures.
>>
>> Costs:
>> * A rare class of merge bug will make it into master.
>>
>> Note that the benefits above will also offset the cost of resolving this
>> rare class of merge bug.
>>
>> Of course, we still have the problem of finding resources to monitor and
>> fix CI failures. An additional benefit of not gating on CI will be that
>> we can no longer pretend that picking developers for project-affecting
>> bugs by lottery is likely to achieve results. As a project we need to
>> understand the importance of CI failures. We need a proper negotiation
>> with contributors to staff a team dedicated to the problem. We can then
>> use the review process to ensure that the right people have an incentive
>> to prioritise bug fixes.
>>
>> Matt
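As a quick check on the arithmetic in Matthew's message, the sketch below reproduces the 87% false negative rate and the "8 times fewer resources" figure from the 35 rechecks and 5 changes he reports. It assumes each change needed exactly one passing gate run to merge, and it leaves out the 19 dependency resubmissions; the variable names are illustrative and are not part of any OpenStack tooling.

```python
# Figures taken from the quoted message.
changes = 5
rechecks = 35  # gate runs that failed on bugs unrelated to the change

# Assumption: each change needed exactly one passing gate run to merge.
passing_runs = changes

total_runs = rechecks + passing_runs

# Share of gate runs that failed although the change under test was fine.
false_negative_rate = rechecks / total_runs

# Average number of rechecks a single change needed.
rechecks_per_change = rechecks / changes

# How many times more runs were used than a recheck-free gate would need.
resource_factor = total_runs / passing_runs

print(f"total gate runs:      {total_runs}")
print(f"false negative rate:  {false_negative_rate:.1%}")
print(f"rechecks per change:  {rechecks_per_change:.1f}")
print(f"resource factor:      {resource_factor:.0f}x")
```

Under these assumptions the rate comes out at 35/40, which the message rounds down to 87%, and the 40 runs against 5 necessary ones give the factor of 8.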