[openstack-dev] [gate] The gate: a failure analysis

Matthew Booth mbooth at redhat.com
Mon Jul 21 10:38:07 UTC 2014


On Friday evening I had a dependent series of 5 changes all with
approval waiting to be merged. These were all refactor changes in the
VMware driver. The changes were:

* VMware: DatastorePath join() and __eq__()
https://review.openstack.org/#/c/103949/

* VMware: use datastore classes get_allowed_datastores/_sub_folder
https://review.openstack.org/#/c/103950/

* VMware: use datastore classes in file_move/delete/exists, mkdir
https://review.openstack.org/#/c/103951/

* VMware: Trivial indentation cleanups in vmops
https://review.openstack.org/#/c/104149/

* VMware: Convert vmops to use instance as an object
https://review.openstack.org/#/c/104144/

The last change merged this morning.

In order to merge these changes, over the weekend I manually submitted:

* 35 rechecks due to false negatives, an average of 7 per change
* 19 resubmissions after a change passed, but its dependency did not

Other interesting numbers:

* 16 unique bugs
* An 87% false negative rate
* 0 bugs found in the change under test

Because we don't fail fast, that is an average of at least 7.3 hours in
the gate. Much more in fact, because some runs fail on the second pass,
not the first. Because we don't resubmit automatically, that figure
holds only if a developer monitors the process continuously and
resubmits immediately on failure. In practice it is much longer,
because sometimes we have to sleep.
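
The 7.3 hour figure can be sketched the same way. The email does not
state how long a single gate run takes, so RUN_MINUTES below is a
hypothetical value chosen because it is consistent with the quoted
average; it is an assumption, not a measured number.

```python
# From the email: 35 rechecks spread over 5 changes.
failed_runs_per_change = 35 / 5

# Each change also needs one passing run before it can merge.
runs_per_change = failed_runs_per_change + 1

# Hypothetical duration of one full gate run, in minutes. Without
# fail-fast, a failing run is assumed to cost as much as a passing one.
RUN_MINUTES = 55

hours_in_gate = runs_per_change * RUN_MINUTES / 60
print(f"minimum average time in gate: {hours_in_gate:.1f} hours")
```

This is a lower bound: it ignores queueing, runs that fail on a later
pass, and any delay between a failure and the next recheck.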

All of the above numbers are counted from the change receiving an
approval +2 until final merging. There were far more failures than this
during the approval process.

Why do we test individual changes in the gate? The purpose is to find
errors *in the change under test*. By the above numbers, it has failed
to achieve this at least 16 times previously.

Probability of finding a bug in the change under test: Small
Cost of testing:                                       High
Opportunity cost of slowing development:               High

and for comparison:

Cost of reverting rare false positives:                Small

The current process expends a lot of resources, and does not achieve its
goal of finding bugs *in the changes under test*. In addition to using a
lot of technical resources, it also prevents good changes from making
their way into the project and, not unimportantly, saps the will to live of
its victims. The cost of the process is overwhelmingly greater than its
benefits. The gate process as it stands is a significant net negative to
the project.

Does this mean that it is worthless to run these tests? Absolutely not!
These tests are vital to highlight a severe quality deficiency in
OpenStack. Not addressing this is, imho, an existential risk to the
project. However, the current approach is to pick contributors from the
community at random and hold them personally responsible for project
bugs selected at random. Not only has this approach failed, it is
impractical, unreasonable, and poisonous to the community at large. It
is also unrelated to the purpose of gate testing, which is to find bugs
*in the changes under test*.

I would like to make the radical proposal that we stop gating on CI
failures. We will continue to run the tests on every change, but only
after the change has been successfully merged.

Benefits:
* Without rechecks, the gate will use 8 times fewer resources.
* Log analysis is still available to indicate the emergence of races.
* Fixes can be merged more quickly.
* Vastly less developer time spent monitoring gate failures.

Costs:
* A rare class of merge bug will make it into master.

Note that the benefits above will also offset the cost of resolving this
rare class of merge bug.

Of course, we still have the problem of finding resources to monitor and
fix CI failures. An additional benefit of not gating on CI will be that
we can no longer pretend that picking developers for project-affecting
bugs by lottery is likely to achieve results. As a project we need to
understand the importance of CI failures. We need a proper negotiation
with contributors to staff a team dedicated to the problem. We can then
use the review process to ensure that the right people have an incentive
to prioritise bug fixes.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490


