[openstack-dev] [gate] The gate: a failure analysis

Doug Hellmann doug at doughellmann.com
Mon Jul 28 14:22:07 UTC 2014


On Jul 28, 2014, at 2:52 AM, Angus Lees <gus at inodes.org> wrote:

> On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
>> On 07/21/2014 04:13 PM, Jay Pipes wrote:
>>> On 07/21/2014 02:03 PM, Clint Byrum wrote:
>>>> Thanks Matthew for the analysis.
>>>> 
>>>> I think you missed something though.
>>>> 
>>>> Right now the frustration is that unrelated intermittent bugs stop your
>>>> presumably good change from getting in.
>>>> 
>>>> Without gating, the result would be that even more bugs, many of them
>>>> not intermittent at all, would get in. Right now, the one random developer
>>>> who has to hunt down the rechecks and do them is inconvenienced. But
>>>> without a gate, _every single_ developer will be inconvenienced until
>>>> the fix is merged.
>>>> 
>>>> The false negative rate is _way_ too high. Nobody would disagree there.
>>>> However, adding more false negatives and allowing more people to ignore
>>>> the ones we already have seems like it would have the opposite effect:
>>>> Now instead of annoying the people who hit the random intermittent bugs,
>>>> we'll be annoying _everybody_ as they hit the non-intermittent ones.
>>> 
>>> +10
>> 
>> Right, but perhaps there is a middle ground. We must not allow changes
>> in that can't pass through the gate, but we can separate the problems
>> of constant rechecks using too many resources, and of constant rechecks
>> causing developer pain. If failures were deterministic we would skip the
>> failing tests until they were fixed. Unfortunately many of the common
>> failures can blow up any test, or even the whole process. Following on
>> what Sam said, what if we automatically reran jobs that failed in a
>> known way, and disallowed "recheck/reverify no bug"? Developers would
>> then have to track down what bug caused a failure or file a new one. But
>> they would have to do so much less frequently, and as more common
>> failures were catalogued it would become less and less frequent.
>> 
>> Some might (reasonably) argue that this would be a bad thing because it
>> would reduce the incentive for people to fix bugs if there were less
>> pain being inflicted. But given how hard it is to track down these race
>> bugs, and that we as a community have no way to force time to be spent
>> on them, and that it does not appear that these bugs are causing real
>> systems to fall down (only our gating process), perhaps something
>> different should be considered?
> 
> So to pick an example dear to my heart, I've been working on removing these 
> gate failures:
> http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
> 
> .. caused by a bad interaction between eventlet and our default choice of 
> mysql driver.  It would also affect any real world deployment using mysql.
> 
> The problem was identified and a fix proposed almost a month ago, but
> actually fixing the gate jobs is still nowhere in sight.  The fix is (pretty
> much) as easy as a pip install and a slightly modified database connection
> string.
> I look forward to a discussion of the meta-issues surrounding this, but the
> failures persist not because no one tracked down or fixed the bug :(

I believe the main blocking issue right now is that Oracle doesn’t upload that library to PyPI, so our build chain, as currently configured, won’t be able to download it. The last I saw, someone was going to talk to Oracle about uploading the source. Have we heard back?
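
For anyone following along, here is a rough sketch of what "a slightly modified database connection string" could look like. It assumes SQLAlchemy-style URLs and assumes the library in question is MySQL Connector/Python (SQLAlchemy dialect name "mysqlconnector"); the thread does not name the driver, and the helper name and sample URLs are hypothetical.

```python
# Hypothetical helper: swap the driver part of a SQLAlchemy-style database URL.
# Assumption: the replacement driver is MySQL Connector/Python, which SQLAlchemy
# addresses as "mysql+mysqlconnector://". The thread itself does not name it.


def use_driver(connection_url: str, driver: str = "mysqlconnector") -> str:
    """Return connection_url with its MySQL driver replaced by `driver`.

    URLs for other database backends are returned unchanged.
    """
    scheme, sep, rest = connection_url.partition("://")
    if not sep:
        raise ValueError("not a database URL: %r" % connection_url)

    # The scheme is either "backend" or "backend+driver".
    backend = scheme.split("+", 1)[0]
    if backend != "mysql":
        return connection_url

    return f"{backend}+{driver}://{rest}"


if __name__ == "__main__":
    # Sample URL only; credentials and host are made up.
    print(use_driver("mysql://nova:secret@127.0.0.1/nova?charset=utf8"))
```

The driver itself would still have to be installed separately, which is where the PyPI question comes in.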

Doug

> 
> - Gus
> 
>>  -David
>> 
>>> Best,
>>> -jay
>>> 
>>>> Excerpts from Matthew Booth's message of 2014-07-21 03:38:07 -0700:
>>>>> On Friday evening I had a dependent series of 5 changes all with
>>>>> approval waiting to be merged. These were all refactor changes in the
>>>>> VMware driver. The changes were:
>>>>> 
>>>>> * VMware: DatastorePath join() and __eq__()
>>>>> https://review.openstack.org/#/c/103949/
>>>>> 
>>>>> * VMware: use datastore classes get_allowed_datastores/_sub_folder
>>>>> https://review.openstack.org/#/c/103950/
>>>>> 
>>>>> * VMware: use datastore classes in file_move/delete/exists, mkdir
>>>>> https://review.openstack.org/#/c/103951/
>>>>> 
>>>>> * VMware: Trivial indentation cleanups in vmops
>>>>> https://review.openstack.org/#/c/104149/
>>>>> 
>>>>> * VMware: Convert vmops to use instance as an object
>>>>> https://review.openstack.org/#/c/104144/
>>>>> 
>>>>> The last change merged this morning.
>>>>> 
>>>>> In order to merge these changes, over the weekend I manually submitted:
>>>>> 
>>>>> * 35 rechecks due to false negatives, an average of 7 per change
>>>>> * 19 resubmissions after a change passed, but its dependency did not
>>>>> 
>>>>> Other interesting numbers:
>>>>> 
>>>>> * 16 unique bugs
>>>>> * An 87% false negative rate
>>>>> * 0 bugs found in the change under test
>>>>> 
>>>>> Because we don't fail fast, that is an average of at least 7.3 hours in
>>>>> the gate. Much more in fact, because some runs fail on the second pass,
>>>>> not the first. Because we don't resubmit automatically, that is only if
>>>>> a developer is actively monitoring the process continuously, and
>>>>> resubmits immediately on failure. In practice this is much longer,
>>>>> because sometimes we have to sleep.
>>>>> 
>>>>> All of the above numbers are counted from the change receiving an
>>>>> approval +2 until final merging. There were far more failures than this
>>>>> during the approval process.
>>>>> 
>>>>> Why do we test individual changes in the gate? The purpose is to find
>>>>> errors *in the change under test*. By the above numbers, it has failed
>>>>> to achieve this at least 16 times previously.
>>>>> 
>>>>> Probability of finding a bug in the change under test: Small
>>>>> Cost of testing:                                       High
>>>>> Opportunity cost of slowing development:               High
>>>>> 
>>>>> and for comparison:
>>>>> 
>>>>> Cost of reverting rare false positives:                Small
>>>>> 
>>>>> The current process expends a lot of resources, and does not achieve its
>>>>> goal of finding bugs *in the changes under test*. In addition to using a
>>>>> lot of technical resources, it also prevents good changes from making
>>>>> their way into the project and, not unimportantly, saps the will to live
>>>>> of its victims. The cost of the process is overwhelmingly greater than its
>>>>> benefits. The gate process as it stands is a significant net negative to
>>>>> the project.
>>>>> 
>>>>> Does this mean that it is worthless to run these tests? Absolutely not!
>>>>> These tests are vital to highlight a severe quality deficiency in
>>>>> OpenStack. Not addressing this is, imho, an existential risk to the
>>>>> project. However, the current approach is to pick contributors from the
>>>>> community at random and hold them personally responsible for project
>>>>> bugs selected at random. Not only has this approach failed, it is
>>>>> impractical, unreasonable, and poisonous to the community at large. It
>>>>> is also unrelated to the purpose of gate testing, which is to find bugs
>>>>> *in the changes under test*.
>>>>> 
>>>>> I would like to make the radical proposal that we stop gating on CI
>>>>> failures. We will continue to run the tests on every change, but only
>>>>> after the change has been successfully merged.
>>>>> 
>>>>> Benefits:
>>>>> * Without rechecks, the gate will use 8 times fewer resources.
>>>>> * Log analysis is still available to indicate the emergence of races.
>>>>> * Fixes can be merged more quickly.
>>>>> * Vastly less developer time spent monitoring gate failures.
>>>>> 
>>>>> Costs:
>>>>> * A rare class of merge bug will make it into master.
>>>>> 
>>>>> Note that the benefits above will also offset the cost of resolving this
>>>>> rare class of merge bug.
>>>>> 
>>>>> Of course, we still have the problem of finding resources to monitor and
>>>>> fix CI failures. An additional benefit of not gating on CI will be that
>>>>> we can no longer pretend that picking developers for project-affecting
>>>>> bugs by lottery is likely to achieve results. As a project we need to
>>>>> understand the importance of CI failures. We need a proper negotiation
>>>>> with contributors to staff a team dedicated to the problem. We can then
>>>>> use the review process to ensure that the right people have an incentive
>>>>> to prioritise bug fixes.
>>>>> 
>>>>> Matt
>>>> 
>>>> _______________________________________________
>>>> OpenStack-dev mailing list
>>>> OpenStack-dev at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 
> -- 
> - Gus



