[openstack-dev] [gate] The gate: a failure analysis
Monty Taylor
mordred at inaugust.com
Mon Jul 28 21:37:25 UTC 2014
On 07/28/2014 02:32 PM, Angus Lees wrote:
> On Mon, 28 Jul 2014 10:22:07 AM Doug Hellmann wrote:
>> On Jul 28, 2014, at 2:52 AM, Angus Lees <gus at inodes.org> wrote:
>>> On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
>>>> On 07/21/2014 04:13 PM, Jay Pipes wrote:
>>>>> On 07/21/2014 02:03 PM, Clint Byrum wrote:
>>>>>> Thanks Matthew for the analysis.
>>>>>>
>>>>>> I think you missed something though.
>>>>>>
>>>>>> Right now the frustration is that unrelated intermittent bugs stop your
>>>>>> presumably good change from getting in.
>>>>>>
>>>>>> Without gating, the result would be that even more bugs, many of them not
>>>>>> intermittent at all, would get in. Right now, the one random developer
>>>>>> who has to hunt down the rechecks and do them is inconvenienced. But
>>>>>> without a gate, _every single_ developer will be inconvenienced until
>>>>>> the fix is merged.
>>>>>>
>>>>>> The false negative rate is _way_ too high. Nobody would disagree there.
>>>>>> However, adding more false negatives and allowing more people to ignore
>>>>>> the ones we already have seems like it would have the opposite effect:
>>>>>> now instead of annoying the people who hit the random intermittent bugs,
>>>>>> we'll be annoying _everybody_ as they hit the non-intermittent ones.
>>>>>
>>>>> +10
>>>>
>>>> Right, but perhaps there is a middle ground. We must not let in changes
>>>> that can't pass through the gate, but we can separate the problem of
>>>> constant rechecks consuming too many resources from that of constant
>>>> rechecks causing developer pain. If failures were deterministic we would skip the
>>>> failing tests until they were fixed. Unfortunately many of the common
>>>> failures can blow up any test, or even the whole process. Following on
>>>> what Sam said, what if we automatically reran jobs that failed in a
>>>> known way, and disallowed "recheck/reverify no bug"? Developers would
>>>> then have to track down what bug caused a failure or file a new one. But
>>>> they would have to do so much less frequently, and as more common
>>>> failures were catalogued it would become less and less frequent.
>>>>
>>>> Some might (reasonably) argue that this would be a bad thing because it
>>>> would reduce the incentive for people to fix bugs if there were less
>>>> pain being inflicted. But given how hard it is to track down these race
>>>> bugs, and that we as a community have no way to force time to be spent
>>>> on them, and that it does not appear that these bugs are causing real
>>>> systems to fall down (only our gating process), perhaps something
>>>> different should be considered?
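
What David is proposing could be sketched roughly like this (illustrative
only; the bug numbers, failure signatures, and runner hooks below are
hypothetical, and this is not how elastic-recheck or the gate are actually
wired up):

    import re

    # Hypothetical catalogue of known intermittent failures, keyed by bug.
    KNOWN_FAILURES = {
        "launchpad-bug-1234567": re.compile(r"Lock wait timeout exceeded"),
        "launchpad-bug-7654321": re.compile(r"Timed out waiting for a reply"),
    }

    def on_gate_failure(console_log, requeue_job, reject_recheck):
        """Re-run jobs that failed in a recognised way; otherwise force the
        developer to identify (or file) a bug before any recheck happens."""
        for bug, signature in KNOWN_FAILURES.items():
            if signature.search(console_log):
                requeue_job(reason=bug)  # automatic recheck, no human needed
                return
        # No known signature matched: "recheck/reverify no bug" is disallowed,
        # so the submitter has to name a bug (or file a new one) to proceed.
        reject_recheck("recheck requires a bug reference")
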
>>>
>>> So to pick an example dear to my heart, I've been working on removing
>>> these
>>> gate failures:
>>> http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
>>>
>>> .. caused by a bad interaction between eventlet and our default choice of
>>> mysql driver. It would also affect any real world deployment using mysql.
>>>
>>> The problem has been identified and the fix proposed for almost a month
>>> now, but actually fixing the gate jobs is still nowhere in sight. The
>>> fix is (pretty much) as easy as a pip install and a slightly modified
>>> database connection string.
>>> I look forward to a discussion of the meta-issues surrounding this, but it
>>> is not because no-one tracked down or fixed the bug :(
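
For readers following along, here is a minimal sketch of the fix Gus
describes, assuming it is the switch from the default MySQLdb driver to the
pure-Python MySQL Connector/Python. The URL, credentials and host are
placeholders, and the real change would land in devstack/oslo.db
configuration rather than application code:

    # pip install mysql-connector-python
    # (assuming the package is obtainable; Doug notes below it is not on PyPI)
    from sqlalchemy import create_engine

    # Default today: the C-based MySQLdb driver. It blocks the whole process
    # during database calls under eventlet, so a green thread can sit on row
    # locks long enough to trigger "Lock wait timeout exceeded" elsewhere.
    # engine = create_engine("mysql://nova:secret@127.0.0.1/nova")

    # Proposed: a pure-Python driver that eventlet can monkey-patch, so a
    # thread blocked on the database yields instead of stalling everyone.
    engine = create_engine("mysql+mysqlconnector://nova:secret@127.0.0.1/nova")
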
>>
>> I believe the main blocking issue right now is that Oracle doesn’t upload
>> that library to PyPI, and so our build-chain won’t be able to download it
>> as it is currently configured. I think the last I saw someone was going to
>> talk to Oracle about uploading the source. Have we heard back?
>
> Yes, positive conversations are underway and we'll get there eventually. My
> point was also about apparent priorities, however. If addressing gate
> failures was *urgent*, we wouldn't wait for such a conversation to complete
> before making our own workarounds(*). I don't feel we (as a group) are
> sufficiently terrified of false negatives.
>
> (*) Indeed, the affected devstack gate tests install mysqlconnector via
> debs/rpms. I think only the oslo.db "opportunistic tests" talk to mysql via
> pip-installed packages, and these don't also use eventlet.
Honestly, I think devstack installing it from apt/yum is fine.
>>
>> Doug
>>
>>> - Gus
>>>
>>>> -David
>>>>
>>>>> Best,
>>>>> -jay
>>>>>
>>>>>> Excerpts from Matthew Booth's message of 2014-07-21 03:38:07 -0700:
>>>>>>> On Friday evening I had a dependent series of 5 changes all with
>>>>>>> approval waiting to be merged. These were all refactor changes in the
>>>>>>> VMware driver. The changes were:
>>>>>>>
>>>>>>> * VMware: DatastorePath join() and __eq__()
>>>>>>> https://review.openstack.org/#/c/103949/
>>>>>>>
>>>>>>> * VMware: use datastore classes get_allowed_datastores/_sub_folder
>>>>>>> https://review.openstack.org/#/c/103950/
>>>>>>>
>>>>>>> * VMware: use datastore classes in file_move/delete/exists, mkdir
>>>>>>> https://review.openstack.org/#/c/103951/
>>>>>>>
>>>>>>> * VMware: Trivial indentation cleanups in vmops
>>>>>>> https://review.openstack.org/#/c/104149/
>>>>>>>
>>>>>>> * VMware: Convert vmops to use instance as an object
>>>>>>> https://review.openstack.org/#/c/104144/
>>>>>>>
>>>>>>> The last change merged this morning.
>>>>>>>
>>>>>>> In order to merge these changes, over the weekend I manually
>>>>>>> submitted:
>>>>>>>
>>>>>>> * 35 rechecks due to false negatives, an average of 7 per change
>>>>>>> * 19 resubmissions after a change passed, but its dependency did not
>>>>>>>
>>>>>>> Other interesting numbers:
>>>>>>>
>>>>>>> * 16 unique bugs
>>>>>>> * An 87% false negative rate
>>>>>>> * 0 bugs found in the change under test
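
One way to reconcile those figures (assuming each of the 5 changes needed
exactly one passing run on top of its rechecks):

    rechecks = 35                  # gate runs that failed on unrelated bugs
    merges = 5                     # one eventual passing run per change
    total_runs = rechecks + merges
    rate = rechecks / total_runs   # 35 / 40 = 0.875, i.e. roughly 87%
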
>>>>>>>
>>>>>>> Because we don't fail fast, that is an average of at least 7.3 hours in
>>>>>>> the gate. Much more in fact, because some runs fail on the second pass,
>>>>>>> not the first. Because we don't resubmit automatically, that is only if
>>>>>>> a developer is actively monitoring the process continuously and
>>>>>>> resubmits immediately on failure. In practice this is much longer,
>>>>>>> because sometimes we have to sleep.
>>>>>>>
>>>>>>> All of the above numbers are counted from the change receiving an
>>>>>>> approval +2 until final merging. There were far more failures than this
>>>>>>> during the approval process.
>>>>>>>
>>>>>>> Why do we test individual changes in the gate? The purpose is to find
>>>>>>> errors *in the change under test*. By the above numbers, it has failed
>>>>>>> to achieve this at least 16 times previously.
>>>>>>>
>>>>>>> Probability of finding a bug in the change under test: Small
>>>>>>> Cost of testing: High
>>>>>>> Opportunity cost of slowing development: High
>>>>>>>
>>>>>>> and for comparison:
>>>>>>>
>>>>>>> Cost of reverting rare false positives: Small
>>>>>>>
>>>>>>> The current process expends a lot of resources, and does not achieve its
>>>>>>> goal of finding bugs *in the changes under test*. In addition to using a
>>>>>>> lot of technical resources, it also prevents good change from making its
>>>>>>> way into the project and, not unimportantly, saps the will to live of
>>>>>>> its victims. The cost of the process is overwhelmingly greater than its
>>>>>>> benefits. The gate process as it stands is a significant net negative to
>>>>>>> the project.
>>>>>>>
>>>>>>> Does this mean that it is worthless to run these tests? Absolutely not!
>>>>>>> These tests are vital to highlight a severe quality deficiency in
>>>>>>> OpenStack. Not addressing this is, imho, an existential risk to the
>>>>>>> project. However, the current approach is to pick contributors from the
>>>>>>> community at random and hold them personally responsible for project
>>>>>>> bugs selected at random. Not only has this approach failed, it is
>>>>>>> impractical, unreasonable, and poisonous to the community at large. It
>>>>>>> is also unrelated to the purpose of gate testing, which is to find bugs
>>>>>>> *in the changes under test*.
>>>>>>>
>>>>>>> I would like to make the radical proposal that we stop gating on CI
>>>>>>> failures. We will continue to run them on every change, but only after
>>>>>>> the change has been successfully merged.
>>>>>>>
>>>>>>> Benefits:
>>>>>>> * Without rechecks, the gate will use 8 times fewer resources.
>>>>>>> * Log analysis is still available to indicate the emergence of races.
>>>>>>> * Fixes can be merged quicker.
>>>>>>> * Vastly less developer time spent monitoring gate failures.
>>>>>>>
>>>>>>> Costs:
>>>>>>> * A rare class of merge bug will make it into master.
>>>>>>>
>>>>>>> Note that the benefits above will also offset the cost of resolving this
>>>>>>> rare class of merge bug.
>>>>>>>
>>>>>>> Of course, we still have the problem of finding resources to monitor and
>>>>>>> fix CI failures. An additional benefit of not gating on CI will be that
>>>>>>> we can no longer pretend that picking developers for project-affecting
>>>>>>> bugs by lottery is likely to achieve results. As a project we need to
>>>>>>> understand the importance of CI failures. We need a proper negotiation
>>>>>>> with contributors to staff a team dedicated to the problem. We can then
>>>>>>> use the review process to ensure that the right people have an incentive
>>>>>>> to prioritise bug fixes.
>>>>>>>
>>>>>>> Matt