[openstack-dev] Thoughts on the patch test failure rate and moving forward

John Dickinson me at not.mn
Thu Jul 24 22:36:13 UTC 2014


On Jul 24, 2014, at 3:25 PM, Sean Dague <sean at dague.net> wrote:

> On 07/24/2014 06:15 PM, Angus Salkeld wrote:
>> On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
>>> OpenStack has a substantial CI system that is core to its development
>>> process.  The goals of the system are to facilitate merging good code,
>>> prevent regressions, and ensure that there is at least one configuration
>>> of upstream OpenStack that we know works as a whole.  The "project
>>> gating" technique that we use is effective at preventing many kinds of
>>> regressions from landing, however more subtle, non-deterministic bugs
>>> can still get through, and these are the bugs that are currently
>>> plaguing developers with seemingly random test failures.
>>> 
>>> Most of these bugs are not failures of the test system; they are real
>>> bugs.  Many of them have even been in OpenStack for a long time, but are
>>> only becoming visible now due to improvements in our tests.  That's not
>>> much help to developers whose patches are being hit with negative test
>>> results from unrelated failures.  We need to find a way to address the
>>> non-deterministic bugs that are lurking in OpenStack without making it
>>> easier for new bugs to creep in.
>>> 
>>> The CI system and project infrastructure are not static.  They have
>>> evolved with the project to get to where they are today, and the
>>> challenge now is to continue to evolve them to address the problems
>>> we're seeing now.  The QA and Infrastructure teams recently hosted a
>>> sprint where we discussed some of these issues in depth.  This post from
>>> Sean Dague goes into a bit of the background: [1].  The rest of this
>>> email outlines the medium and long-term changes we would like to make to
>>> address these problems.
>>> 
>>> [1] https://dague.net/2014/07/22/openstack-failures/
>>> 
>>> ==Things we're already doing==
>>> 
>>> The elastic-recheck tool[2] is used to identify "random" failures in
>>> test runs.  It tries to match failures to known bugs using signatures
>>> created from log messages.  It helps developers prioritize bugs by how
>>> frequently they manifest as test failures.  It also collects information
>>> on unclassified errors -- we can see how many (and which) test runs
>>> failed for an unknown reason and our overall progress on finding
>>> fingerprints for random failures.
>>> 
>>> [2] http://status.openstack.org/elastic-recheck/
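>>> 
>>> As a rough illustration, a fingerprint boils down to an elasticsearch
>>> query kept in the elastic-recheck repository in a file named for the
>>> bug it identifies (the bug number and message here are made up):
>>> 
>>>     # queries/1234567.yaml -- hypothetical fingerprint; matches a
>>>     # distinctive log message from failed runs of a known bug
>>>     query: >-
>>>       message:"Timed out waiting for server to become ACTIVE"
>>>       AND tags:"console"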
>>> 
>>> We added a feature to Zuul that lets us manually "promote" changes to
>>> the top of the Gate pipeline.  When the QA team identifies a change that
>>> fixes a bug that is affecting overall gate stability, we can move that
>>> change to the top of the queue so that it may merge more quickly.
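>>> 
>>> For the curious, the promotion itself is a one-line invocation of the
>>> zuul client by someone with access to the Zuul server, along the
>>> lines of:
>>> 
>>>     zuul promote --pipeline gate --changes 12345,2
>>> 
>>> where 12345,2 stands in for a real change number and patchset.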
>>> 
>>> We added the clean check facility in reaction to the January gate
>>> breakdown. While it does mean that any individual patch might see more
>>> test runs, it has largely kept the gate queue at a countable number of
>>> hours instead of regularly growing to more than a work day in
>>> length. It also means that a developer can approve a code merge before
>>> test results have come back without ruining it for everyone else if
>>> the tests turn out to catch a bug.
>>> 
>>> ==Future changes==
>>> 
>>> ===Communication===
>>> We used to be better at communicating about the CI system.  As it and
>>> the project grew, we incrementally added to our institutional knowledge,
>>> but we haven't been good about maintaining that information in a form
>>> that new or existing contributors can consume to understand what's going
>>> on and why.
>>> 
>>> We have started on a major effort in that direction that we call the
>>> "infra-manual" project -- it's designed to be a comprehensive "user
>>> manual" for the project infrastructure, including the CI process.  Even
>>> before that project is complete, we will write a document that
>>> summarizes the CI system and ensure it is included in new developer
>>> documentation and linked to from test results.
>>> 
>>> There are also a number of ways for people to get involved in the CI
>>> system, whether focused on Infrastructure or QA, but it is not always
>>> clear how to do so.  We will improve our documentation to highlight how
>>> to contribute.
>>> 
>>> ===Fixing Faster===
>>> 
>>> We introduce bugs to OpenStack at some roughly constant rate, and they
>>> pile up over time. Our systems currently treat all changes as equally
>>> risky and equally important to the health of the system, which makes
>>> landing code changes that fix key bugs slow when we're at a high reset
>>> rate. We have a manual process for promoting changes today to get
>>> around this, but it is quite costly in people's time and requires
>>> getting all the right people together at once. You can see a number of
>>> the changes we promoted during the gate storm in June [3]; it was no
>>> small number of fixes that got us back to a reasonably passing
>>> gate. We think that optimizing this system will help us land fixes to
>>> critical bugs faster.
>>> 
>>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>> 
>>> The basic idea is to use the data from elastic-recheck to identify
>>> that a patch fixes a critical gate-related bug. When one of these is
>>> found in the queues it will be given higher priority, including
>>> bubbling up to the top of the gate queue automatically. The manual
>>> promote process should no longer be needed; instead, patches fixing
>>> elastic-recheck-tracked issues will be promoted automatically.
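>>> 
>>> A minimal sketch of what that automation could look like -- the
>>> queue.promote() call is a made-up stand-in for whatever Zuul exposes
>>> internally; only the Closes-Bug commit-message footer is a real
>>> convention:
>>> 
>>>     import re
>>> 
>>>     BUG_RE = re.compile(r'Closes-Bug:\s*#?(\d+)', re.IGNORECASE)
>>> 
>>>     def bugs_fixed_by(commit_message):
>>>         """Bug numbers a change claims to fix, per its footer."""
>>>         return {int(m) for m in BUG_RE.findall(commit_message)}
>>> 
>>>     def maybe_promote(change, tracked_gate_bugs, queue):
>>>         # tracked_gate_bugs: bugs with elastic-recheck fingerprints.
>>>         if bugs_fixed_by(change.commit_message) & tracked_gate_bugs:
>>>             queue.promote(change)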
>>> 
>>> At the same time we'll also promote review of critical gate bugs by
>>> making them visible in a number of different channels (like the
>>> elastic-recheck pages, review days, and the gerrit dashboards). The
>>> idea, again, is to make the reviews that fix key bugs pop to the top
>>> of everyone's views.
>>> 
>>> ===Testing more tactically===
>>> 
>>> One of the challenges that exists today is that we've got basically 2
>>> levels of testing in most of OpenStack: unit tests, and running a whole
>>> OpenStack cloud. Over time we've focused on adding more and more
>>> configurations and tests to the latter, but as we've seen, when things
>>> fail in a whole OpenStack cloud, getting to the root cause is often
>>> quite hard. So hard in fact that most people throw up their hands and
>>> just run 'recheck'. If a test run fails, and no one looks at why, does
>>> it provide any value?
>>> 
>>> We need to get to a balance where we are testing that OpenStack works
>>> as a whole in some configuration, but as we've seen, even our best and
>>> brightest can't seem to make OpenStack reliably boot a compute
>>> instance with working networking 100% of the time if we happen to be
>>> running more than one API request at once.
>>> 
>>> Getting there is a multi-party process:
>>> 
>>>  * Reduce the gating configurations down to some gold standard
>>>    configuration(s). This will be a small number of configurations that
>>>    we all agree that everything will gate on. This means things like
>>>    postgresql, cells, different environments will all get dropped from
>>>    the gate as we know it.
>>> 
>>>  * Put the burden for a bunch of these tests back on the projects as
>>>    "functional" tests. Basically, a project creates a custom devstack
>>>    environment with the minimal set of services it needs to do its
>>>    job. These functional tests will live in the project tree, not in
>>>    Tempest, so they can be landed atomically as part of the project's
>>>    normal development process (see the sketch after this list).
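>>> 
>>> As a sketch of the shape such an in-tree functional test could take
>>> (the service endpoint and resource here are hypothetical; the only
>>> assumption is a devstack-provided service listening locally):
>>> 
>>>     import unittest
>>> 
>>>     import requests
>>> 
>>>     SERVICE_URL = 'http://127.0.0.1:9777'  # made-up port
>>> 
>>>     class WidgetFunctionalTest(unittest.TestCase):
>>>         # Talks to a real service started by devstack, not to mocks.
>>>         def test_create_then_get(self):
>>>             created = requests.post(SERVICE_URL + '/v1/widgets',
>>>                                     json={'name': 'demo'}).json()
>>>             url = '%s/v1/widgets/%s' % (SERVICE_URL, created['id'])
>>>             fetched = requests.get(url).json()
>>>             self.assertEqual('demo', fetched['name'])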
>> 
>> We do this in Solum and I really like it. It's nice for the same
>> reviewers to see the functional tests and the code that implements a
>> feature.
>> 
>> One downside is that we have had failures due to Tempest reworking its
>> client code. This hasn't happened for a while, but it would be good
>> for Tempest to recognize that people are using it as a library and to
>> maintain a stable API.
> 
> To be clear, the functional tests will not be Tempest tests. This is a
> different class of testing; it's really another tox target that needs
> a devstack environment to run. A really good initial transition would
> be things like CLI testing, as sketched below.
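> 
> A sketch of what that tox target could look like in a project's
> tox.ini (the target name and test path are conventions, not mandated;
> it assumes the node already has a running devstack):
> 
>     [testenv:functional]
>     # These tests expect real services, not mocks, so a devstack
>     # environment must already be up on this machine.
>     setenv = OS_TEST_PATH=./myproject/tests/functional
>     commands = python setup.py testr --testr-args='{posargs}'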


I too love this idea. In addition to the current Tempest tests that are run against every patch, Swift has in-tree unit, functional[1], and probe[2] tests. This makes it quite easy to test locally before submitting patches and makes keeping test coverage high much easier too. I'm really happy to hear that this will be the future direction of testing in OpenStack.

[1] Functional tests treat the system as a black box and look at whole-system issues. For example, they make sure that reads and writes work, that large objects work, and that object versioning works. These tests are currently run in the CI gate.
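
For a concrete flavor, here is a minimal black-box check in the same spirit (a sketch assuming a stock SAIO setup with its default test credentials, not one of Swift's actual functional tests):

    # Write an object through the public API, read it back, and check
    # the bytes survived the round trip.
    from swiftclient import client

    conn = client.Connection(authurl='http://127.0.0.1:8080/auth/v1.0',
                             user='test:tester', key='testing')
    conn.put_container('sanity')
    conn.put_object('sanity', 'hello.txt', contents=b'hello world')
    headers, body = conn.get_object('sanity', 'hello.txt')
    assert body == b'hello world'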

[2] Probe tests treat the system as a white box and test specific interactions between components. For example, in Swift we use probe tests to ensure that replication functions properly when an object server is failed during a write, or that the auditors properly detect and correct file corruption. These tests aren't currently run in the CI gate, but they are run by reviewers before approving patches.
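
The white-box flavor is harder to show briefly, but a probe-style test is shaped roughly like this (a loose sketch, not Swift's actual probe harness; it assumes an SAIO-style setup where swift-init manages individually numbered servers):

    # Fail one object server during a write, bring it back, force a
    # replication pass, and make sure the cluster healed.
    import subprocess

    from swiftclient import client

    conn = client.Connection(authurl='http://127.0.0.1:8080/auth/v1.0',
                             user='test:tester', key='testing')
    conn.put_container('probe')

    subprocess.check_call(['swift-init', 'object-server.1', 'stop'])
    conn.put_object('probe', 'obj', contents=b'written while degraded')

    subprocess.check_call(['swift-init', 'object-server.1', 'start'])
    subprocess.check_call(['swift-init', 'object-replicator', 'once'])

    # A real probe test would consult the ring and inspect the restarted
    # server's disk directly; reading back is the simplest sanity check.
    headers, body = conn.get_object('probe', 'obj')
    assert body == b'written while degraded'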

--John



