[openstack-dev] Thoughts on the patch test failure rate and moving forward
James E. Blair
corvus at inaugust.com
Wed Jul 23 21:39:47 UTC 2014
OpenStack has a substantial CI system that is core to its development
process. The goals of the system are to facilitate merging good code,
prevent regressions, and ensure that there is at least one configuration
of upstream OpenStack that we know works as a whole. The "project
gating" technique that we use is effective at preventing many kinds of
regressions from landing, however more subtle, non-deterministic bugs
can still get through, and these are the bugs that are currently
plaguing developers with seemingly random test failures.
Most of these bugs are not failures of the test system; they are real
bugs. Many of them have even been in OpenStack for a long time, but are
only becoming visible now due to improvements in our tests. That's not
much help to developers whose patches are being hit with negative test
results from unrelated failures. We need to find a way to address the
non-deterministic bugs that are lurking in OpenStack without making it
easier for new bugs to creep in.
The CI system and project infrastructure are not static. They have
evolved with the project to get to where they are today, and the
challenge now is to continue to evolve them to address the problems
we're seeing now. The QA and Infrastructure teams recently hosted a
sprint where we discussed some of these issues in depth; a recent post
from Sean Dague goes into a bit of the background. The rest of this
email outlines the medium- and long-term changes we would like to make
to address these problems.
==Things we're already doing==
The elastic-recheck tool is used to identify "random" failures in
test runs. It tries to match failures to known bugs using signatures
created from log messages. It helps developers prioritize bugs by how
frequently they manifest as test failures. It also collects information
on unclassified errors -- we can see how many (and which) test runs
failed for an unknown reason and our overall progress on finding
fingerprints for random failures.
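To make the mechanism concrete, here is a minimal sketch of the
fingerprint-matching idea. This is not the actual elastic-recheck
code: real signatures are ElasticSearch queries scoped to particular
log files, and the patterns and bug numbers below are invented for
illustration.

    import re

    # Invented fingerprints mapping a log-message pattern to a bug.
    FINGERPRINTS = {
        "bug/1000001": re.compile(r"Timed out waiting for a reply"),
        "bug/1000002": re.compile(r"Domain not found: no domain"),
    }

    def classify_failure(log_lines):
        """Return the known bugs whose fingerprints match this run."""
        hits = set()
        for line in log_lines:
            for bug, pattern in FINGERPRINTS.items():
                if pattern.search(line):
                    hits.add(bug)
        return hits or {"unclassified"}

    logs = ["ERROR nova.compute Timed out waiting for a reply to 9f2c"]
    print(classify_failure(logs))  # -> {'bug/1000001'}

Runs that come back "unclassified" are exactly the ones we track to
measure our progress on finding new fingerprints.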
We added a feature to Zuul that lets us manually "promote" changes to
the top of the Gate pipeline. When the QA team identifies a change that
fixes a bug that is affecting overall gate stability, we can move that
change to the top of the queue so that it may merge more quickly.
We added the clean check facility in reaction to the January gate
breakdown. While it does mean that any individual patch might see more
test runs, it has largely kept the gate queue to a countable number of
hours instead of regularly growing to more than a work day in length.
It also means that a developer can Approve a code merge before tests
have returned, without ruining it for everyone else if the tests turn
out to catch a bug.
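The rule itself is easy to state; here is a rough sketch of the
policy (the field names are illustrative, not Zuul's actual data
model):

    # "Clean check": an approved change may only enter the gate if
    # its current patchset already has a passing check result.
    def can_enter_gate(change):
        return change["approved"] and change["check_passed"]

    change = {"id": 12345, "approved": True, "check_passed": False}
    print(can_enter_gate(change))  # False: wait for a clean check

Because the gate refuses changes without a passing check on the
current patchset, an early Approve simply waits rather than enqueuing
a possibly broken change in front of everyone else's.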
We used to be better at communicating about the CI system. As it and
the project grew, we incrementally added to our institutional knowledge,
but we haven't been good about maintaining that information in a form
that new or existing contributors can consume to understand what's going
on and why.
We have started on a major effort in that direction that we call the
"infra-manual" project -- it's designed to be a comprehensive "user
manual" for the project infrastructure, including the CI process. Even
before that project is complete, we will write a document that
summarizes the CI system and ensure it is included in new developer
documentation and linked to from test results.
There are also a number of ways for people to get involved in the CI
system, whether focused on Infrastructure or QA, but it is not always
clear how to do so. We will improve our documentation to highlight how
to get involved.
==Things we're planning on doing==
===Fixing faster===
We introduce bugs to OpenStack at some roughly constant rate, and they
pile up over time. Our systems currently treat all changes as equally
risky and equally important to the health of the system, which makes
landing code changes that fix key bugs slow when we're at a high reset
rate. We have a manual process for promoting changes today to get
around this, but it is quite costly in people time and requires
getting all the right people together at once. You can see a number of
the changes we promoted during the gate storm in June, and it was no
small number of fixes to get us back to a reasonably passing gate. We
think that optimizing this system will help us land fixes to critical
bugs faster.
The basic idea is to use the data from elastic-recheck to identify
that a patch fixes a critical gate-related bug. When one of these is
found in the queues it will be given higher priority, including
bubbling up to the top of the gate queue automatically. The manual
promote process should no longer be needed; instead, fixes for
elastic-recheck-tracked issues will be promoted automatically.
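A sketch of the intended behavior (names are invented; this is not
Zuul's implementation): changes that claim to fix a tracked gate bug
are stably sorted to the front of the queue.

    # Bug numbers elastic-recheck currently tracks (illustrative).
    CRITICAL_GATE_BUGS = {"1000001", "1000002"}

    def fixes_critical_bug(change):
        """True if the change claims to close a tracked gate bug."""
        return bool(CRITICAL_GATE_BUGS & change["closes_bugs"])

    def prioritize(queue):
        # Stable sort: critical fixes first, original order otherwise.
        return sorted(queue, key=lambda c: not fixes_critical_bug(c))

    queue = [
        {"id": 101, "closes_bugs": set()},
        {"id": 102, "closes_bugs": {"1000001"}},
    ]
    print([c["id"] for c in prioritize(queue)])  # [102, 101]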
At the same time we'll also promote review of critical gate bugs by
making them visible in a number of different channels (like on
elastic-recheck pages, review day, and in the Gerrit dashboards). The
idea here, again, is to make the reviews that fix key bugs pop to the
top of everyone's review queues.
===Testing more tactically===
One of the challenges that exists today is that we've got basically
two levels of testing in most of OpenStack: unit tests, and running a whole
OpenStack cloud. Over time we've focused on adding more and more
configurations and tests to the latter, but as we've seen, when things
fail in a whole OpenStack cloud, getting to the root cause is often
quite hard. So hard in fact that most people throw up their hands and
just run 'recheck'. If a test run fails, and no one looks at why, does
it provide any value?
We need to get to a balance where we are testing that OpenStack works
as a whole in some configuration, but as we've seen, even our best and
brightest can't seem to make OpenStack reliably boot a compute
instance with working networking 100% of the time if we happen to be
running more than one API request at once.
Getting there is a multi-part process:
* Reduce the gating configurations down to some gold standard
configuration(s). This will be a small number of configurations that
we all agree that everything will gate on. This means that things like
postgresql, cells, and different environments will all be dropped from
the gate as we know it.
* Put the burden for a bunch of these tests back on the projects as
"functional" tests: basically a custom devstack environment that a
project can create with the minimal set of services it needs to do its
job. These functional tests will live in the project tree, not in
Tempest, so they can be landed atomically as part of the project's
normal development process (a sketch of what such a test might look
like follows this list).
* For all non-gold-standard configurations, we'll dedicate a part of
our infrastructure to running them in a continuous background loop,
as well as making these configs available as experimental jobs. The
idea here is that we'll actually be able to provide more
configurations that operate in a more traditional CI (post-merge)
context. People who are interested in keeping these bits
functional can monitor those jobs and help with fixes when needed.
The experimental jobs mean that if developers are concerned about
the effect of a particular change on one of these configs, it's easy
to request a pre-merge test run. In the near term we might imagine
this would allow for things like ceph, mongodb, docker, and possibly
very new libvirt to be validated in some way upstream.
* Provide some kind of easy-to-view dashboards for these jobs, as well
as a policy that if a job is failing for more than some period of
time, it's removed from the system. We want to provide whatever feedback
we can to engaged parties, but people do need to realize that
engagement is key. The biggest part of putting tests into OpenStack
isn't landing the tests, but dealing with their failures.
* Encourage projects to specifically land interface tests in other
projects when they depend on certain behavior.
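To give a feel for what a project-tree functional test might look
like, here is a hedged sketch. The endpoint, port, and test names are
invented, authentication is omitted, and a real job would use the
project's own clients and fixtures against its custom devstack
environment.

    import json
    import unittest
    import urllib.request

    class TestStacksAPI(unittest.TestCase):
        """Functional test run against a devstack that has only the
        services this project minimally needs (names illustrative)."""

        ENDPOINT = "http://127.0.0.1:8004/v1/stacks"  # hypothetical

        def test_list_stacks_returns_json(self):
            with urllib.request.urlopen(self.ENDPOINT) as resp:
                self.assertEqual(200, resp.status)
                body = json.loads(resp.read().decode("utf-8"))
                self.assertIn("stacks", body)

    if __name__ == "__main__":
        unittest.main()

Because the test lives in the project tree, it can land atomically
with the code it exercises, and it is free to poke at the DB and
queues in ways Tempest deliberately does not.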
Let's imagine an example of how this works in the real world.
* The heat-slow job is deleted.
* The heat team creates a specific functional job which tests some of
their deeper functionality in Heat. All the tests live in Heat, and
because of this the tests can include white/grey-box testing of the
DB and queues while things are progressing.
* Nova lands a change which neither Tempest nor our configs exercise,
but which breaks Heat.
* The Heat project can now decide if it's more important to keep the
test in place (preventing them from landing code), or to skip it to
get back to work.
* The Heat team then works on the right fix for Nova, or communicates
with the Nova team on the issue at hand. The fix to Nova should *also*
include tests which lock down that interface so that Nova won't break
it again in the future (the ironic team did this with their
test_ironic_contract patch; a sketch of that style of test follows
this list). These tests could be unit tests, if the behavior is
testable that way, or functional tests in the Nova tree.
* The Heat team is then back in business.
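Here is a sketch of what such a contract test can look like; the
class and method names are entirely hypothetical, and the real
test_ironic_contract patch covered the interface ironic consumes from
Nova rather than this invented one.

    import inspect
    import unittest

    class SchedulerHints:
        """Stand-in for a Nova-side interface that another project
        depends on (hypothetical)."""

        def select_host(self, request, candidates):
            raise NotImplementedError

    class TestSchedulerHintsContract(unittest.TestCase):
        """Lock down the interface so an incompatible change fails
        here, in Nova's own tree, instead of breaking Heat's gate."""

        def test_select_host_signature_is_stable(self):
            sig = inspect.signature(SchedulerHints.select_host)
            self.assertEqual(
                ["self", "request", "candidates"],
                list(sig.parameters),
            )

    if __name__ == "__main__":
        unittest.main()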
This approach brings more control of when a project is blocked back into
their own project. Tempest remains a final integration test to ensure
that basics of the whole stack work together, but each project has a
vertical testing stack which is specific to them as well.
The current rate of test failures and subsequent rechecks is not
sustainable in the long term. It's not good for contributors,
reviewers, or overall project quality. While these bugs do need to
be addressed, it's unlikely that the current process will cause that to
happen. Instead, we want to push more substantial testing into the
projects themselves with functional and interface testing, and depend
less on devstack-gate integration tests to catch all bugs. This should
help us catch bugs closer to the source and in an environment where
debugging is easier. We also want to reduce the scope of devstack-gate
tests to a gold standard while running tests of other configurations in
a traditional CI process so that people interested in those
configurations can focus on ensuring they work.
Jim and Sean