[openstack-dev] Thoughts on the patch test failure rate and moving forward
Angus Salkeld
angus.salkeld at RACKSPACE.COM
Thu Jul 24 22:15:31 UTC 2014
On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
> OpenStack has a substantial CI system that is core to its development
> process. The goals of the system are to facilitate merging good code,
> prevent regressions, and ensure that there is at least one configuration
> of upstream OpenStack that we know works as a whole. The "project
> gating" technique that we use is effective at preventing many kinds of
> regressions from landing, however more subtle, non-deterministic bugs
> can still get through, and these are the bugs that are currently
> plaguing developers with seemingly random test failures.
>
> Most of these bugs are not failures of the test system; they are real
> bugs. Many of them have even been in OpenStack for a long time, but are
> only becoming visible now due to improvements in our tests. That's not
> much help to developers whose patches are being hit with negative test
> results from unrelated failures. We need to find a way to address the
> non-deterministic bugs that are lurking in OpenStack without making it
> easier for new bugs to creep in.
>
> The CI system and project infrastructure are not static. They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing now. The QA and Infrastructure teams recently hosted a
> sprint where we discussed some of these issues in depth. This post from
> Sean Dague goes into a bit of the background: [1]. The rest of this
> email outlines the medium and long-term changes we would like to make to
> address these problems.
>
> [1] https://dague.net/2014/07/22/openstack-failures/
>
> ==Things we're already doing==
>
> The elastic-recheck tool[2] is used to identify "random" failures in
> test runs. It tries to match failures to known bugs using signatures
> created from log messages. It helps developers prioritize bugs by how
> frequently they manifest as test failures. It also collects information
> on unclassified errors -- we can see how many (and which) test runs
> failed for an unknown reason and our overall progress on finding
> fingerprints for random failures.
>
> [2] http://status.openstack.org/elastic-recheck/
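For anyone not familiar with how the fingerprinting works, here is a rough
sketch of the idea (this is not the actual elastic-recheck code, which stores
its signatures as Elasticsearch queries; the bug numbers and patterns below
are made up):

    import re

    # Hypothetical signatures: bug number -> pattern matched against failure logs.
    # elastic-recheck itself keeps one Elasticsearch query per tracked bug.
    SIGNATURES = {
        1234567: re.compile(r"Timed out waiting for thing .* to become ACTIVE"),
        7654321: re.compile(r"SSHException: Error reading SSH protocol banner"),
    }

    def classify(log_text):
        """Return the bug numbers whose signatures match this failed run."""
        return [bug for bug, pattern in SIGNATURES.items()
                if pattern.search(log_text)]

    def triage(failed_runs):
        """Split failed runs into classified (known bugs) and unclassified."""
        hits, unclassified = {}, []
        for run_id, log_text in failed_runs:
            bugs = classify(log_text)
            if bugs:
                for bug in bugs:
                    hits.setdefault(bug, []).append(run_id)
            else:
                unclassified.append(run_id)
        return hits, unclassified

The unclassified bucket is what drives the "find fingerprints for random
failures" work mentioned above.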
>
> We added a feature to Zuul that lets us manually "promote" changes to
> the top of the Gate pipeline. When the QA team identifies a change that
> fixes a bug that is affecting overall gate stability, we can move that
> change to the top of the queue so that it may merge more quickly.
>
> We added the clean check facility in reaction to the January gate
> breakdown. While it does mean that any individual patch might see more
> tests run on it, it has largely kept the gate queue at a countable number
> of hours, instead of regularly growing to more than a work day in
> length. It also means that a developer can Approve a code merge before
> tests have returned without ruining it for everyone else if there turns
> out to be a bug that the tests would catch.
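As I understand it, the clean check rule is roughly: a change only enters the
gate if its latest check run passed recently; otherwise it goes back through
check first. A toy version of that decision (the window length and the data
shape are assumptions, not the real values):

    import datetime

    CHECK_RESULT_MAX_AGE = datetime.timedelta(hours=24)  # assumed window

    def may_enter_gate(change, now=None):
        """Clean check: only gate a change with a recent passing check result."""
        now = now or datetime.datetime.utcnow()
        result = change.get("latest_check")  # e.g. {"passed": True, "finished_at": datetime}
        if not result or not result["passed"]:
            return False  # no check result, or it failed: re-run check first
        return now - result["finished_at"] <= CHECK_RESULT_MAX_AGE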
>
> ==Future changes==
>
> ===Communication===
> We used to be better at communicating about the CI system. As it and
> the project grew, we incrementally added to our institutional knowledge,
> but we haven't been good about maintaining that information in a form
> that new or existing contributors can consume to understand what's going
> on and why.
>
> We have started on a major effort in that direction that we call the
> "infra-manual" project -- it's designed to be a comprehensive "user
> manual" for the project infrastructure, including the CI process. Even
> before that project is complete, we will write a document that
> summarizes the CI system and ensure it is included in new developer
> documentation and linked to from test results.
>
> There are also a number of ways for people to get involved in the CI
> system, whether focused on Infrastructure or QA, but it is not always
> clear how to do so. We will improve our documentation to highlight how
> to contribute.
>
> ===Fixing Faster===
>
> We introduce bugs to OpenStack at some constant rate, and they pile up
> over time. Our systems currently treat all changes as equally risky and
> important to the health of the system, which makes landing code changes
> to fix key bugs slow when we're at a high reset rate. We've got a manual
> process of promoting changes today to get around this, but that's
> actually quite costly in people's time, and requires getting all the right
> people together at once to promote changes. You can see a number of the
> changes we promoted during the gate storm in June [3], and it was no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
>
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>
> The basic idea is to use the data from elastic-recheck to identify that
> a patch is fixing a critical gate-related bug. When one of these is
> found in the queues it will be given higher priority, including bubbling
> up to the top of the gate queue automatically. The manual promote
> process should no longer be needed; instead, changes that fix
> elastic-recheck tracked issues will be promoted automatically.
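A toy sketch of what that automatic promotion could look like: take the set
of bug numbers elastic-recheck is currently tracking, and move any queued
change whose commit message claims to close one of them to the front of the
gate queue. The "Closes-Bug" footer convention is real; the data shapes here
are illustrative:

    import re

    CLOSES_BUG = re.compile(r"Closes-Bug:\s*#?(\d+)", re.IGNORECASE)

    def promote_gate_fixes(gate_queue, tracked_bugs):
        """Reorder the gate queue so changes fixing tracked gate bugs go first.

        gate_queue: list of dicts with a 'commit_message' key (illustrative).
        tracked_bugs: set of bug numbers elastic-recheck currently fingerprints.
        """
        def fixes_tracked_bug(change):
            bugs = {int(b) for b in CLOSES_BUG.findall(change["commit_message"])}
            return bool(bugs & tracked_bugs)

        fixes = [c for c in gate_queue if fixes_tracked_bug(c)]
        rest = [c for c in gate_queue if not fixes_tracked_bug(c)]
        return fixes + rest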
>
> At the same time we'll also promote review of fixes for critical gate
> bugs by making them visible in a number of different channels (like the
> elastic-recheck pages, review day, and the gerrit dashboards). The idea
> here again is to make the reviews that fix key bugs pop to the top of
> everyone's views.
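On the dashboard side, one way to surface such reviews is a Gerrit query for
open changes that claim to close a given tracked bug. A rough sketch against
the Gerrit REST API (the bug number is made up; stripping the )]}' prefix is
a Gerrit quirk):

    import json
    import requests

    def open_fixes_for_bug(bug_number, gerrit="https://review.openstack.org"):
        """List open changes whose commit message claims to close the bug."""
        query = 'status:open message:"Closes-Bug: #%d"' % bug_number
        resp = requests.get("%s/changes/" % gerrit, params={"q": query})
        resp.raise_for_status()
        # Gerrit prefixes JSON responses with )]}' to prevent XSSI; strip it.
        return json.loads(resp.text.lstrip(")]}'\n"))

    # e.g. open_fixes_for_bug(1234567)  # made-up bug number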
>
> ===Testing more tactically===
>
> One of the challenges that exists today is that we've got basically 2
> levels of testing in most of OpenStack: unit tests, and running a whole
> OpenStack cloud. Over time we've focused on adding more and more
> configurations and tests to the latter, but as we've seen, when things
> fail in a whole OpenStack cloud, getting to the root cause is often
> quite hard. So hard in fact that most people throw up their hands and
> just run 'recheck'. If a test run fails, and no one looks at why, does
> it provide any value?
>
> We need to get to a balance where we are testing that OpenStack works as
> a whole in some configuration, but as we've seen, even our best and
> brightest can't seem to make OpenStack reliably boot a compute that has
> working networking 100% of the time if we happen to be running more than
> one API request at once.
>
> Getting there is a multi-party process:
>
> * Reduce the gating configurations down to some gold standard
> configuration(s). This will be a small number of configurations that
> we all agree everything will gate on. This means things like
> postgresql, cells, different environments will all get dropped from
> the gate as we know it.
>
> * Put the burden for a bunch of these tests back on the projects as
> "functional" tests. Basically a custom devstack environment that a
> project can create with a set of services that they minimally need
> to do their job. These functional tests will live in the project
> tree, not in Tempest, so they can be landed atomically as part of the
> project's normal development process.
We do this in Solum and I really like it. It's nice for the same
reviewers to see the functional tests and the code that implements a
feature.
One downside is that we have had failures due to Tempest reworking their
client code. This hasn't happened for a while, but it would be good
for Tempest to recognize that people are using Tempest as a library
and to maintain a stable API.
-Angus
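To make the in-tree functional test idea above concrete: against a devstack
that runs only the services the project needs, such a test can be as simple
as hitting the service's own API. Everything below (the environment variable
names, the endpoint, the resource path) is hypothetical, not an existing
convention:

    import os
    import unittest

    import requests


    class TestStacksFunctional(unittest.TestCase):
        """Functional test living in the project tree, run against devstack.

        Assumes the job exports the service endpoint and a token; both
        names are made up for illustration.
        """

        def setUp(self):
            self.endpoint = os.environ.get("SERVICE_ENDPOINT",
                                           "http://127.0.0.1:8004/v1")
            self.headers = {"X-Auth-Token": os.environ.get("SERVICE_TOKEN", "")}

        def test_list_resources_returns_200(self):
            resp = requests.get(self.endpoint + "/stacks", headers=self.headers)
            self.assertEqual(200, resp.status_code)


    if __name__ == "__main__":
        unittest.main()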
>
> * For all non gold standard configurations, we'll dedicate a part of
> our infrastructure to running them in a continuous background loop,
> as well as making these configs available as experimental jobs. The
> idea here is that we'll actually be able to provide more
> configurations that are operating in a more traditional CI (post
> merge) context. People that are interested in keeping these bits
> functional can monitor those jobs and help with fixes when needed.
> The experimental jobs mean that if developers are concerned about
> the effect of a particular change on one of these configs, it's easy
> to request a pre-merge test run. In the near term we might imagine
> this would allow for things like ceph, mongodb, docker, and possibly
> very new libvirt to be validated in some way upstream.
>
> * Provide some kind of easy-to-view dashboards of these jobs, as well
> as a policy that if some job has been failing for more than some period
> of time, it's removed from the system. We want to provide whatever feedback
> we can to engaged parties, but people do need to realize that
> engagement is key. The biggest part of putting tests into OpenStack
> isn't landing the tests, but dealing with their failures.
>
> * Encourage projects to specifically land interface tests in other
> projects when they depend on certain behavior.
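On the removal policy a couple of bullets up, the check is essentially "has
this job passed at all within the last N days". A toy version, where the
threshold and the data shape are assumptions:

    import datetime

    FAILING_GRACE_PERIOD = datetime.timedelta(days=14)  # assumed threshold

    def jobs_to_remove(job_results, now=None):
        """Flag jobs whose most recent pass is older than the grace period.

        job_results: dict mapping job name -> list of (finished_at, passed).
        """
        now = now or datetime.datetime.utcnow()
        flagged = []
        for job, results in job_results.items():
            passes = [finished_at for finished_at, passed in results if passed]
            last_pass = max(passes) if passes else None
            if last_pass is None or now - last_pass > FAILING_GRACE_PERIOD:
                flagged.append(job)
        return flagged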
>
> Let's imagine an example of how this works in the real world.
>
> * The heat-slow job is deleted.
>
> * The Heat team creates a specific functional job which tests some of
> their deeper functionality; all the tests live in Heat, and
> because of this the tests can include white/grey-box testing of the
> DB and queues while things are progressing.
>
> * Nova lands a change which neither Tempest nor our configs exercise,
> but which breaks Heat.
>
> * The Heat project can now decide if it's more important to keep the
> test in place (preventing them from landing code), or to skip it to
> get back to work.
>
> * The Heat team then works on the right fix for Nova, or communicates
> with the Nova team on the issue at hand. The fix to Nova should *also*
> include tests which lock down that interface so that Nova
> won't break it again in the future (the Ironic team did this with
> their test_ironic_contract patch). These tests could be unit tests,
> if the behavior is testable that way, or functional tests in the Nova tree.
>
> * The Heat team then is back in business.
>
> This approach brings more control of when a project is blocked back into
> their own project. Tempest remains a final integration test to ensure
> that basics of the whole stack work together, but each project has a
> vertical testing stack which is specific to them as well.
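The interface tests in the walkthrough above are the interesting part: once
Heat knows exactly what it relies on from Nova, the fix can land together
with a test in the Nova tree that pins that behaviour down. A hypothetical
sketch in the spirit of test_ironic_contract (the helper and the expected
fields are made up for illustration):

    import unittest


    class TestServerViewContract(unittest.TestCase):
        """Hypothetical contract test in the Nova tree for behaviour a
        consumer like Heat relies on; fail loudly if the interface changes."""

        def build_server_view(self):
            # Stand-in for whatever internal API builds the external view.
            return {"id": "abc123", "status": "ACTIVE",
                    "addresses": {"private": [{"addr": "10.0.0.3"}]}}

        def test_server_view_keeps_fields_consumers_depend_on(self):
            server = self.build_server_view()
            for field in ("id", "status", "addresses"):
                self.assertIn(field, server)
            self.assertIn(server["status"], ("ACTIVE", "BUILD", "ERROR"))


    if __name__ == "__main__":
        unittest.main()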
>
> ==Final thoughts==
>
> The current rate of test failures and subsequent rechecks is not
> sustainable in the long term. It's not good for contributors,
> reviewers, or the overall project quality. While these bugs do need to
> be addressed, it's unlikely that the current process will cause that to
> happen. Instead, we want to push more substantial testing into the
> projects themselves with functional and interface testing, and depend
> less on devstack-gate integration tests to catch all bugs. This should
> help us catch bugs closer to the source and in an environment where
> debugging is easier. We also want to reduce the scope of devstack gate
> tests to a gold standard while running tests of other configurations in
> a traditional CI process so that people interested in those
> configurations can focus on ensuring they work.
>
> Thanks,
>
> Jim and Sean
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev