[openstack-dev] Thoughts on the patch test failure rate and moving forward
Matthew Treinish
mtreinish at kortar.org
Thu Jul 24 21:57:16 UTC 2014
On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> OpenStack has a substantial CI system that is core to its development
> process. The goals of the system are to facilitate merging good code,
> prevent regressions, and ensure that there is at least one configuration
> of upstream OpenStack that we know works as a whole. The "project
> gating" technique that we use is effective at preventing many kinds of
> regressions from landing; however, more subtle, non-deterministic bugs
> can still get through, and these are the bugs that are currently
> plaguing developers with seemingly random test failures.
>
> Most of these bugs are not failures of the test system; they are real
> bugs. Many of them have even been in OpenStack for a long time, but are
> only becoming visible now due to improvements in our tests. That's not
> much help to developers whose patches are being hit with negative test
> results from unrelated failures. We need to find a way to address the
> non-deterministic bugs that are lurking in OpenStack without making it
> easier for new bugs to creep in.
>
> The CI system and project infrastructure are not static. They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing now. The QA and Infrastructure teams recently hosted a
> sprint where we discussed some of these issues in depth. This post from
> Sean Dague goes into a bit of the background: [1]. The rest of this
> email outlines the medium and long-term changes we would like to make to
> address these problems.
>
> [1] https://dague.net/2014/07/22/openstack-failures/
>
> ==Things we're already doing==
>
> The elastic-recheck tool[2] is used to identify "random" failures in
> test runs. It tries to match failures to known bugs using signatures
> created from log messages. It helps developers prioritize bugs by how
> frequently they manifest as test failures. It also collects information
> on unclassified errors -- we can see how many (and which) test runs
> failed for an unknown reason and our overall progress on finding
> fingerprints for random failures.
>
> [2] http://status.openstack.org/elastic-recheck/
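For anyone who hasn't looked at elastic-recheck before, a "fingerprint" here
is conceptually just a bug number paired with a query that matches a
distinctive log message from a failed run. As a rough sketch of the idea
(this is not the actual elastic-recheck code, which runs its queries against
the indexed logs in elasticsearch; the bug numbers and messages below are
made up):

    import re

    # Hypothetical bug numbers mapped to distinctive log-message signatures.
    FINGERPRINTS = {
        '1234567': re.compile(r'Timed out waiting for .* to become ACTIVE'),
        '1234568': re.compile(r'Lock wait timeout exceeded'),
    }

    def classify(log_lines):
        """Return the known bugs whose signatures match this run's logs."""
        hits = set()
        for line in log_lines:
            for bug, pattern in FINGERPRINTS.items():
                if pattern.search(line):
                    hits.add(bug)
        return hits

Runs where nothing matches are the unclassified failures mentioned above, and
the classification rate is what tells us how well we're keeping up with
fingerprinting new failure modes.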
>
> We added a feature to Zuul that lets us manually "promote" changes to
> the top of the Gate pipeline. When the QA team identifies a change that
> fixes a bug that is affecting overall gate stability, we can move that
> change to the top of the queue so that it may merge more quickly.
>
> We added the clean check facility in reaction to the January gate
> breakdown. While it does mean that any individual patch might see more
> tests run on it, it has now largely kept the gate queue at a countable
> number of hours, instead of regularly growing to more than a work day in
> length. It also means that a developer can Approve a code merge before
> tests have returned without ruining it for everyone else if there turns
> out to be a bug that the tests could catch.
>
> ==Future changes==
>
> ===Communication===
> We used to be better at communicating about the CI system. As it and
> the project grew, we incrementally added to our institutional knowledge,
> but we haven't been good about maintaining that information in a form
> that new or existing contributors can consume to understand what's going
> on and why.
>
> We have started on a major effort in that direction that we call the
> "infra-manual" project -- it's designed to be a comprehensive "user
> manual" for the project infrastructure, including the CI process. Even
> before that project is complete, we will write a document that
> summarizes the CI system and ensure it is included in new developer
> documentation and linked to from test results.
>
> There are also a number of ways for people to get involved in the CI
> system, whether focused on Infrastructure or QA, but it is not always
> clear how to do so. We will improve our documentation to highlight how
> to contribute.
>
> ===Fixing Faster===
>
> We introduce bugs to OpenStack at some constant rate, and they pile up
> over time. Our systems currently treat all changes as equally risky and
> important to the health of the system, which makes landing code changes
> to fix key bugs slow when we're at a high reset rate. We've got a manual
> process of promoting changes today to get around this, but that's
> actually quite costly in people time, and takes getting all the right
> people together at once to promote changes. You can see a number of the
> changes we promoted during the gate storm in June [3], and it was no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
>
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>
> The basic idea is to use the data from elastic recheck to identify that
> a patch is fixing a critical gate related bug. When one of these is
> found in the queues it will be given higher priority, including bubbling
> up to the top of the gate queue automatically. The manual promote
> process should no longer be needed; instead, changes that fix
> elastic-recheck tracked bugs will be promoted automatically.
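To give a feel for what that automation might look like (this is only a
conceptual sketch of the proposal, not the actual Zuul implementation, and
the helper name is made up):

    import re

    # Changes advertise the bug they fix with the standard commit message
    # footer, e.g. "Closes-Bug: #1234567".
    BUG_RE = re.compile(r'Closes-Bug:\s*#?(\d+)', re.IGNORECASE)

    def is_gate_critical(commit_message, tracked_bugs):
        """Return True if the change claims to fix a tracked gate bug.

        tracked_bugs would come from the elastic-recheck data: the set of
        bug numbers currently matching gate failures.
        """
        return any(bug in tracked_bugs
                   for bug in BUG_RE.findall(commit_message))

A change for which this returns true would be enqueued at, or promoted to,
the head of the gate queue instead of joining the tail, which is exactly
what the manual promote process does today.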
>
> At the same time we'll also promote review on critical gate bugs by
> making them visible in a number of different channels (like on elastic
> recheck pages, review day, and in the gerrit dashboards). The idea here
> again is to make the reviews that fix key bugs pop to the top of
> everyone's views.
>
> ===Testing more tactically===
>
> One of the challenges that exists today is that we've got basically 2
> levels of testing in most of OpenStack: unit tests, and running a whole
> OpenStack cloud. Over time we've focused on adding more and more
> configurations and tests to the latter, but as we've seen, when things
> fail in a whole OpenStack cloud, getting to the root cause is often
> quite hard. So hard in fact that most people throw up their hands and
> just run 'recheck'. If a test run fails, and no one looks at why, does
> it provide any value?
>
> We need to get to a balance where we are testing that OpenStack works as
> a whole in some configuration, but as we've seen, even our best and
> brightest can't seem to make OpenStack reliably boot a compute that has
> working networking 100% of the time if we happen to be running more than one
> API request at once.
>
> Getting there is a multi-party process:
>
> * Reduce the gating configurations down to some gold standard
> configuration(s). This will be a small number of configurations that
> we all agree that everything will gate on. This means things like
> postgresql, cells, different environments will all get dropped from
> the gate as we know it.
>
> * Put the burden for a bunch of these tests back on the projects as
> "functional" tests. Basically a custom devstack environment that a
> project can create with a set of services that they minimally need
> to do their job. These functional tests will live in the project
> tree, not in Tempest, so they can be landed atomically as part of the
> project's normal development process.
>
> * For all non gold standard configurations, we'll dedicate a part of
> our infrastructure to running them in a continuous background loop,
> as well as making these configs available as experimental jobs. The
> idea here is that we'll actually be able to provide more
> configurations that are operating in a more traditional CI (post
> merge) context. People that are interested in keeping these bits
> functional can monitor those jobs and help with fixes when needed.
> The experimental jobs mean that if developers are concerned about
> the effect of a particular change on one of these configs, it's easy
> to request a pre-merge test run. In the near term we might imagine
> this would allow for things like ceph, mongodb, docker, and possibly
> very new libvirt to be validated in some way upstream.
>
> * Provide some kind of easy to view dashboards of these jobs, as well
> as a policy that if some job is failing for > some period of time,
> it's removed from the system. We want to provide whatever feedback
> we can to engaged parties, but people do need to realize that
> engagement is key. The biggest part of putting tests into OpenStack
> isn't landing the tests, but dealing with their failures.
>
> * Encourage projects to specifically land interface tests in other
> projects when they depend on certain behavior.
So I think we (or at least I do) need clarification around this item. My
question is: which interfaces are we depending on that need these specific
types of tests? Projects shouldn't be depending on another project's unstable
interfaces. If specific behavior is required for a cross-project interaction,
it should be part of a defined stable API, hopefully the REST API, and then
that behavior should be enforced for everyone, not just the cross-project
interaction.
If I'm interpreting this correctly, what is actually needed here is to ensure
that there is test coverage somewhere for the APIs that should already be
tested wherever there is a cross-project dependency. This is the same thing we
see all the time when there is a lack of test coverage on certain APIs that
are being used (the nova default quotas example comes to mind). I just think
calling this a special class of test is a bit misleading, since it shouldn't
actually differ from any other API test. Or am I missing something?
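To be concrete about the kind of coverage I mean for the default quotas case,
it really is just something along these lines (the endpoint and field names
here are from memory, so treat this as an illustration rather than an actual
tempest test):

    import os
    import requests

    NOVA = os.environ.get('NOVA_ENDPOINT', 'http://localhost:8774/v2/demo')
    TOKEN = os.environ.get('OS_TOKEN', '')

    def test_default_quota_set():
        resp = requests.get(NOVA + '/os-quota-sets/defaults',
                            headers={'X-Auth-Token': TOKEN})
        assert resp.status_code == 200
        quotas = resp.json()['quota_set']
        # Lock down the fields other projects rely on, not just that the
        # call returns 200.
        assert 'instances' in quotas
        assert quotas['instances'] >= 0

If a test like that lives somewhere that gates nova, the cross-project
dependency is covered as a side effect; there's nothing special about it
being cross-project.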
>
> Let's imagine an example of how this works in the real world.
>
> * The heat-slow job is deleted.
>
> * The heat team creates a specific functional job which tests some of
> their deeper functionality in Heat; all the tests live in Heat, and
> because of this the tests can include white/grey box testing of the
> DB and queues while things are progressing.
>
> * Nova lands a change which neither Tempest nor our configs exercise,
> but breaks Heat.
>
> * The Heat project can now decide if it's more important to keep the
> test in place (preventing them from landing code), or to skip it to
> get back to work.
>
> * The Heat team then works on the right fix for Nova, or communicates
> with the Nova team on the issue at hand. The fix to Nova *also*
> should include tests which lock down that interface so that Nova
> won't break them again in the future (the ironic team did this with
> their test_ironic_contract patch). These tests could be unit tests,
> if they are testable that way, or functional tests in the Nova tree.
The one thing I want to point out here is that the ironic_contract test should
be an exception; I don't feel that we want it to be the norm. It's not a good
example for a few reasons, mostly around the fact that the ironic tree depends
on the purposefully unstable nova driver API as a temporary measure until the
ironic driver is merged into the nova tree. The contract API tests will go
away once the driver is in the nova tree. They should not be necessary for
something exposed over the REST API, since that contract should be enforced
through tempest. (Even under this new model, I expect this to still be true.)
There was that comment someone (I can't remember who) brought up at the
Portland summit that tempest acts as double-entry bookkeeping for the API
contract, and that has been something we've seen as extremely valuable
historically, which is why I don't want to see this aspect of tempest's role
in the gate altered.
Although, all I think we actually need is an API definition for testing in an
external repo, just to prevent inadvertent changes (whether that gets used in
tempest or not). So another alternative I see here is something that I've
started to outline in [4] to address the potential for duplicated code and
effort in the new functional test suites. If all the project-specific
functional tests use clients from an external functional testing library repo,
then this concern goes away.
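To sketch the shape of what I mean there (everything below is hypothetical;
[4] is just a proposal and no such shared library exists yet):

    # The client definition lives exactly once, in the shared functional
    # testing library repo:
    class QuotasClient(object):

        def __init__(self, http):
            self.http = http

        def show_default_quota_set(self):
            # The request path and response schema encoded here are the API
            # definition we want to protect from inadvertent changes.
            return self.http.get('/os-quota-sets/defaults')

Nova's in-tree functional tests, other projects' functional tests, and tempest
(or whatever ends up doing the external audit) would all import that same
client rather than each carrying their own copy of the request format, so a
contract change can't quietly slip past just one suite.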
Now, if something like this example were to be exposed because of a coverage
gap, I think it's fair game to have a specific test in nova's functional test
suite. But I also think there should be an external audit of that API
somewhere too. Ideally, what I'd like to see is a write-once test graduation
procedure for moving appropriate things into tempest (or somewhere else) from
the project-specific functional tests, basically like what we discussed during
Maru's summit session on Neutron functional testing in Atlanta.
As for the other, more social, goal of this step, fostering communication
between the projects without using QA and/or Infra as a middle man, I fully
support it. I agree that we probably have too much proxying going on between
projects through QA and/or Infra instead of talking to each other directly.
>
> * The Heat team then is back in business.
>
> This approach brings more control of when a project is blocked back into
> their own project. Tempest remains a final integration test to ensure
> that basics of the whole stack work together, but each project has a
> vertical testing stack which is specific to them as well.
>
> ==Final thoughts==
>
> The current rate of test failures and subsequent rechecks is not
> sustainable in the long term. It's not good for contributors,
> reviewers, or the overall project quality. While these bugs do need to
> be addressed, it's unlikely that the current process will cause that to
> happen. Instead, we want to push more substantial testing into the
> projects themselves with functional and interface testing, and depend
> less on devstack-gate integration tests to catch all bugs. This should
> help us catch bugs closer to the source and in an environment where
> debugging is easier. We also want to reduce the scope of devstack gate
> tests to a gold standard while running tests of other configurations in
> a traditional CI process so that people interested in those
> configurations can focus on ensuring they work.
>
I apologize if this reply is a bit disorderly with too much raw thought. Having
been in Darmstadt and helped come up with this approach, I definitely agree
with it as a good future direction. There are just some smaller details that I
think need to be worked out before we all run off and start implementing.
Specifically, I feel like there is some hand-waving around what these
project-specific functional test suites entail and how they should co-exist
with tempest, which is something we didn't get a chance to address in detail
during the mid-cycle. But that's mostly a minor discussion compared to the
overall social and policy changes being proposed here. I'll probably spin it
off as a separate ML thread at some point so we can clarify those points
without distracting from the broader discussion.
-Matt Treinish
[4] https://review.openstack.org/#/c/108858