[openstack-dev] Thoughts on the patch test failure rate and moving forward
Matthew Treinish
mtreinish at kortar.org
Fri Jul 25 20:20:37 UTC 2014
On Thu, Jul 24, 2014 at 06:54:38PM -0400, Sean Dague wrote:
> On 07/24/2014 05:57 PM, Matthew Treinish wrote:
> > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> >> OpenStack has a substantial CI system that is core to its development
> >> process. The goals of the system are to facilitate merging good code,
> >> prevent regressions, and ensure that there is at least one configuration
> >> of upstream OpenStack that we know works as a whole. The "project
> >> gating" technique that we use is effective at preventing many kinds of
> >> regressions from landing, however more subtle, non-deterministic bugs
> >> can still get through, and these are the bugs that are currently
> >> plaguing developers with seemingly random test failures.
> >>
> >> Most of these bugs are not failures of the test system; they are real
> >> bugs. Many of them have even been in OpenStack for a long time, but are
> >> only becoming visible now due to improvements in our tests. That's not
> >> much help to developers whose patches are being hit with negative test
> >> results from unrelated failures. We need to find a way to address the
> >> non-deterministic bugs that are lurking in OpenStack without making it
> >> easier for new bugs to creep in.
> >>
> >> The CI system and project infrastructure are not static. They have
> >> evolved with the project to get to where they are today, and the
> >> challenge now is to continue to evolve them to address the problems
> >> we're seeing now. The QA and Infrastructure teams recently hosted a
> >> sprint where we discussed some of these issues in depth. This post from
> >> Sean Dague goes into a bit of the background: [1]. The rest of this
> >> email outlines the medium and long-term changes we would like to make to
> >> address these problems.
> >>
> >> [1] https://dague.net/2014/07/22/openstack-failures/
> >>
> >> ==Things we're already doing==
> >>
> >> The elastic-recheck tool[2] is used to identify "random" failures in
> >> test runs. It tries to match failures to known bugs using signatures
> >> created from log messages. It helps developers prioritize bugs by how
> >> frequently they manifest as test failures. It also collects information
> >> on unclassified errors -- we can see how many (and which) test runs
> >> failed for an unknown reason and our overall progress on finding
> >> fingerprints for random failures.
> >>
> >> [2] http://status.openstack.org/elastic-recheck/
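To make the fingerprinting idea a bit more concrete, here is a stripped-down
sketch of what the matching amounts to. This is not the actual elastic-recheck
code (which runs Elasticsearch queries against indexed logs); the bug numbers
and patterns below are invented.

    # Illustrative only: a "signature" here is just a regex run over the
    # console log of a failed job; real signatures are Elasticsearch queries.
    import re

    SIGNATURES = {
        "1234567": re.compile(r"Timed out waiting for .* to become ACTIVE"),
        "7654321": re.compile(r"MessagingTimeout: Timed out waiting for a reply"),
    }

    def classify_failure(console_log):
        """Return the known bugs whose signature matches this log."""
        return [bug for bug, pattern in SIGNATURES.items()
                if pattern.search(console_log)]

    sample_log = "... Timed out waiting for server 42 to become ACTIVE ..."
    matches = classify_failure(sample_log)
    print(matches or "unclassified failure -- needs a new fingerprint")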
> >>
> >> We added a feature to Zuul that lets us manually "promote" changes to
> >> the top of the Gate pipeline. When the QA team identifies a change that
> >> fixes a bug that is affecting overall gate stability, we can move that
> >> change to the top of the queue so that it may merge more quickly.
> >>
> >> We added the clean check facility in reaction to the January gate breakdown.
> >> While it does mean that any individual patch might see more tests run on it,
> >> it has now largely kept the gate queue to a countable number of hours, instead
> >> of regularly growing to more than a work day in length. It also means that a
> >> developer can Approve a code merge before tests have returned without ruining
> >> it for everyone else if there turns out to be a bug that the tests can catch.
> >>
> >> ==Future changes==
> >>
> >> ===Communication===
> >> We used to be better at communicating about the CI system. As it and
> >> the project grew, we incrementally added to our institutional knowledge,
> >> but we haven't been good about maintaining that information in a form
> >> that new or existing contributors can consume to understand what's going
> >> on and why.
> >>
> >> We have started on a major effort in that direction that we call the
> >> "infra-manual" project -- it's designed to be a comprehensive "user
> >> manual" for the project infrastructure, including the CI process. Even
> >> before that project is complete, we will write a document that
> >> summarizes the CI system and ensure it is included in new developer
> >> documentation and linked to from test results.
> >>
> >> There are also a number of ways for people to get involved in the CI
> >> system, whether focused on Infrastructure or QA, but it is not always
> >> clear how to do so. We will improve our documentation to highlight how
> >> to contribute.
> >>
> >> ===Fixing Faster===
> >>
> >> We introduce bugs to OpenStack at some constant rate, and they pile up
> >> over time. Our systems currently treat all changes as equally risky and
> >> important to the health of the system, which makes landing code changes
> >> to fix key bugs slow when we're at a high reset rate. We have a manual
> >> process of promoting changes today to get around this, but it is quite
> >> costly in people's time and requires getting all the right people together
> >> at once to promote changes. You can see a number of the changes we promoted
> >> during the gate storm in June [3]; it was no small number of fixes to get us
> >> back to a reasonably passing gate. We think that optimizing this system will
> >> help us land fixes to critical bugs faster.
> >>
> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> >>
> >> The basic idea is to use the data from elastic-recheck to identify that a
> >> patch fixes a critical gate-related bug. When one of these is found in the
> >> queues it will be given higher priority, including bubbling up to the top
> >> of the gate queue automatically. The manual promote process should no longer
> >> be needed; instead, fixes for elastic-recheck-tracked issues will be promoted
> >> automatically.
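As a rough illustration of that promotion rule (this is not Zuul or
elastic-recheck code; the bug numbers and data structures are made up), the
logic amounts to something like:

    # Sketch: bump changes that claim to close a tracked critical gate bug.
    import re

    CRITICAL_GATE_BUGS = {"1234567", "7654321"}   # hypothetical bug numbers

    CLOSES_BUG = re.compile(r"^Closes-Bug:\s*#?(\d+)", re.MULTILINE | re.IGNORECASE)

    def is_gate_fix(commit_message):
        """True if the change claims to close a tracked critical gate bug."""
        return bool(set(CLOSES_BUG.findall(commit_message)) & CRITICAL_GATE_BUGS)

    def order_queue(changes):
        """Move gate fixes to the front; everything else keeps its order."""
        return sorted(changes, key=lambda c: 0 if is_gate_fix(c["message"]) else 1)

    queue = [
        {"id": "I111", "message": "Add a feature"},
        {"id": "I222", "message": "Fix ssh timeout race\n\nCloses-Bug: #1234567"},
    ]
    print([c["id"] for c in order_queue(queue)])   # the gate fix sorts first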
> >>
> >> At the same time we'll also promote review of critical gate bugs by making
> >> them visible in a number of different channels (like on elastic-recheck
> >> pages, review day, and in the gerrit dashboards). The idea here again is to
> >> make the reviews that fix key bugs pop to the top of everyone's views.
> >>
> >> ===Testing more tactically===
> >>
> >> One of the challenges that exists today is that we've got basically 2
> >> levels of testing in most of OpenStack: unit tests, and running a whole
> >> OpenStack cloud. Over time we've focused on adding more and more
> >> configurations and tests to the latter, but as we've seen, when things
> >> fail in a whole OpenStack cloud, getting to the root cause is often
> >> quite hard. So hard in fact that most people throw up their hands and
> >> just run 'recheck'. If a test run fails, and no one looks at why, does
> >> it provide any value?
> >>
> >> We need to get to a balance where we are testing that OpenStack works as
> >> a whole in some configuration, but as we've seen, even our best and
> >> brightest can't seem to make OpenStack reliably boot a compute that has
> >> working networking 100% of the time if we happen to be running more than 1
> >> API request at once.
> >>
> >> Getting there is a multi-party process:
> >>
> >> * Reduce the gating configurations down to some gold standard
> >> configuration(s). This will be a small number of configurations that
> >> we all agree that everything will gate on. This means things like
> >> postgresql, cells, and different environments will all get dropped from
> >> the gate as we know it.
> >>
> >> * Put the burden for a bunch of these tests back on the projects as
> >> "functional" tests. Basically a custom devstack environment that a
> >> project can create with the set of services that it minimally needs
> >> to do its job. These functional tests will live in the project
> >> tree, not in Tempest, so they can be landed atomically as part of the
> >> project's normal development process (a rough sketch of such a test
> >> follows this list).
> >>
> >> * For all non gold standard configurations, we'll dedicate a part of
> >> our infrastructure to running them in a continuous background loop,
> >> as well as making these configs available as experimental jobs. The
> >> idea here is that we'll actually be able to provide more
> >> configurations that are operating in a more traditional CI (post-merge)
> >> context. People who are interested in keeping these bits
> >> functional can monitor those jobs and help with fixes when needed.
> >> The experimental jobs mean that if developers are concerned about
> >> the effect of a particular change on one of these configs, it's easy
> >> to request a pre-merge test run. In the near term we might imagine
> >> this would allow for things like ceph, mongodb, docker, and possibly
> >> very new libvirt to be validated in some way upstream.
> >>
> >> * Provide some kind of easy to view dashboards of these jobs, as well
> >> as a policy that if some job is failing for > some period of time,
> >> it's removed from the system. We want to provide whatever feedback
> >> we can to engaged parties, but people do need to realize that
> >> engagement is key. The biggest part of putting tests into OpenStack
> >> isn't landing the tests, but dealing with their failures.
> >>
> >> * Encourage projects to specifically land interface tests in other
> >> projects when they depend on certain behavior.
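To make the in-tree functional test idea above a bit more concrete, here is a
rough sketch of what one might look like. Everything in it is hypothetical:
the endpoint, the token handling, and the response shape are placeholders, and
a real test would run against whatever minimal devstack services the project
actually deploys.

    # Hypothetical in-tree functional test: assumes a small devstack-style
    # deployment of just the services the project needs, reachable over HTTP.
    import os
    import unittest

    import requests

    API_ENDPOINT = os.environ.get("TEST_API_ENDPOINT", "http://127.0.0.1:8774/v2")
    TOKEN = os.environ.get("TEST_TOKEN", "dummy-token")


    class ServersFunctionalTest(unittest.TestCase):
        # Because this runs next to the deployed service, white/grey-box
        # checks against the DB or queues could live here as well.

        def test_list_servers_returns_expected_shape(self):
            resp = requests.get(API_ENDPOINT + "/servers",
                                headers={"X-Auth-Token": TOKEN})
            self.assertEqual(200, resp.status_code)
            body = resp.json()
            # Lock down the piece of the response this project relies on.
            self.assertIn("servers", body)
            self.assertIsInstance(body["servers"], list)


    if __name__ == "__main__":
        unittest.main()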
> >
> > So I think we (or at least I do) need clarification around this item. My question
> > is which interfaces are we depending on that need these specific types of
> > tests? Projects shouldn't be depending on another project's unstable interfaces.
> > If specific behavior is required for a cross-project interaction it should be
> > part of defined stable API, hopefully the REST API, and then that behavior
> > should be enforced for everyone not just the cross-project interaction.
> >
> > If I'm interpreting this correctly, what is actually needed here is to ensure
> > that there is test coverage somewhere for the APIs that should already be
> > tested where there is a cross-project dependency. This is the same thing we
> > see all the time because there is a lack of test coverage on certain APIs that
> > are being used (the nova default quotas example comes to mind). I just think
> > calling this a special class of test is a bit misleading, since it shouldn't
> > actually differ from any other API test. Or am I missing something?
>
> Projects are consuming the behavior of other projects far beyond just
> the formal REST APIs. Notifications are another great instance of that.
Yeah, ok, besides the Ironic example and notifications are there any others?
Because it honestly seems to me like we're probably bikeshedding on a
special-case thing that shouldn't come up very often. I especially don't think
it's something we want to be encouraging. I think Nova and Neutron did it
correctly when they added an events REST API instead of doing some sort of
notification or other mechanism under the covers.
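For anyone who hasn't looked at that interaction: instead of Nova listening for
an unversioned notification, Neutron calls Nova's os-server-external-events
REST API when a port is wired up. Roughly (the payload shape here is
approximate, and the URL, token, and UUIDs are placeholders):

    # Approximate sketch of Neutron notifying Nova that a VIF is plugged,
    # via a versioned REST API rather than an unversioned notification.
    import json

    import requests

    NOVA_URL = "http://nova.example.com:8774/v2/TENANT_ID"   # placeholder
    TOKEN = "SERVICE_TOKEN"                                   # placeholder

    event = {
        "events": [{
            "name": "network-vif-plugged",
            "server_uuid": "00000000-0000-0000-0000-000000000000",
            "tag": "PORT_ID",            # placeholder port id
            "status": "completed",
        }]
    }

    resp = requests.post(NOVA_URL + "/os-server-external-events",
                         headers={"X-Auth-Token": TOKEN,
                                  "Content-Type": "application/json"},
                         data=json.dumps(event))
    print(resp.status_code)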
>
> This is also more of a pragmatic organic approach to figuring out the
> interfaces we need to lock down. When one projects breaks depending on
> an interface in another project, that should trigger this kind of
> contract growth, which hopefully formally turns into a document later
> for a stable interface.
So notifications are a good example of this, but I think how we handled them is
also an example of what not to do. The order was backwards: there should have
been a stability guarantee up front, with a versioning mechanism on
notifications, when another project started relying on them. The fact that
there are at least 2 ML threads on how to fix and test this at this point in
ceilometer's life seems like a poor way to handle it. I don't want to see us
repeat this by allowing cross-project interactions to depend on unstable
interfaces. Maybe having contract tests would force the growth to be more
orderly so we don't have these issues in the future, I'm not sure. But this is
a separate policy discussion, and probably a TC-level thing to flesh out.
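Just to illustrate the kind of up-front guarantee I mean (the field names here
are invented; this isn't any project's actual notification format):

    # Sketch: a notification payload with an explicit, documented version
    # that a consumer like ceilometer could rely on.
    NOTIFICATION_VERSION = "1.1"

    def make_notification(event_type, payload):
        """Wrap a payload with a version the emitter promises to honor."""
        return {
            "event_type": event_type,
            "payload_version": NOTIFICATION_VERSION,
            "payload": payload,
        }

    def consume(notification):
        """Consumers check the major version instead of guessing at fields."""
        major = notification["payload_version"].split(".")[0]
        if major != "1":
            raise ValueError("unsupported notification version: %s"
                             % notification["payload_version"])
        return notification["payload"]

    msg = make_notification("compute.instance.create.end", {"instance_id": "abc"})
    print(consume(msg))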
>
> >> Let's imagine an example of how this works in the real world.
> >>
> >> * The heat-slow job is deleted.
> >>
> >> * The heat team creates a specific functional job which tests some of
> >> their deeper functionality in Heat; all the tests live in Heat, and
> >> because of this the tests can include white/grey-box testing of the
> >> DB and queues while things are progressing.
> >>
> >> * Nova lands a change which neither Tempest nor our configs exercise,
> >> but breaks Heat.
> >>
> >> * The Heat project can now decide if it's more important to keep the
> >> test in place (preventing them from landing code), or to skip it to
> >> get back to work.
> >>
> >> * The Heat team then works on the right fix for Nova, or communicates
> >> with the Nova team on the issue at hand. The fix to Nova *also*
> >> should include tests which lock down that interface so that Nova
> >> won't break them again in the future (the ironic team did this with
> >> their test_ironic_contract patch). These tests could be unit tests,
> >> if they are testable that way, or functional tests in the Nova tree.
> >
> > The one thing I want to point out here is that the ironic_contract test should
> > be an exception; I don't feel that we want it to be the norm. It's not a good
> > example for a few reasons, mostly around the fact that the ironic tree depends
> > on the purposefully unstable nova driver api as a temporary measure until the
> > ironic driver is merged into the nova tree. The contract api tests will go away
> > once the driver is in the nova tree. It should not be necessary for something
> > over the REST API, since the contract should be enforced through tempest. (even
> > under this new model, I expect this to still be true)
> >
> > There was that comment which someone (I can't remember who) brought up at the
> > Portland summit that tempest is acting like double book accounting for the api
> > contract, and that has been something we've seen as extremely valuable
> > historically, which is why I don't want to see this aspect of tempest's role in
> > the gate altered.
>
> I've been the holder of the double book accounting pov in the past.
> However, after the last six months of fragility, I just don't see how
> that's a sustainable point of view. The QA team remains a somewhat constant
> size, and the number of interfaces and projects grows at a good clip.
I agree that there is a scaling issue; our variable testing quality and coverage
across all the projects in the tempest tree is proof enough of this. I just
don't want to see us lose the protection we have against inadvertent changes.
Having the friction of something like the tempest two-step is important; we've
blocked a lot of breaking api changes because of it.
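The kind of protection I mean is nothing fancy; roughly, a check like the
following living outside the project's own tree, so that changing an API
response and the test that guards it takes two repos (the field names here are
invented):

    # Sketch of an out-of-tree response-shape check that trips when a field
    # consumers rely on is dropped or renamed.
    EXPECTED_KEYS = {"id", "name", "status", "created_at"}   # hypothetical contract

    def check_resource_contract(resource):
        missing = EXPECTED_KEYS - set(resource)
        if missing:
            raise AssertionError("API contract broken, missing keys: %s"
                                 % sorted(missing))

    # A patch that silently renamed 'created_at' would fail this check.
    check_resource_contract({"id": "1", "name": "x", "status": "ACTIVE",
                             "created_at": "2014-07-25T00:00:00Z"})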
The other thing to consider is that when we adopted branchless tempest, part of
the goal was to ensure consistency across release boundaries. If we're really
advocating dropping most of the API coverage out of tempest, part of the story
needs to be around how we prevent things from slipping between release
boundaries too.
>
> > Although, I think all we actually need is an api definition for testing in an
> > external repo, just to prevent inadvertent changes. (whether that gets used in
> > tempest or not) So another alternative I see here is something that I've started
> > to outline in [4] to address the potential for code duplication and effort in
> > the new functional test suites. If all the project specific functional tests are
> > using clients from an external functional testing library repo then this concern
> > goes away.
>
> Actually, I don't think these would be using external clients. This is
> in-tree testing.
>
> This will definitely be an experiment to get the API testing closer to
> the source. That being said, Swift really has done this fine for a long
> time, and I think we need to revisit the premise that projects can't be
> trusted.
My idea was just to have a library that exposes what is essentially Tempest's
current rest client. The tests would still live in the project trees; to
prevent duplication of effort, part of the QA program would maintain a common
functional test library. I was just suggesting that if we moved the tests that
we expect to ensure api stability into project-specific functional test suites,
we could maintain the protection easily by limiting those tests to the
library's service clients. Honestly, I don't think this is the right approach
either, it was just an alternative I thought of.
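Roughly what I had in mind (all the names here are hypothetical, not an
existing library):

    # Sketch of a QA-maintained client library (think Tempest's rest client,
    # split out) that project functional tests import, so the request/response
    # surface stays pinned in one shared place.
    import json

    import requests


    class ServiceClient(object):
        """Minimal shared REST client a project functional test would reuse."""

        def __init__(self, endpoint, token):
            self.endpoint = endpoint.rstrip("/")
            self.token = token

        def get(self, path):
            resp = requests.get(self.endpoint + path,
                                headers={"X-Auth-Token": self.token})
            return resp.status_code, resp.json()

        def post(self, path, body):
            resp = requests.post(self.endpoint + path,
                                 headers={"X-Auth-Token": self.token,
                                          "Content-Type": "application/json"},
                                 data=json.dumps(body))
            return resp.status_code, resp.json()

    # A project's in-tree functional test would then do something like:
    #   client = ServiceClient("http://127.0.0.1:8774/v2", token)
    #   status, body = client.get("/servers")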
But, having worked on this stuff for ~2 years, I can say from personal
experience that every project slips when it comes to API stability, despite the
best intentions, unless there is test coverage for it. I don't want to see us
open the floodgates on this just because we've gotten ourselves into a bad
situation with the state of the gate.
>
> > Now, if something like this example were to be exposed because of a coverage
> > gap I think it's fair game to have a specific test in nova's functional test
> > suite. But, I also think there should be an external audit of that API somewhere
> > too. Ideally, what I'd like to see is a write-once test graduation procedure
> > for moving appropriate things into tempest (or somewhere else) from the
> > project-specific functional tests. Basically like what we
> > discussed during Maru's summit session on Neutron functional testing in Atlanta.
>
> Right, and I think basically we shouldn't graduate most of those tests.
> They are neutron tests, in the neutron tree. A few key ones we decide
> should be run outside that context.
Yes, I agree there should be strict criteria and extra scrutiny for doing this,
but I still want to have an outlined procedure (in either direction) before we
do it. But we don't necessarily have to figure that out right away. We can grow
the functional test suites first; I don't think having some duplication is that
big a deal.
>
> > The other, more social, goal of this step, fostering communication between the
> > projects rather than using QA and/or Infra as a middleman, I fully support. I
> > agree that we probably have too much proxying going on between projects using
> > QA and/or infra instead of talking directly.
>
> Our current model leans far too much on the idea that the only time we
> ever try to test things for real is when we throw all 1 million lines of
> source code into one pot and stir. It really shouldn't be surprising how
> many bugs shake out there. And this is the wrong layer to debug from, so
> I firmly believe we need to change this back to something we can
> actually manage to shake the bugs out with. Because right now we're
> finding them, but our infrastructure isn't optimized for fixing them,
> and we need to change that.
>
I agree a layered approach is best; I'm not disagreeing on that point. I'm just
not sure how much we really should be decreasing the scope of Tempest, as the
top layer, around the api tests. I don't think we should decrease it too much
just because we're beefing up the middle with improved functional testing. In
my view, having some duplication between the layers is actually fine and
desirable.
Anyway, I feel like I'm diverging this thread off into a different area, so
I'll shoot off a separate thread on the topic of the scale and scope of Tempest
and the new in-tree project-specific functional tests. But to summarize, what I
think we should be clear about at the high level for this thread is that, for
the short term, we aren't changing the scope of Tempest. Instead we should just
be vigilant in managing tempest's growth (which we've been trying to do
already). We can revisit the discussion of decreasing Tempest's size once
everyone has figured out the per-project functional testing. This will also
give us time to collect longer-term data about test stability in the gate so we
can figure out which things are actually valuable to have in tempest. I think
this is what probably got lost in the noise here, but it has been discussed
elsewhere.
-Matt Treinish