<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Jul 24, 2014 at 3:54 PM, Sean Dague <span dir="ltr"><<a href="mailto:sean@dague.net" target="_blank">sean@dague.net</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 07/24/2014 05:57 PM, Matthew Treinish wrote:<br>
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:<br>
>> OpenStack has a substantial CI system that is core to its development<br>
>> process. The goals of the system are to facilitate merging good code,<br>
>> prevent regressions, and ensure that there is at least one configuration<br>
>> of upstream OpenStack that we know works as a whole. The "project<br>
>> gating" technique that we use is effective at preventing many kinds of<br>
>> regressions from landing; however, more subtle, non-deterministic bugs<br>
>> can still get through, and these are the bugs that are currently<br>
>> plaguing developers with seemingly random test failures.<br>
>><br>
>> Most of these bugs are not failures of the test system; they are real<br>
>> bugs. Many of them have even been in OpenStack for a long time, but are<br>
>> only becoming visible now due to improvements in our tests. That's not<br>
>> much help to developers whose patches are being hit with negative test<br>
>> results from unrelated failures. We need to find a way to address the<br>
>> non-deterministic bugs that are lurking in OpenStack without making it<br>
>> easier for new bugs to creep in.<br>
>><br>
>> The CI system and project infrastructure are not static. They have<br>
>> evolved with the project to get to where they are today, and the<br>
>> challenge now is to continue to evolve them to address the problems<br>
>> we're seeing now. The QA and Infrastructure teams recently hosted a<br>
>> sprint where we discussed some of these issues in depth. This post from<br>
>> Sean Dague goes into a bit of the background: [1]. The rest of this<br>
>> email outlines the medium and long-term changes we would like to make to<br>
>> address these problems.<br>
>><br>
>> [1] <a href="https://dague.net/2014/07/22/openstack-failures/" target="_blank">https://dague.net/2014/07/22/openstack-failures/</a><br>
>><br>
>> ==Things we're already doing==<br>
>><br>
>> The elastic-recheck tool[2] is used to identify "random" failures in<br>
>> test runs. It tries to match failures to known bugs using signatures<br>
>> created from log messages. It helps developers prioritize bugs by how<br>
>> frequently they manifest as test failures. It also collects information<br>
>> on unclassified errors -- we can see how many (and which) test runs<br>
>> failed for an unknown reason and our overall progress on finding<br>
>> fingerprints for random failures.<br>
>><br>
>> [2] <a href="http://status.openstack.org/elastic-recheck/" target="_blank">http://status.openstack.org/elastic-recheck/</a><br>
>><br>
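>> For anyone who hasn't looked at one, a fingerprint is just a small query<br>
>> file in the elastic-recheck repo; a minimal sketch (the bug number and<br>
>> message text here are made up) looks roughly like this:<br>
>><br>
>>   # queries/1234567.yaml<br>
>>   query: ><br>
>>     message:"Timed out waiting for a reply to message ID" AND<br>
>>     tags:"screen-n-cpu.txt"<br>
>><br>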
>> We added a feature to Zuul that lets us manually "promote" changes to<br>
>> the top of the Gate pipeline. When the QA team identifies a change that<br>
>> fixes a bug that is affecting overall gate stability, we can move that<br>
>> change to the top of the queue so that it may merge more quickly.<br>
>><br>
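>> (Operationally this is the Zuul client's promote command, run by someone<br>
>> with access to the Zuul server; the change and patchset numbers below are<br>
>> hypothetical.)<br>
>><br>
>>   zuul promote --pipeline gate --changes 98765,4<br>
>><br>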
>> We added the clean check facility in reaction to the January gate<br>
>> breakdown. While it does mean that any individual patch might see more<br>
>> tests run on it, it has largely kept the gate queue to a countable number<br>
>> of hours, instead of regularly growing to more than a work day in<br>
>> length. It also means that a developer can Approve a code merge before<br>
>> tests have returned without ruining it for everyone else if the tests<br>
>> turn out to catch a bug.<br>
>><br>
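>> Mechanically, "clean check" means the gate pipeline in Zuul's layout only<br>
>> enqueues changes whose current patchset already has a +1 Verified vote<br>
>> from Jenkins; schematically (simplified, and not the exact production<br>
>> layout) the relevant bit looks something like:<br>
>><br>
>>   - name: gate<br>
>>     manager: DependentPipelineManager<br>
>>     require:<br>
>>       open: True<br>
>>       current-patchset: True<br>
>>       approval:<br>
>>         - verified: [1, 2]<br>
>>           username: jenkins<br>
>><br>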
>> ==Future changes==<br>
>><br>
>> ===Communication===<br>
>> We used to be better at communicating about the CI system. As it and<br>
>> the project grew, we incrementally added to our institutional knowledge,<br>
>> but we haven't been good about maintaining that information in a form<br>
>> that new or existing contributors can consume to understand what's going<br>
>> on and why.<br>
>><br>
>> We have started on a major effort in that direction that we call the<br>
>> "infra-manual" project -- it's designed to be a comprehensive "user<br>
>> manual" for the project infrastructure, including the CI process. Even<br>
>> before that project is complete, we will write a document that<br>
>> summarizes the CI system and ensure it is included in new developer<br>
>> documentation and linked to from test results.<br>
>><br>
>> There are also a number of ways for people to get involved in the CI<br>
>> system, whether focused on Infrastructure or QA, but it is not always<br>
>> clear how to do so. We will improve our documentation to highlight how<br>
>> to contribute.<br>
>><br>
>> ===Fixing Faster===<br>
>><br>
>> We introduce bugs to OpenStack at some constant rate, and they pile up<br>
>> over time. Our systems currently treat all changes as equally risky and<br>
>> important to the health of the system, which makes landing code changes<br>
>> to fix key bugs slow when we're at a high reset rate. We've got a manual<br>
>> process of promoting changes today to get around this, but that's<br>
>> actually quite costly in people's time, and requires getting all the<br>
>> right people together at once to promote changes. You can see a number of the<br>
>> changes we promoted during the gate storm in June [3], and it was no<br>
>> small number of fixes to get us back to a reasonably passing gate. We<br>
>> think that optimizing this system will help us land fixes to critical<br>
>> bugs faster.<br>
>><br>
>> [3] <a href="https://etherpad.openstack.org/p/gatetriage-june2014" target="_blank">https://etherpad.openstack.org/p/gatetriage-june2014</a><br>
>><br>
>> The basic idea is to use the data from elastic-recheck to identify that<br>
>> a patch is fixing a critical gate-related bug. When one of these is<br>
>> found in the queues it will be given higher priority, including bubbling<br>
>> up to the top of the gate queue automatically. The manual promote<br>
>> process should no longer be needed; instead, changes fixing<br>
>> elastic-recheck tracked issues will be promoted automatically.<br>
>><br>
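>> None of this automation exists yet, but to make the idea concrete, the<br>
>> matching step might look something like the sketch below (the bug numbers<br>
>> are made up, and in practice the gate-critical set would come from<br>
>> elastic-recheck's tracked queries rather than a hard-coded list):<br>
>><br>
>>   import re<br>
>><br>
>>   # Illustrative only: bug numbers elastic-recheck flags as gate-critical.<br>
>>   CRITICAL_GATE_BUGS = {1234567, 2345678}<br>
>><br>
>>   CLOSES_BUG = re.compile(r'Closes-Bug:\s*#?(\d+)', re.IGNORECASE)<br>
>><br>
>>   def fixes_critical_gate_bug(commit_message):<br>
>>       """True if the change claims to close a tracked gate-critical bug."""<br>
>>       return any(int(num) in CRITICAL_GATE_BUGS<br>
>>                  for num in CLOSES_BUG.findall(commit_message))<br>
>><br>
>>   # Zuul (or a small helper) could then enqueue such changes with high<br>
>>   # precedence, doing today's manual promote automatically.<br>
>><br>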
>> At the same time we'll also promote review of critical gate bug fixes by<br>
>> making them visible in a number of different channels (like on the<br>
>> elastic-recheck pages, review day, and in the gerrit dashboards). The idea here<br>
>> again is to make the reviews that fix key bugs pop to the top of<br>
>> everyone's views.<br>
>><br>
>> ===Testing more tactically===<br>
>><br>
>> One of the challenges that exists today is that we've got basically 2<br>
>> levels of testing in most of OpenStack: unit tests, and running a whole<br>
>> OpenStack cloud. Over time we've focused on adding more and more<br>
>> configurations and tests to the latter, but as we've seen, when things<br>
>> fail in a whole OpenStack cloud, getting to the root cause is often<br>
>> quite hard. So hard in fact that most people throw up their hands and<br>
>> just run 'recheck'. If a test run fails, and no one looks at why, does<br>
>> it provide any value?<br>
>><br>
>> We need to get to a balance where we are testing that OpenStack works as<br>
>> a whole in some configuration, but as we've seen, even our best and<br>
>> brightest can't seem to make OpenStack reliably boot a compute instance<br>
>> with working networking 100% of the time if we happen to be running more<br>
>> than one API request at once.<br>
>><br>
>> Getting there is a multi-part process:<br>
>><br>
>> * Reduce the gating configurations down to some gold standard<br>
>> configuration(s). This will be a small number of configurations that<br>
>> we all agree everything will gate on. This means things like<br>
>> postgresql, cells, and different environments will all get dropped from<br>
>> the gate as we know it.<br>
>><br>
>> * Put the burden for a bunch of these tests back on the projects as<br>
>> "functional" tests. Basically a custom devstack environment that a<br>
>> project can create with the set of services that it minimally needs<br>
>> to do its job. These functional tests will live in the project<br>
>> tree, not in Tempest, so they can be landed atomically as part of the<br>
>> project's normal development process (a sketch of what such a test<br>
>> might look like follows this list).<br>
>><br>
>> * For all non gold standard configurations, we'll dedicate a part of<br>
>> our infrastructure to running them in a continuous background loop,<br>
>> as well as making these configs available as experimental jobs. The<br>
>> idea here is that we'll actually be able to provide more<br>
>> configurations that are operating in a more traditional CI (post<br>
>> merge) context. People that are interested in keeping these bits<br>
>> functional can monitor those jobs and help with fixes when needed.<br>
>> The experimental jobs mean that if developers are concerned about<br>
>> the effect of a particular change on one of these configs, it's easy<br>
>> to request a pre-merge test run. In the near term we might imagine<br>
>> this would allow for things like ceph, mongodb, docker, and possibly<br>
>> very new libvirt to be validated in some way upstream.<br>
>><br>
>> * Provide some kind of easy-to-view dashboards of these jobs, as well<br>
>> as a policy that if some job is failing for more than some period of time,<br>
>> it's removed from the system. We want to provide whatever feedback<br>
>> we can to engaged parties, but people do need to realize that<br>
>> engagement is key. The biggest part of putting tests into OpenStack<br>
>> isn't landing the tests, but dealing with their failures.<br>
>><br>
>> * Encourage projects to specifically land interface tests in other<br>
>> projects when they depend on certain behavior.<br>
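>><br>
>> To make the functional-test item above concrete, here is a purely<br>
>> illustrative sketch of what an in-tree functional test might look like;<br>
>> the base class and client fixture are hypothetical stand-ins for whatever<br>
>> the project provides, not an existing API:<br>
>><br>
>>   # myproject/tests/functional/test_boot.py (hypothetical)<br>
>>   from myproject.tests.functional import base<br>
>><br>
>>   class ServerBootTest(base.FunctionalTestCase):<br>
>>       """Runs against the project's own minimal devstack environment."""<br>
>><br>
>>       def test_boot_and_delete_server(self):<br>
>>           server = self.client.create_server(name='func-boot-test',<br>
>>                                              flavor='m1.tiny')<br>
>>           self.addCleanup(self.client.delete_server, server['id'])<br>
>>           # Because this lives in the project tree, white/grey box checks<br>
>>           # of the DB or queues could go here too.<br>
>>           self.wait_for_status(server['id'], 'ACTIVE')<br>
>><br>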
><br>
> So I think we (or at least I do) need clarification around this item. My question<br>
> is which interfaces are we depending on that need these specific types of<br>
> tests? Projects shouldn't be depending on another project's unstable interfaces.<br>
> If specific behavior is required for a cross-project interaction it should be<br>
> part of defined stable API, hopefully the REST API, and then that behavior<br>
> should be enforced for everyone not just the cross-project interaction.<br>
><br>
> If I'm interpreting this correctly, then what is actually needed here is to<br>
> ensure that there is test coverage somewhere for the APIs that should<br>
> already be tested where there is a cross-project dependency. This is actually<br>
> the same thing we see all the time when there is a lack of test coverage<br>
> on certain APIs that are being used (the nova default quotas example comes to<br>
> mind). I just think calling this a special class of test is a bit misleading,<br>
> since it shouldn't actually differ from any other API test. Or am I missing<br>
> something?<br>
<br>
</div></div>Projects are consuming the behavior of other projects far beyond just<br>
the formal REST APIs. Notifications are another great instance of that.<br></blockquote><div><br></div><div>I think the fact that notifications aren't versioned or really 'contractual', but are being used as such, is a huge issue.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
This is also more of a pragmatic organic approach to figuring out the<br>
interfaces we need to lock down. When one projects breaks depending on<br>
an interface in another project, that should trigger this kind of<br>
contract growth, which hopefully formally turns into a document later<br>
for a stable interface.<br></blockquote><div><br></div><div>This approach sounds like a recipe for us playing a never ending game of catch up, when issues like this have arrisen in the past we rarely get around to creating a stable interface (see notifications). I would rather push on using stable APIs (and creating them as needed) instead.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div class="h5"><br>
>> Let's imagine an example of how this works in the real world.<br>
>><br>
>> * The heat-slow job is deleted.<br>
>><br>
>> * The heat team creates a specific functional job which tests some of<br>
>> their deeper functionality in Heat; all the tests live in Heat, and<br>
>> because of this the tests can include white/grey-box testing of the<br>
>> DB and queues while things are progressing.<br>
>><br>
>> * Nova lands a change which neither Tempest nor our configs exercise,<br>
>> but breaks Heat.<br>
>><br>
>> * The Heat project can now decide if it's more important to keep the<br>
>> test in place (preventing them from landing code), or to skip it to<br>
>> get back to work.<br>
>><br>
>> * The Heat team then works on the right fix for Nova, or communicates<br>
>> with the Nova team on the issue at hand. The fix to Nova should *also*<br>
>> include tests which lock down that interface so that Nova<br>
>> won't break it again in the future (the ironic team did this with<br>
>> their test_ironic_contract patch). These tests could be unit tests,<br>
>> if the behavior is testable that way, or functional tests in the Nova<br>
>> tree (a sketch follows below).<br>
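>><br>
>> (A purely illustrative sketch of such an interface-locking test on the<br>
>> Nova side; the helper module and payload keys are hypothetical, and a<br>
>> real test would use Nova's actual test fixtures.)<br>
>><br>
>>   import unittest<br>
>><br>
>>   from nova.tests import contract_helpers  # hypothetical module<br>
>><br>
>>   class TestInstanceNotificationContract(unittest.TestCase):<br>
>><br>
>>       def test_create_end_payload_keys(self):<br>
>>           payload = contract_helpers.emit_instance_create_end()<br>
>>           # Heat relies on these keys; renaming or dropping them is an<br>
>>           # interface break, so this test should fail loudly first.<br>
>>           for key in ('instance_id', 'state', 'tenant_id'):<br>
>>               self.assertIn(key, payload)<br>
>><br>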
><br>
> The one thing I want to point out here is that the ironic contract test should be<br>
> an exception; I don't feel that we want that to be the norm. It's not a good<br>
> example for a few reasons, mostly around the fact that the ironic tree depends on<br>
> the purposefully unstable nova driver API as a temporary measure until the ironic<br>
> driver is merged into the nova tree. The contract API tests will go away once<br>
> the driver is in the nova tree. It should not be necessary for something over<br>
> the REST API, since the contract should be enforced through tempest. (even under<br>
> this new model, I expect this to still be true)<br>
><br>
> There was a comment which someone (I can't remember who) brought up at the<br>
> Portland summit that tempest acts like double-entry bookkeeping for the API<br>
> contract, and that has been something we've seen as extremely valuable<br>
> historically, which is why I don't want to see this aspect of tempest's role in<br>
> the gate altered.<br>
<br>
</div></div>I've been the holder of the double-entry bookkeeping point of view in the past.<br>
However, after the last six months of fragility, I just don't see how<br>
that's a sustainable point of view. The QA team remains roughly<br>
constant in size, while the number of interfaces and projects grows at a good<br>
clip.<br>
<div class=""><br>
> Although, all I think we actually need is an API definition for testing in an<br>
> external repo, just to prevent inadvertent changes (whether that gets used in<br>
> tempest or not). So another alternative I see here is something that I've started<br>
> to outline in [4] to address the potential for duplicated code and effort in<br>
> the new functional test suites. If all the project-specific functional tests<br>
> use clients from an external functional testing library repo, then this concern<br>
> goes away.<br>
<br>
</div>Actually, I don't think these would be using external clients. This is<br>
in-tree testing.<br>
<br>
This will definitely be an experiment to get the API testing closer to<br>
the source. That being said, Swift really has done this fine for a long<br>
time, and I think we need to revisit the premise that projects can't be<br>
trusted.<br>
<div class=""><br>
> Now, if something like this example were to be exposed because of a coverage<br>
> gap I think it's fair game to have a specific test in nova's functional test<br>
> suite. But, I also think there should be an external audit of that API somewhere<br>
> too. Ideally, what I'd like to see is a write-once test<br>
> graduation procedure for moving appropriate things into tempest (or somewhere<br>
> else) from the project-specific functional tests, basically like what we<br>
> discussed during Maru's summit session on Neutron functional testing in Atlanta.<br>
<br>
</div>Right, and I think basically we shouldn't graduate most of those tests.<br>
They are neutron tests, in the neutron tree. A few key ones, we may decide,<br>
should be run outside that context.<br>
<div class=""><br>
> As for the other, more social, goal of this step, fostering communication between<br>
> the projects and not using QA and/or Infra as a middleman, I fully support it. I<br>
> agree that we probably have too much proxying going on between projects, using QA<br>
> and/or infra instead of talking directly.<br>
<br>
</div>Our current model leans far too much on the idea that the only time we<br>
ever try to test things for real is when we throw all 1 million lines of<br>
source code into one pot and stir. It really shouldn't be surprising how<br>
many bugs shake out there. And this is the wrong layer to debug from, so<br>
I firmly believe we need to change this back to something we can<br>
actually use to shake the bugs out. Right now we're<br>
finding them, but our infrastructure isn't optimized for fixing them,<br>
and we need to change that.<br>
<div class="HOEnZb"><div class="h5"><br>
-Sean<br>
<br>
--<br>
Sean Dague<br>
<a href="http://dague.net" target="_blank">http://dague.net</a><br>
<br>
</div></div><br>_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
<br></blockquote></div><br></div></div>