<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Jul 24, 2014 at 3:54 PM, Sean Dague <span dir="ltr"><<a href="mailto:sean@dague.net" target="_blank">sean@dague.net</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 07/24/2014 05:57 PM, Matthew Treinish wrote:<br>
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:<br>
>> OpenStack has a substantial CI system that is core to its development<br>
>> process. The goals of the system are to facilitate merging good code,<br>
>> prevent regressions, and ensure that there is at least one configuration<br>
>> of upstream OpenStack that we know works as a whole. The "project<br>
>> gating" technique that we use is effective at preventing many kinds of<br>
>> regressions from landing; however, more subtle, non-deterministic bugs<br>
>> can still get through, and these are the bugs that are currently<br>
>> plaguing developers with seemingly random test failures.<br>
>><br>
>> Most of these bugs are not failures of the test system; they are real<br>
>> bugs. Many of them have even been in OpenStack for a long time, but are<br>
>> only becoming visible now due to improvements in our tests. That's not<br>
>> much help to developers whose patches are being hit with negative test<br>
>> results from unrelated failures. We need to find a way to address the<br>
>> non-deterministic bugs that are lurking in OpenStack without making it<br>
>> easier for new bugs to creep in.<br>
>><br>
>> The CI system and project infrastructure are not static. They have<br>
>> evolved with the project to get to where they are today, and the<br>
>> challenge now is to continue to evolve them to address the problems<br>
>> we're seeing now. The QA and Infrastructure teams recently hosted a<br>
>> sprint where we discussed some of these issues in depth. This post from<br>
>> Sean Dague goes into a bit of the background: [1]. The rest of this<br>
>> email outlines the medium and long-term changes we would like to make to<br>
>> address these problems.<br>
>><br>
>> [1] <a href="https://dague.net/2014/07/22/openstack-failures/" target="_blank">https://dague.net/2014/07/22/openstack-failures/</a><br>
>><br>
>> ==Things we're already doing==<br>
>><br>
>> The elastic-recheck tool[2] is used to identify "random" failures in<br>
>> test runs. It tries to match failures to known bugs using signatures<br>
>> created from log messages. It helps developers prioritize bugs by how<br>
>> frequently they manifest as test failures. It also collects information<br>
>> on unclassified errors -- we can see how many (and which) test runs<br>
>> failed for an unknown reason and our overall progress on finding<br>
>> fingerprints for random failures.<br>
>><br>
>> [2] <a href="http://status.openstack.org/elastic-recheck/" target="_blank">http://status.openstack.org/elastic-recheck/</a><br>
>><br>
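>> For anyone who hasn't looked at one, a fingerprint is just a small query<br>
>> file in the elastic-recheck repo; a minimal sketch (the bug number and<br>
>> message text here are made up) looks roughly like this:<br>
>><br>
>>   # queries/1234567.yaml<br>
>>   query: ><br>
>>     message:"Timed out waiting for a reply to message ID" AND<br>
>>     tags:"screen-n-cpu.txt"<br>
>><br>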
>> We added a feature to Zuul that lets us manually "promote" changes to<br>
>> the top of the Gate pipeline. When the QA team identifies a change that<br>
>> fixes a bug that is affecting overall gate stability, we can move that<br>
>> change to the top of the queue so that it may merge more quickly.<br>
>><br>
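>> (Operationally this is the Zuul client's promote command, run by someone<br>
>> with access to the Zuul server; the change and patchset numbers below are<br>
>> hypothetical.)<br>
>><br>
>>   zuul promote --pipeline gate --changes 98765,4<br>
>><br>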
>> We added the clean check facility in reaction to the January gate<br>
>> breakdown. While it does mean that any individual patch might see more<br>
>> tests run on it, it has largely kept the gate queue to a countable number<br>
>> of hours, instead of regularly growing to more than a work day in<br>
>> length. It also means that a developer can Approve a code merge before<br>
>> tests have returned without ruining it for everyone else if the tests<br>
>> turn out to catch a bug.<br>
>><br>
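>> Mechanically, "clean check" means the gate pipeline in Zuul's layout only<br>
>> enqueues changes whose current patchset already has a +1 Verified vote<br>
>> from Jenkins; schematically (simplified, and not the exact production<br>
>> layout) the relevant bit looks something like:<br>
>><br>
>>   - name: gate<br>
>>     manager: DependentPipelineManager<br>
>>     require:<br>
>>       open: True<br>
>>       current-patchset: True<br>
>>       approval:<br>
>>         - verified: [1, 2]<br>
>>           username: jenkins<br>
>><br>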
>> ==Future changes==<br>
>><br>
>> ===Communication===<br>
>> We used to be better at communicating about the CI system. As it and<br>
>> the project grew, we incrementally added to our institutional knowledge,<br>
>> but we haven't been good about maintaining that information in a form<br>
>> that new or existing contributors can consume to understand what's going<br>
>> on and why.<br>
>><br>
>> We have started on a major effort in that direction that we call the<br>
>> "infra-manual" project -- it's designed to be a comprehensive "user<br>
>> manual" for the project infrastructure, including the CI process. Even<br>
>> before that project is complete, we will write a document that<br>
>> summarizes the CI system and ensure it is included in new developer<br>
>> documentation and linked to from test results.<br>
>><br>
>> There are also a number of ways for people to get involved in the CI<br>
>> system, whether focused on Infrastructure or QA, but it is not always<br>
>> clear how to do so. We will improve our documentation to highlight how<br>
>> to contribute.<br>
>><br>
>> ===Fixing Faster===<br>
>><br>
>> We introduce bugs to OpenStack at some constant rate, and they pile up<br>
>> over time. Our systems currently treat all changes as equally risky and<br>
>> important to the health of the system, which makes landing code changes<br>
>> to fix key bugs slow when we're at a high reset rate. We've got a manual<br>
>> process of promoting changes today to get around this, but that's<br>
>> actually quite costly in people's time, and requires getting all the<br>
>> right people together at once to promote changes. You can see a number of the<br>
>> changes we promoted during the gate storm in June [3], and it was no<br>
>> small number of fixes to get us back to a reasonably passing gate. We<br>
>> think that optimizing this system will help us land fixes to critical<br>
>> bugs faster.<br>
>><br>
>> [3] <a href="https://etherpad.openstack.org/p/gatetriage-june2014" target="_blank">https://etherpad.openstack.org/p/gatetriage-june2014</a><br>
>><br>
>> The basic idea is to use the data from elastic-recheck to identify that<br>
>> a patch is fixing a critical gate-related bug. When one of these is<br>
>> found in the queues it will be given higher priority, including bubbling<br>
>> up to the top of the gate queue automatically. The manual promote<br>
>> process should no longer be needed; instead, changes fixing<br>
>> elastic-recheck tracked issues will be promoted automatically.<br>
>><br>
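>> None of this automation exists yet, but to make the idea concrete, the<br>
>> matching step might look something like the sketch below (the bug numbers<br>
>> are made up, and in practice the gate-critical set would come from<br>
>> elastic-recheck's tracked queries rather than a hard-coded list):<br>
>><br>
>>   import re<br>
>><br>
>>   # Illustrative only: bug numbers elastic-recheck flags as gate-critical.<br>
>>   CRITICAL_GATE_BUGS = {1234567, 2345678}<br>
>><br>
>>   CLOSES_BUG = re.compile(r'Closes-Bug:\s*#?(\d+)', re.IGNORECASE)<br>
>><br>
>>   def fixes_critical_gate_bug(commit_message):<br>
>>       """True if the change claims to close a tracked gate-critical bug."""<br>
>>       return any(int(num) in CRITICAL_GATE_BUGS<br>
>>                  for num in CLOSES_BUG.findall(commit_message))<br>
>><br>
>>   # Zuul (or a small helper) could then enqueue such changes with high<br>
>>   # precedence, doing today's manual promote automatically.<br>
>><br>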
>> At the same time we'll also promote review of critical gate bug fixes by<br>
>> making them visible in a number of different channels (like on the<br>
>> elastic-recheck pages, review day, and in the gerrit dashboards). The idea here<br>
>> again is to make the reviews that fix key bugs pop to the top of<br>
>> everyone's views.<br>
>><br>
>> ===Testing more tactically===<br>
>><br>
>> One of the challenges that exists today is that we've got basically 2<br>
>> levels of testing in most of OpenStack: unit tests, and running a whole<br>
>> OpenStack cloud. Over time we've focused on adding more and more<br>
>> configurations and tests to the latter, but as we've seen, when things<br>
>> fail in a whole OpenStack cloud, getting to the root cause is often<br>
>> quite hard. So hard in fact that most people throw up their hands and<br>
>> just run 'recheck'. If a test run fails, and no one looks at why, does<br>
>> it provide any value?<br>
>><br>
>> We need to get to a balance where we are testing that OpenStack works as<br>
>> a whole in some configuration, but as we've seen, even our best and<br>
>> brightest can't seem to make OpenStack reliably boot a compute instance<br>
>> with working networking 100% of the time if we happen to be running more<br>
>> than one API request at once.<br>
>><br>
>> Getting there is a multi-part process:<br>
>><br>
>> * Reduce the gating configurations down to some gold standard<br>
>> configuration(s). This will be a small number of configurations that<br>
>> we all agree everything will gate on. This means things like<br>
>> postgresql, cells, and different environments will all get dropped from<br>
>> the gate as we know it.<br>
>><br>
>> * Put the burden for a bunch of these tests back on the projects as<br>
>> "functional" tests. Basically a custom devstack environment that a<br>
>> project can create with the set of services that it minimally needs<br>
>> to do its job. These functional tests will live in the project<br>
>> tree, not in Tempest, so they can be landed atomically as part of the<br>
>> project's normal development process (a sketch of what such a test<br>
>> might look like follows this list).<br>
>><br>
>> * For all non gold standard configurations, we'll dedicate a part of<br>
>> our infrastructure to running them in a continuous background loop,<br>
>> as well as making these configs available as experimental jobs. The<br>
>> idea here is that we'll actually be able to provide more<br>
>> configurations that are operating in a more traditional CI (post<br>
>> merge) context. People that are interested in keeping these bits<br>
>> functional can monitor those jobs and help with fixes when needed.<br>
>> The experimental jobs mean that if developers are concerned about<br>
>> the effect of a particular change on one of these configs, it's easy<br>
>> to request a pre-merge test run. In the near term we might imagine<br>
>> this would allow for things like ceph, mongodb, docker, and possibly<br>
>> very new libvirt to be validated in some way upstream.<br>
>><br>
>> * Provide some kind of easy-to-view dashboards of these jobs, as well<br>
>> as a policy that if some job is failing for more than some period of time,<br>
>> it's removed from the system. We want to provide whatever feedback<br>
>> we can to engaged parties, but people do need to realize that<br>
>> engagement is key. The biggest part of putting tests into OpenStack<br>
>> isn't landing the tests, but dealing with their failures.<br>
>><br>
>> * Encourage projects to specifically land interface tests in other<br>
>> projects when they depend on certain behavior.<br>
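>><br>
>> To make the functional-test item above concrete, here is a purely<br>
>> illustrative sketch of what an in-tree functional test might look like;<br>
>> the base class and client fixture are hypothetical stand-ins for whatever<br>
>> the project provides, not an existing API:<br>
>><br>
>>   # myproject/tests/functional/test_boot.py (hypothetical)<br>
>>   from myproject.tests.functional import base<br>
>><br>
>>   class ServerBootTest(base.FunctionalTestCase):<br>
>>       """Runs against the project's own minimal devstack environment."""<br>
>><br>
>>       def test_boot_and_delete_server(self):<br>
>>           server = self.client.create_server(name='func-boot-test',<br>
>>                                              flavor='m1.tiny')<br>
>>           self.addCleanup(self.client.delete_server, server['id'])<br>
>>           # Because this lives in the project tree, white/grey box checks<br>
>>           # of the DB or queues could go here too.<br>
>>           self.wait_for_status(server['id'], 'ACTIVE')<br>
>><br>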
><br>
> So I think we (or at least I do) need clarification around this item. My question<br>
> is which interfaces are we depending on that need these specific types of<br>
> tests? Projects shouldn't be depending on another project's unstable interfaces.<br>
> If specific behavior is required for a cross-project interaction it should be<br>
> part of defined stable API, hopefully the REST API, and then that behavior<br>
> should be enforced for everyone not just the cross-project interaction.<br>
><br>
> If I'm interpreting this correctly, then what is actually needed here is to<br>
> ensure that there is test coverage somewhere for the APIs that should<br>
> already be tested where there is a cross-project dependency. This is actually<br>
> the same thing we see all the time when there is a lack of test coverage<br>
> on certain APIs that are being used (the nova default quotas example comes to<br>
> mind). I just think calling this a special class of test is a bit misleading,<br>
> since it shouldn't actually differ from any other API test. Or am I missing<br>
> something?<br>
<br>
</div></div>Projects are consuming the behavior of other projects far beyond just<br>
the formal REST APIs. Notifications are another great instance of that.<br></blockquote><div><br></div><div>I think the fact that notifications aren't versioned or really 'contractual', but are being used as such, is a huge issue.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
This is also more of a pragmatic organic approach to figuring out the<br>
interfaces we need to lock down. When one projects breaks depending on<br>
an interface in another project, that should trigger this kind of<br>
contract growth, which hopefully formally turns into a document later<br>
for a stable interface.<br></blockquote><div><br></div><div>This approach sounds like a recipe for us playing a never ending game of catch up, when issues like this have arrisen in the past we rarely get around to creating a stable interface (see notifications). I would rather push on using stable APIs (and creating them as needed) instead.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div class="h5"><br>
>> Let's imagine an example of how this works in the real world.<br>
>><br>
>> * The heat-slow job is deleted.<br>
>><br>
>> * The heat team creates a specific functional job which tests some of<br>
>> their deeper functionality in Heat; all the tests live in Heat, and<br>
>> because of this the tests can include white/grey-box testing of the<br>
>> DB and queues while things are progressing.<br>
>><br>
>> * Nova lands a change which neither Tempest nor our configs exercise,<br>
>> but breaks Heat.<br>
>><br>
>> * The Heat project can now decide if it's more important to keep the<br>
>> test in place (preventing them from landing code), or to skip it to<br>
>> get back to work.<br>
>><br>
>> * The Heat team then works on the right fix for Nova, or communicates<br>
>> with the Nova team on the issue at hand. The fix to Nova should *also*<br>
>> include tests which lock down that interface so that Nova<br>
>> won't break it again in the future (the ironic team did this with<br>
>> their test_ironic_contract patch). These tests could be unit tests,<br>
>> if the behavior is testable that way, or functional tests in the Nova<br>
>> tree (a sketch follows below).<br>
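>><br>
>> (A purely illustrative sketch of such an interface-locking test on the<br>
>> Nova side; the helper module and payload keys are hypothetical, and a<br>
>> real test would use Nova's actual test fixtures.)<br>
>><br>
>>   import unittest<br>
>><br>
>>   from nova.tests import contract_helpers  # hypothetical module<br>
>><br>
>>   class TestInstanceNotificationContract(unittest.TestCase):<br>
>><br>
>>       def test_create_end_payload_keys(self):<br>
>>           payload = contract_helpers.emit_instance_create_end()<br>
>>           # Heat relies on these keys; renaming or dropping them is an<br>
>>           # interface break, so this test should fail loudly first.<br>
>>           for key in ('instance_id', 'state', 'tenant_id'):<br>
>>               self.assertIn(key, payload)<br>
>><br>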
><br>
> The one thing I want to point out here is that the ironic contract test should be<br>
> an exception; I don't feel that we want that to be the norm. It's not a good<br>
> example for a few reasons, mostly around the fact that the ironic tree depends on<br>
> the purposefully unstable nova driver API as a temporary measure until the ironic<br>
> driver is merged into the nova tree. The contract API tests will go away once<br>
> the driver is in the nova tree. It should not be necessary for something over<br>
> the REST API, since the contract should be enforced through tempest. (even under<br>
> this new model, I expect this to still be true)<br>
><br>
> There was a comment which someone (I can't remember who) brought up at the<br>
> Portland summit that tempest acts like double-entry bookkeeping for the API<br>
> contract, and that has been something we've seen as extremely valuable<br>
> historically, which is why I don't want to see this aspect of tempest's role in<br>
> the gate altered.<br>
<br>
</div></div>I've been the holder of the double-entry bookkeeping point of view in the past.<br>
However, after the last six months of fragility, I just don't see how<br>
that's a sustainable point of view. The QA team remains roughly<br>
constant in size, while the number of interfaces and projects grows at a good<br>
clip.<br>
<div class=""><br>
> Although, all I think we actually need is an API definition for testing in an<br>
> external repo, just to prevent inadvertent changes (whether that gets used in<br>
> tempest or not). So another alternative I see here is something that I've started<br>
> to outline in [4] to address the potential for duplicated code and effort in<br>
> the new functional test suites. If all the project-specific functional tests<br>
> use clients from an external functional testing library repo, then this concern<br>
> goes away.<br>
<br>
</div>Actually, I don't think these would be using external clients. This is<br>
in-tree testing.<br>
<br>
This will definitely be an experiment to get the API testing closer to<br>
the source. That being said, Swift really has done this fine for a long<br>
time, and I think we need to revisit the premise that projects can't be<br>
trusted.<br>
<div class=""><br>
> Now, if something like this example were to be exposed because of a coverage<br>
> gap I think it's fair game to have a specific test in nova's functional test<br>
> suite. But, I also think there should be an external audit of that API somewhere<br>
> too. Ideally, what I'd like to see is a write-once test<br>
> graduation procedure for moving appropriate things into tempest (or somewhere<br>
> else) from the project-specific functional tests, basically like what we<br>
> discussed during Maru's summit session on Neutron functional testing in Atlanta.<br>
<br>
</div>Right, and I think basically we shouldn't graduate most of those tests.<br>
They are neutron tests, in the neutron tree. A few key ones, we may decide,<br>
should be run outside that context.<br>
<div class=""><br>
> As for the other, more social, goal of this step, fostering communication between<br>
> the projects and not using QA and/or Infra as a middleman, I fully support it. I<br>
> agree that we probably have too much proxying going on between projects, using QA<br>
> and/or infra instead of talking directly.<br>
<br>
</div>Our current model leans far too much on the idea that the only time we<br>
ever try to test things for real is when we throw all 1 million lines of<br>
source code into one pot and stir. It really shouldn't be surprising how<br>
many bugs shake out there. And this is the wrong layer to debug from, so<br>
I firmly believe we need to change this back to something we can<br>
actually use to shake the bugs out. Right now we're<br>
finding them, but our infrastructure isn't optimized for fixing them,<br>
and we need to change that.<br>
<div class="HOEnZb"><div class="h5"><br>
-Sean<br>
<br>
--<br>
Sean Dague<br>
<a href="http://dague.net" target="_blank">http://dague.net</a><br>
<br>
</div></div><br>_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
<br></blockquote></div><br></div></div>