<p dir="ltr"><br>

On Jun 14, 2014 11:12 AM, "Robert Collins" <<a href="mailto:robertc@robertcollins.net">robertc@robertcollins.net</a>> wrote:<br>

><br>

> You know it's bad when you can't sleep because you're redesigning gate<br>

> workflows in your head.... so I apologise that this email is perhaps<br>

> not as rational, nor as organised, as usual - but, ^^^^. :)<br>

><br>

> Obviously this is very important to address, and if we can come up<br>

> with something systemic I'm going to devote my time both directly, and<br>

> via resource-hunting within HP, to address it. And accordingly I'm<br>

> going to feel free to say 'zuul this' with no regard for existing<br>

> features. We need to get ahead of the problem and figure out how to<br>

> stay there, and I think below I show why the current strategy just<br>

> won't do that.<br>

><br>

> On 13 June 2014 06:08, Sean Dague <<a href="mailto:sean@dague.net">sean@dague.net</a>> wrote:<br>

><br>

> > We're hitting a couple of inflection points.<br>

> ><br>

> > 1) We're basically at capacity for the unit work that we can do. Which<br>

> > means it's time to start making decisions if we believe everything we<br>

> > currently have running is more important than the things we aren't<br>

> > currently testing.<br>

> ><br>

> > Everyone wants multinode testing in the gate. It would be impossible to<br>

> > support that given current resources.<br>

><br>

> How much of our capacity problem is due to waste - such as:<br>

>  - tempest runs of code the author knows is broken<br>

>  - tempest runs of code that doesn't pass unit tests<br>

>  - tempest runs while the baseline is unstable - to expand on this<br>

> one, if master only passes one commit in 4, no check job can have a<br>

> higher success rate overall.<br>

><br>

> Vs how much is an indication of the sheer volume of development being done?<br>

><br>

> > 2) We're far past the inflection point of people actually debugging jobs<br>

> > when they go wrong.<br>

> ><br>

> > The gate is backed up (currently to 24hrs) because there are bugs in<br>

> > OpenStack. Those are popping up at a rate much faster than the number of<br>

> > people who are willing to spend any time on them. And often they are<br>

> > popping up in configurations that we're not all that familiar with.<br>

><br>

> So, I *totally* appreciate that people fixing the jobs is the visible<br>

> expendable resource, but I'm not sure it's the bottleneck. I think the<br>

> bottleneck is our aggregate ability to a) detect the problem and b)<br>

> resolve it.<br>

><br>

> For instance - strawman - when the gate goes bad, and after a check for<br>

> external issues like new SQLAlchemy releases etc, what if we just<br>

> rolled trunk of every project that is in the integrated gate back to<br>

> before the success rate nosedived? I'm well aware of the DVCS issues<br>

> that implies, but from a human debugging perspective that would<br>

> massively increase the leverage we get from the folk that do dive in<br>

> and help. It moves from 'figure out that there is a problem and it<br>

> came in after X AND FIX IT' to 'figure out it came in after X'.<br>

><br>

> Reverting is usually much faster and more robust than rolling forward,<br>

> because rolling forward has more unknowns.<br>

><br>

> I think we have a systematic problem, because this situation happens<br>

> again and again. And the root cause is that our time to detect<br>

> races/nondeterministic tests is a probability function, not a simple<br>

> scalar. Sometimes we catch such tests within one patch in the gate,<br>

> sometimes they slip through. If we want to land hundreds or thousands<br>

> of patches a day, and we don't want this pain to happen, I don't see<br>

> any way other than *either*:<br>

> A - not doing this whole gating CI process at all<br>

> B - making detection a whole lot more reliable (e.g. we want<br>

> near-certainty that a given commit does not contain a race)<br>

> C - making repair a whole lot faster (e.g. we want <= one test cycle<br>

> in the gate to recover once we have determined that some commit is<br>

> broken).<br>

><br>

> Taking them in turn:<br>

> A - yeah, no. We have lots of experience with the axiom that that<br>

> which is not tested is broken. And that's the big concern about<br>

> removing things from our matrix - when they are not tested, we can be<br>

> sure that they will break and we will have to spend neurons fixing<br>

> them - either directly or as reviews from people fixing it.<br>

><br>

> B - this is really hard. Say we want to be quite sure that there are no<br>

> new races that will occur with more than some probability in a given<br>

> commit, and we assume that race codepaths might be run just once in<br>

> the whole test matrix. A single test run can never tell us that - it<br>

> just tells us it worked. What we need is some N trials where we don't<br>

> observe a new race (but may observe old races), given a maximum risk<br>

> of the introduction of a (say) 5% failure rate into the gate. [check<br>

> my stats]<br>

> (1-max risk)^trials = margin-of-error<br>

> 0.95^N = 0.01<br>

> log(0.01, base=0.95) = N<br>

> N ~= 90<br>

><br>

> So if we want to stop 5% races landing, and we may exercise any given<br>

> possible race code path a minimum of 1 times in the test matrix, we<br>

> need to exercise the whole test matrix 90 times to be sure, to within<br>

> that 1% margin, that we saw it. Raise that to a 1% race:<br>

> log(0.01, base=0.99) ~= 458<br>

> That's a lot of test runs. I don't think we can do that for each commit<br>

> with our current resources - and I'm not at all sure that asking for<br>

> enough resources to do that makes sense. Maybe it does.<br>

><br>

> Data point - our current risk, with 1% margin:<br>

> (1-max risk)^1 = 0.01<br>

> max risk = 99% (that is, a single passing gate run will happily let through races<br>

> with any amount of fail, given enough trials). In fact, it's really<br>

> just a numbers game for us at the moment - and we keep losing.<br>

><br>

> B1. We could change our definition from a per-commit basis to instead<br>

> saying 'within a given number of commits we want the probability of a<br>

> new race to be low' - amortise the cost of gaining lots of confidence<br>

> over more commits. It might work something like:<br>

>  - run regular gate runs on deeper and deeper zuul refs<br>

>  - failures eject single commits as usual<br>

>  - don't propagate successes.<br>

>  - keep going until we have 100 commits all validated but not propagated,<br>

> *or* more than (some time window, let's say 5 hours) has passed<br>

>  - start 500 test runs of all those commits, in parallel<br>

>  - if it fails, eject the whole window<br>

>  - otherwise let it in.<br>

><br>

> This might let races in individual commits within the window through<br>

> if and only if they are also fixed within the same window; coarse<br>

> failures like basic API incompatibility or failure to use deps right<br>

> would be detected as they are today. There's obviously room for<br>

> speculative execution on the whole window in fact: run 600 jobs, 100<br>

> the zuul ref build-up and 500 the confidence interval builder.<br>

><br>

> The downside of this approach is that there is a big window (because<br>

> it's amortising a big expense) which will all go in together, or not at<br>

> all. And we'd have to prevent *all those commits* from being<br>

> resubmitted until the cause of the failure was identified and actively<br>

> fixed. We'd want that to be enforced, not run on the honour system,<br>

> because any of those commits can bounce the whole set out. The flip<br>

> side is that it would be massively more effective at keeping bad<br>

> commits out.<br>

><br>

> B2. ??? I had some crazy idea of multiple branches with more and more<br>

> confidence in them, but I think they all actually boil down to<br>

> variations on a theme of B1, and if we move the centre of developer<br>

> mass to $wherever, the gate for that is where the pain will be felt.<br>

><br>

> C - If we can't make it harder to get races in, perhaps we can make it<br>

> easier to get races out. We have pretty solid emergent statistics from<br>

> every gate job that is run as check. What if we set a policy that when a<br>

> gate queue gets a race:<br>

>  - have zuul stop all merges and checks on all involved branches<br>

> (prevent further damage, free capacity for validation)<br>

>  - figure out when it surfaced<br>

>  - determine it's not an external event<br>

>  - revert all involved branches back to the point where they looked<br>

> good, as one large operation<br>

>    - run that through jenkins N (e.g. 458) times in parallel.</p>

<p dir="ltr">Do we have enough compute resources to do this?</p>
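<p dir="ltr">For a rough sense of scale, here is a minimal Python sketch of the arithmetic quoted above. It assumes, as the quoted text does, that each full run of the test matrix exercises a racy code path once and that runs are independent. The function names <code>trials_needed</code> and <code>slip_probability</code> are mine, purely for illustration.</p>

```python
import math


def trials_needed(race_rate, margin):
    """Smallest whole number of independent full-matrix runs N such that
    (1 - race_rate) ** N <= margin.

    race_rate: per-run failure probability of the race we want to catch.
    margin:    acceptable probability that every run passes anyway.
    """
    if not (0 < race_rate < 1 and 0 < margin < 1):
        raise ValueError("race_rate and margin must be strictly between 0 and 1")
    return math.ceil(math.log(margin) / math.log(1 - race_rate))


def slip_probability(race_rate, runs):
    """Probability that a race with the given per-run failure rate
    passes `runs` independent runs without ever being observed."""
    return (1 - race_rate) ** runs


if __name__ == "__main__":
    for rate in (0.05, 0.01):
        print(rate, trials_needed(rate, 0.01), slip_probability(rate, 1))
```

<p dir="ltr">Under those assumptions a 5% race needs 90 runs and a 1% race needs 459 once rounded up to a whole run (the unrounded value is about 458.2). A single passing run, which is what we rely on today, lets a 5% race through 95% of the time.</p>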

<p dir="ltr">>    - on success land it<br>

>  - go through all the merges that have been reverted and either<br>

> twiddle them to be back in review with a new patchset against the<br>

> revert to restore their content, or alternatively generate new reviews<br>

> if gerrit would make that too hard.<br>

><br>

><br>

> Getting folk to help<br>

> ==============<br>

><br>

> On the social side there is currently very little direct signalling<br>

> that the gate is in trouble: I don't mean there is no communication -<br>

> there's lots. What I mean is that Fred, a developer not on the lists<br>

> or IRC for whatever reason, pushing code up, has no signal until they<br>

> go 'why am I not getting check results', visit the status page and go<br>

> 'whoa'.<br>

><br>

> Maybe we can do something about that. For instance, when a gate is in<br>

> trouble, have zuul not schedule check jobs at all, and refuse to do<br>

> rechecks / revalidates in affected branches, unless the patch in<br>

> question is a partial-bug: or bug: for one of the gate bugs. Zuul can<br>

> communicate the status on the patch, so the developer knows.<br>

> This will:<br>

>  - free up capacity for testing whatever fix is being done for the issue<br>

>  - avoid waste, since we know there is a high probability of spurious failures<br>

>  - provide a clear signal that the project expectation is that when<br>

> the gate is broken, fixing it is the highest priority<br>

><br>

> > Landing a gating job comes with maintenance. Maintenance in looking into<br>

> > failures, and not just running recheck. So there is an overhead to<br>

> > testing this many different configurations.<br>

> ><br>

> > I think #2 is just as important to realize as #1. As such I think we<br>

> > need to get to the point where there are a relatively small number of<br>

> > configurations that Infra/QA support, and beyond that every job needs<br>

> > sponsors. And if the job success or # of uncategorized fails go past<br>

> > some thresholds, we demote them to non-voting, and if you are non-voting<br>

> > for > 1 month, you get demoted to experimental (or some specific<br>

> > timeline, details to be sorted).<br>

><br>

><br>

> --<br>

> Robert Collins <<a href="mailto:rbtcollins@hp.com">rbtcollins@hp.com</a>><br>

> Distinguished Technologist<br>

> HP Converged Cloud<br>


</p>