[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

Joe Gordon joe.gordon0 at gmail.com
Mon Jun 16 18:49:16 UTC 2014

On Jun 14, 2014 11:12 AM, "Robert Collins" <robertc at robertcollins.net> wrote:
> You know its bad when you can't sleep because you're redesigning gate
> workflows in your head.... so I apologise that this email is perhaps
> not as rational, nor as organised, as usual - but , ^^^^. :)
> Obviously this is very important to address, and if we can come up
> with something systemic I'm going to devote my time both directly, and
> via resource-hunting within HP, to address it. And accordingly I'm
> going to feel free to say 'zuul this' with no regard for existing
> features. We need to get ahead of the problem and figure out how to
> stay there, and I think below I show why the current strategy just
> won't do that.
> On 13 June 2014 06:08, Sean Dague <sean at dague.net> wrote:
> > We're hitting a couple of inflection points.
> >
> > 1) We're basically at capacity for the unit work that we can do. Which
> > means it's time to start making decisions if we believe everything we
> > currently have running is more important than the things we aren't
> > currently testing.
> >
> > Everyone wants multinode testing in the gate. It would be impossible to
> > support that given current resources.
> How much of our capacity problems are due to waste - such as:
>  - tempest runs of code the author knows is broken
>  - tempest runs of code that doesn't pass unit tests
>  - tempest runs while the baseline is unstable - to expand on this
> one, if master only passes one commit in 4, no check job can have a
> higher success rate overall.
> Vs how much is an indication of the sheer volume of development being done?
> > 2) We're far past the inflection point of people actually debugging jobs
> > when they go wrong.
> >
> > The gate is backed up (currently to 24hrs) because there are bugs in
> > OpenStack. Those are popping up at a rate much faster than the number of
> > people who are willing to spend any time on them. And often they are
> > popping up in configurations that we're not all that familiar with.
> So, I *totally* appreciate that people fixing the jobs is the visible
> expendable resource, but I'm not sure it's the bottleneck. I think the
> bottleneck is our aggregate ability to a) detect the problem and b)
> resolve it.
> For instance - strawman - if when the gate goes bad, after a check for
> external issues like new SQLAlchemy releases etc, what if we just
> rolled trunk of every project that is in the integrated gate back to
> before the success rate nosedived? I'm well aware of the DVCS issues
> that implies, but from a human debugging perspective that would
> massively increase the leverage we get from the folk that do dive in
> and help. It moves from 'figure out that there is a problem and it
> came in after X AND FIX IT' to 'figure out it came in after X'.
> Reverting is usually much faster and more robust than rolling forward,
> because rolling forward has more unknowns.
> I think we have a systematic problem, because this situation happens
> again and again. And the root cause is that our time to detect
> races/nondeterministic tests is a probability function, not a simple
> scalar. Sometimes we catch such tests within one patch in the gate,
> sometimes they slip through. If we want to land hundreds or thousands
> of patches a day, and we don't want this pain to happen, I don't see
> any way other than *either*:
> A - not doing this whole gating CI process at all
> B - making detection a whole lot more reliable (e.g. we want
> near-certainty that a given commit does not contain a race)
> C - making repair a whole lot faster (e.g. we want <= one test cycle
> in the gate to recover once we have determined that some commit is
> broken).
> Taking them in turn:
> A - yeah, no. We have lots of experience with the axiom that that
> which is not tested is broken. And that's the big concern about
> removing things from our matrix - when they are not tested, we can be
> sure that they will break and we will have to spend neurons fixing
> them - either directly or as reviews from people fixing it.
> B - this is really hard. Say we want to be quite sure that there are no
> new races that will occur with more than some probability in a given
> commit, and we assume that race codepaths might be run just once in
> the whole test matrix. A single test run can never tell us that - it
> just tells us it worked. What we need is some N trials where we don't
> observe a new race (but may observe old races), given a maximum risk
> of the introduction of a (say) 5% failure rate into the gate. [check
> my stats]
> (1-max risk)^trials = margin-of-error
> 0.95^N = 0.01
> log(0.01, base=0.95) = N
> N ~= 90
> So if we want to stop 5% races landing, and we may exercise any given
> possible race code path a minimum of once in the test matrix, we
> need to exercise the whole test matrix 90 times to be sure, within that
> 1% margin, that we saw it. Raise that to a 1% race:
> log(0.01, base=0.99) ≈ 458
> That's a lot of test runs. I don't think we can do that for each commit
> with our current resources - and I'm not at all sure that asking for
> enough resources to do that makes sense. Maybe it does.
> Data point - our current risk, with a 1% margin:
> (1 - max risk)^1 = 0.01, so max risk = 99%
> (that is, a single passing gate run will happily let through races
> with any amount of fail, given enough trials). In fact, it's really
> just a numbers game for us at the moment - and we keep losing.
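
The arithmetic above can be sanity-checked with a couple of lines of Python (`runs_needed` is my name, not anything in the infra tooling):

```python
import math

def runs_needed(race_rate, margin):
    """Number of full test-matrix runs needed so that a race which fails
    with probability race_rate per run escapes detection with probability
    at most margin: solve (1 - race_rate)**N <= margin for N."""
    return math.ceil(math.log(margin) / math.log(1.0 - race_rate))

print(runs_needed(0.05, 0.01))  # -> 90 runs to catch a 5% race
print(runs_needed(0.01, 0.01))  # -> 459 runs for a 1% race (~458 above)
```
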
> B1. We could change our definition from a per-commit basis to instead
> saying 'within a given number of commits we want the probability of a
> new race to be low' - amortise the cost of gaining lots of confidence
> over more commits. It might work something like:
>  - run regular gate runs of things of deeper and deeper zuul refs
>  - failures eject single commits as usual
>  - don't propagate successes.
>  - keep going until we have 100 commits all validated but not propagated,
> *or* more than (some time window, let's say 5 hours) has passed
>  - start 500 test runs of all those commits, in parallel
>  - if it fails, eject the whole window
>  - otherwise let it in.
> This might let races in individual commits within the window through
> if and only if they are also fixed within the same window; coarse
> failures like basic API incompatibility or failure to use deps right
> would be detected as they are today. There's obviously room for
> speculative execution on the whole window in fact: run 600 jobs, 100
> the zuul ref build-up and 500 the confidence interval builder.
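
As a sketch only - the B1 scheme above might look something like this, where `gate_run`, `eject`, `merge_window` and `run_many` are hypothetical stand-ins for zuul/gerrit operations, not real APIs:

```python
import time

WINDOW_COMMITS = 100       # validated-but-unpropagated commits per window
WINDOW_SECONDS = 5 * 3600  # or close the window after 5 hours
CONFIDENCE_RUNS = 500      # parallel runs over the whole window

def amortised_gate(queue, gate_run, eject, merge_window, run_many):
    window, started = [], time.time()
    # Phase 1: regular zuul-ref-style runs, ejecting single failures.
    while queue and len(window) < WINDOW_COMMITS \
            and time.time() - started < WINDOW_SECONDS:
        commit = queue.pop(0)
        if gate_run(window + [commit]):
            window.append(commit)   # validated, but not yet propagated
        else:
            eject(commit)           # failures eject single commits as usual
    # Phase 2: many parallel runs of the whole window as a confidence check.
    if all(run_many(window, CONFIDENCE_RUNS)):
        merge_window(window)        # otherwise let it in
    else:
        for commit in window:
            eject(commit)           # if it fails, eject the whole window
```
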
> The downside of this approach is that there is a big window (because
> it's amortising a big expense) which will all go in together, or not at
> all. And we'd have to prevent *all those commits* from being
> resubmitted until the cause of the failure was identified and actively
> fixed. We'd want that to be enforced, not run on the honour system,
> because any of those commits can bounce the whole set out. The flip
> side is that it would be massively more effective at keeping bad
> commits out.
> B2. ??? I had some crazy idea of multiple branches with more and more
> confidence in them, but I think they all actually boil down to
> variations on a theme of B1, and if we move the centre of developer
> mass to $wherever, the gate for that is where the pain will be felt.
> C - If we can't make it harder to get races in, perhaps we can make it
> easier to get races out. We have pretty solid emergent statistics from
> every gate job that is run as check. What if we set a policy that when a
> gate queue gets a race:
>  - put a zuul stop all merges and checks on all involved branches
> (prevent further damage, free capacity for validation)
>  - figure out when it surfaced
>  - determine it's not an external event
>  - revert all involved branches back to the point where they looked
> good, as one large operation
>    - run that through jenkins N (e.g. 458) times in parallel.

Do we have enough compute resources to do this?

>    - on success land it
>  - go through all the merges that have been reverted and either
> twiddle them to be back in review with a new patchset against the
> revert to restore their content, or alternatively generate new reviews
> if gerrit would make that too hard.
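
The "figure out when it surfaced" step could lean on those emergent per-run statistics; a hypothetical sketch, with illustrative names and thresholds rather than real tooling:

```python
def last_good_index(runs, window=20, threshold=0.75):
    """Given ordered (commit, passed) gate records oldest-to-newest,
    return the index of the commit where the rolling success rate
    over `window` runs first dropped below `threshold` - i.e. the
    point to revert all involved branches back to."""
    for i in range(window, len(runs) + 1):
        recent = runs[i - window:i]
        rate = sum(1 for _, passed in recent if passed) / window
        if rate < threshold:
            # conservatively revert from the start of this window onward
            return i - window
    return len(runs) - 1  # gate still looks healthy
```

This errs on the side of reverting a few good commits, which matches the proposal: revert as one large operation back to a point that looked good, then restore content via new patchsets.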
> Getting folk to help
> ==============
> On the social side there is currently very little direct signalling
> that the gate is in trouble: I don't mean there is no communication -
> there's lots. What I mean is that Fred, a developer not on the lists
> or IRC for whatever reason, pushing code up, has no signal until they
> go 'why am I not getting check results', visit the status page and go
> 'whoa'.
> Maybe we can do something about that. For instance, when a gate is in
> trouble, have zuul not schedule check jobs at all, and refuse to do
> rechecks / revalidates in affected branches, unless the patch in
> question is a partial-bug: or bug: for one of the gate bugs. Zuul can
> communicate the status on the patch, so the developer knows.
> This will:
>  - free up capacity for testing whatever fix is being done for the issue
>  - avoid waste, since we know there is a high probability of spurious failures
>  - provide a clear signal that the project expectation is that when
> the gate is broken, fixing it is the highest priority
> > Landing a gating job comes with maintenance. Maintenance in looking into
> > failures, and not just running recheck. So there is an overhead to
> > testing this many different configurations.
> >
> > I think #2 is just as important to realize as #1. As such I think we
> > need to get to the point where there are a relatively small number of
> > configurations that Infra/QA support, and beyond that every job needs
> > sponsors. And if the job's failure rate or # of uncategorized failures
> > goes past some threshold, we demote them to non-voting, and if you are
> > non-voting for > 1 month, you get demoted to experimental (or some
> > specific timeline, details to be sorted).
> --
> Robert Collins <rbtcollins at hp.com>
> Distinguished Technologist
> HP Converged Cloud
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev