[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

Robert Collins robertc at robertcollins.net
Sat Jun 14 18:11:49 UTC 2014

You know it's bad when you can't sleep because you're redesigning gate
workflows in your head.... so I apologise that this email is perhaps
not as rational, nor as organised, as usual - but, ^^^^. :)

Obviously this is very important to address, and if we can come up
with something systemic I'm going to devote my time both directly, and
via resource-hunting within HP, to address it. And accordingly I'm
going to feel free to say 'zuul this' with no regard for existing
features. We need to get ahead of the problem and figure out how to
stay there, and I think below I show why the current strategy just
won't do that.

On 13 June 2014 06:08, Sean Dague <sean at dague.net> wrote:

> We're hitting a couple of inflection points.
> 1) We're basically at capacity for the unit work that we can do. Which
> means it's time to start making decisions if we believe everything we
> currently have running is more important than the things we aren't
> currently testing.
> Everyone wants multinode testing in the gate. It would be impossible to
> support that given current resources.

How much of our capacity problems are due to waste - such as:
 - tempest runs of code the author knows is broken
 - tempest runs of code that doesn't pass unit tests
 - tempest runs while the baseline is unstable - to expand on this
one, if master only passes one commit in 4, no check job can have a
higher success rate overall.

Vs how much are an indication of the sheer volume of development being done?

> 2) We're far past the inflection point of people actually debugging jobs
> when they go wrong.
> The gate is backed up (currently to 24hrs) because there are bugs in
> OpenStack. Those are popping up at a rate much faster than the number of
> people who are willing to spend any time on them. And often they are
> popping up in configurations that we're not all that familiar with.

So, I *totally* appreciate that people fixing the jobs is the visible
expendable resource, but I'm not sure it's the bottleneck. I think the
bottleneck is our aggregate ability to a) detect the problem and b)
resolve it.

For instance - strawman - when the gate goes bad, after a check for
external issues like new SQLAlchemy releases etc., what if we just
rolled trunk of every project in the integrated gate back to before
the success rate nosedived? I'm well aware of the DVCS issues that
implies, but from a human debugging perspective that would massively
increase the leverage we get from the folk who do dive in and help. It
moves from 'figure out that there is a problem and it came in after X
AND FIX IT' to 'figure out it came in after X'.

Reverting is usually much faster and more robust than rolling forward,
because rolling forward has more unknowns.

I think we have a systematic problem, because this situation happens
again and again. And the root cause is that our time to detect
races/nondeterministic tests is a probability function, not a simple
scalar. Sometimes we catch such tests within one patch in the gate,
sometimes they slip through. If we want to land hundreds or thousands
of patches a day, and we don't want this pain to happen, I don't see
any way other than *either*:
A - not doing this whole gating CI process at all
B - making detection a whole lot more reliable (e.g. we want
near-certainty that a given commit does not contain a race)
C - making repair a whole lot faster (e.g. we want <= one test cycle
in the gate to recover once we have determined that some commit is
bad)

Taking them in turn:
A - yeah, no. We have lots of experience with the axiom that that
which is not tested is broken. And that's the big concern about
removing things from our matrix - when they are not tested, we can be
sure that they will break, and we will have to spend neurons fixing
them - either directly, or by reviewing the patches from people fixing
them.

B - this is really hard. Say we want to be quite sure that there are
no new races that will occur with more than some probability in a
given commit, and we assume that racy codepaths might be run just once
in the whole test matrix. A single test run can never tell us that -
it just tells us it worked. What we need is some N trials where we
don't observe a new race (but may observe old races), given a maximum
risk of the introduction of a (say) 5% failure rate into the gate.
[check my stats]
(1-max risk)^trials = margin-of-error
0.95^N = 0.01
log(0.01, base=0.95) = N
N ~= 90

So if we want to stop 5% races landing, and any given racy code path
may be exercised as few as once per run of the test matrix, we need to
exercise the whole test matrix 90 times to be sure, to within that 1%
margin of error, that we saw the race. Raise that to a 1% race:
log(0.01, base=0.99) = N, N ~= 458
That's a lot of test runs. I don't think we can do that for each commit
with our current resources - and I'm not at all sure that asking for
enough resources to do that makes sense. Maybe it does.
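As a sanity check on the arithmetic above, here's a short Python sketch of the same calculation (the 5%/1% figures and the once-per-matrix-run assumption are exactly the ones stated above):

```python
import math

def runs_needed(race_fail_rate, margin):
    """Number of full test-matrix runs needed so that a race which fails
    `race_fail_rate` of the time slips through with probability at most
    `margin`, assuming the racy code path runs once per matrix run."""
    # (1 - p)^N <= margin  =>  N >= log(margin) / log(1 - p)
    return math.log(margin) / math.log(1.0 - race_fail_rate)

print(round(runs_needed(0.05, 0.01)))  # ~90 runs to catch a 5% race
print(round(runs_needed(0.01, 0.01)))  # ~458 runs to catch a 1% race
```

Note how quickly N blows up as the race gets rarer - that's the core of why per-commit certainty is so expensive.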

Data point - our current risk, with 1% margin:
(1-max risk)^1 = 0.01
max risk = 99% (that is, a single passing gate run will happily let
through races with any amount of fail, given enough trials). In fact,
it's really just a numbers game for us at the moment - and we keep
losing.

B1. We could change our definition from a per-commit basis to instead
saying 'within a given number of commits we want the probability of a
new race to be low' - amortise the cost of gaining lots of confidence
over more commits. It might work something like:
 - run regular gate runs of things of deeper and deeper zuul refs
 - failures eject single commits as usual
 - don't propagate successes.
 - keep going until we have 100 commits all validated but not
propagated, *or* more than (some time window, let's say 5 hours) has
passed
 - start 500 test runs of all those commits, in parallel
 - if it fails, eject the whole window
 - otherwise let it in.

This might let races in individual commits within the window through
if and only if they are also fixed within the same window; coarse
failures like basic API incompatibility or failure to use deps right
would be detected as they are today. There's obviously room for
speculative execution on the whole window, in fact: run 600 jobs - 100
for the zuul ref build-up and 500 for the confidence-interval builder.

The downside of this approach is that there is a big window (because
it's amortising a big expense) which will all go in together, or not at
all. And we'd have to prevent *all those commits* from being
resubmitted until the cause of the failure was identified and actively
fixed. We'd want that to be enforced, not run on the honour system,
because any of those commits can bounce the whole set out. The flip
side is that it would be massively more effective at keeping bad
commits out.

B2. ??? I had some crazy idea of multiple branches with more and more
confidence in them, but I think they all actually boil down to
variations on a theme of B1, and if we move the centre of developer
mass to $wherever, the gate for that is where the pain will be felt.

C - If we can't make it harder to get races in, perhaps we can make it
easier to get races out. We have pretty solid emergent statistics from
every gate job that is run as check. What if we set a policy that when
a gate queue gets a race:
 - put a zuul stop all merges and checks on all involved branches
(prevent further damage, free capacity for validation)
 - figure out when it surfaced
 - determine it's not an external event
 - revert all involved branches back to the point where they looked
good, as one large operation
   - run that through jenkins N (e.g. 458) times in parallel.
   - on success land it
 - go through all the merges that have been reverted and either
twiddle them to be back in review with a new patchset against the
revert to restore their content, or alternatively generate new reviews
if gerrit would make that too hard.
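The 'figure out when it surfaced' step is mostly a change-point search over per-commit success statistics we already collect. A hypothetical sketch (the history format, threshold and window size are all made up for illustration):

```python
def last_good_commit(history, threshold=0.75, window=20):
    """history: list of (commit_sha, passed) pairs in merge order.
    Return the last commit before the rolling success rate nosedived
    below `threshold`, or None if the rate never dropped."""
    for i in range(len(history) - window):
        recent = history[i:i + window]
        rate = sum(1 for _, ok in recent if ok) / window
        if rate < threshold:
            # the slide started somewhere in this window; revert to just
            # before it and let humans bisect the rest
            return history[max(i - 1, 0)][0]
    return None
```

In practice old, known races add noise to the signal, so the threshold would need to sit below the pre-existing baseline failure rate rather than near 100%.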

Getting folk to help

On the social side there is currently very little direct signalling
that the gate is in trouble: I don't mean there is no communication -
there's lots. What I mean is that Fred, a developer not on the lists
or IRC for whatever reason, pushing code up, has no signal until they
wonder 'why am I not getting check results?', visit the status page,
and find out the hard way.
Maybe we can do something about that. For instance, when a gate is in
trouble, have zuul not schedule check jobs at all, and refuse to do
rechecks / revalidates in affected branches, unless the patch in
question is a partial-bug: or bug: for one of the gate bugs. Zuul can
communicate the status on the patch, so the developer knows.
This will:
 - free up capacity for testing whatever fix is being done for the issue
 - avoid waste, since we know there is a high probability of spurious failures
 - provide a clear signal that the project expectation is that when
the gate is broken, fixing it is the highest priority
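Hypothetically, the scheduling rule could be as simple as the following sketch - the gate_wedged flag, the bug-number list and the exact footer spellings are all assumptions here, not existing zuul behaviour:

```python
import re

def should_run_checks(gate_wedged, gate_bug_numbers, commit_message):
    """When the gate is wedged, only schedule check jobs for changes
    whose commit message claims (via a Closes-Bug:/Partial-Bug: style
    footer) to address one of the known gate bugs; everything else
    waits until the gate recovers."""
    if not gate_wedged:
        return True
    referenced = {int(m) for m in re.findall(
        r'(?:Partial-Bug|Closes-Bug|Related-Bug):\s*#?(\d+)',
        commit_message, re.IGNORECASE)}
    return bool(referenced & set(gate_bug_numbers))
```

Anything this refuses to schedule would get the explanation posted back on the patch, so the refusal itself becomes the signal Fred currently lacks.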

> Landing a gating job comes with maintenance. Maintenance in looking into
> failures, and not just running recheck. So there is an overhead to
> testing this many different configurations.
> I think #2 is just as important to realize as #1. As such I think we
> need to get to the point where there are a relatively small number of
> configurations that Infra/QA support, and beyond that every job needs
> sponsors. And if the job success or # of uncategorized fails go past
> some thresholds, we demote them to non-voting, and if you are non-voting
> for > 1 month, you get demoted to experimental (or some specific
> timeline, details to be sorted).

Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud
