[openstack-dev] Changes coming in gate structure
sean at dague.net
Wed Jan 22 20:39:58 UTC 2014
Unless you've been living under a rock, on the moon, around Saturn,
you'll have noticed that the gate has been quite backed up for the
last 2 weeks. Every time we approach a milestone this gets measurably
worse, and the expectation is that at i3 we're going to see at least 40%
more load than we are dealing with now (if history is any indication),
which doesn't bode well.
It turns out, when you have a huge and rapidly growing Open Source
project, you keep finding scaling limits in existing software, your
software, and approaches in general. It also turns out that you need
to act defensively in situations you didn't think you'd have to worry
about. Like code reviews with 3 month old test results being put into
the review queue. Or code that *can't* pass (which a look at the logs
would show) being repeatedly reverified.
All of these things compound the fact that there are real bugs in
OpenStack, which end up having a non-linear failure effect. Once you
get past a certain point the failure rates multiply to the point where
everything stops (which happened Sunday, when we only merged 4 changes
in 24 hrs).
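To see why the failure effect is non-linear, consider a toy model (this is an illustration, not OpenStack code): with speculative gating, a dependent queue of changes is tested as a stack, and any failure resets everything behind it. As the per-change pass rate drops, merge throughput collapses much faster than linearly.

```python
# Toy model of a dependent (speculative) gate queue. Assumption:
# each change independently passes integration tests with probability p,
# and the k-th change in the queue merges only if changes 1..k all pass.

def expected_merges(p: float, n: int) -> float:
    """Expected number of the first n queued changes that merge
    before the first failure resets the queue behind it."""
    # P(change k merges) = p ** k, so sum over the queue.
    return sum(p ** k for k in range(1, n + 1))

for p in (0.99, 0.90, 0.70, 0.50):
    print(f"pass rate {p:.2f}: ~{expected_merges(p, 20):.1f} of 20 merge")
```

Even a 90% per-change pass rate merges only about 8 of 20 queued changes per window; at 50% the queue barely moves, which is roughly the "4 changes in 24 hrs" situation.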
The history of the gate structure is a long one. It was added in
Diablo when there was a project which literally would not run with
the other OpenStack components. The idea of gating the merge of
everything on everything else is to ensure we have some understanding
that OpenStack actually works, all together, for some set of
configurations. It wasn't until the Folsom cycle that we started
running these tests before human review (kind of amazing).
The gate is also based on an assumption that most of the bugs we are
catching come from outside the project, vs. bugs that are already in
the project. However, in an asynchronous system, bugs can show up only
very occasionally, get past our best efforts to detect them, then
pile up in the code base until we root them out.
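A quick back-of-the-envelope sketch (purely illustrative, not anything in the OpenStack codebase) shows why intermittent bugs slip through:

```python
# Assumption: an intermittent (e.g. race-condition) bug trips a test
# run with some small probability p, and test runs are independent.

def escape_probability(p: float, runs: int) -> float:
    """Chance the bug never shows up across `runs` test runs."""
    return (1 - p) ** runs

# A 2% flake slips past a single check run 98% of the time,
# so it usually merges; dozens of such merged bugs then compound
# in everyone's gate results.
print(escape_probability(0.02, 1))
```

Only aggregate data over many runs (which is what elastic recheck mines) makes these failure rates visible.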
Towards a Svelter Gate - Leaning on Check
We've got a plan of attack to try to maintain nearly the same level
of integration test guarantees while getting more throughput on the
merge side. This is a set of things that all have to happen at once
so we don't completely blow out the guarantees we've got in the source.
Make a clean recent Check prereq for entering gate
A huge compounding problem has been patches that can't pass being
promoted to the gate. So we're going to make Zuul able to enforce a
recent clean check result before a change enters the gate. Our working
theory of "recent" is the last 24 hrs.
If it doesn't have a recent set of check results on +A, we'll trigger
a check rerun, and if clean, it gets sent to the gate.
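The rule described above can be sketched as follows. This is a minimal illustration of the policy, not Zuul's actual implementation; the function and parameter names are assumptions.

```python
from datetime import datetime, timedelta

# Working theory of "recent" from the proposal: the last 24 hrs.
RECENT = timedelta(hours=24)

def ready_for_gate(check_passed: bool, check_finished: datetime,
                   now: datetime) -> bool:
    """A +A'd change enters the gate only with a recent, clean check
    result; otherwise a fresh check run is triggered first."""
    return check_passed and (now - check_finished) <= RECENT
```

A stale-but-clean result, or a fresh-but-failing one, both send the change back through check instead of into the gate.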
We'll also probably add a sweeper to Zuul that automatically refreshes
results on changes under active comment whose results are older than
some number of days.
The gate jobs will be trimmed down immensely. Nothing project
specific: pep8 and unit tests all ripped out, no functional test
runs. Fewer configs overall. Exactly how minimal we'll figure out as
we decide what we can live without. The floor for this would be
devstack-tempest-full and grenade.
This is basically a sanity check that the combination of patches in
flight doesn't ruin the world for everyone.
Idle Cloud for Elastic Recheck Bugs
We have actually been using the gate for double duty: both ensuring
integration, and providing a set of clean test results to figure out
which bugs in OpenStack only show up from time to time. The check
queue is way too noisy for that, as our system actually blocks tons
of bad code from getting in.
With the svelter gate, we'll need a set of background nodes to build
that dataset. But with Elasticsearch we now have the technology, so
this is good.
It will let us work these issues in parallel. These issues will still
cause people pain in getting clean results in check.
Timelines, Dangers, and Opportunities
We need changes soon. All past experience says milestone 3 is 40%
heavier than milestone 2, and nothing indicates that icehouse is going
to be any different. So Jim's put getting these required bits into
Zuul at the top of his list, and we're hoping to have them soon.
With this approach, wedging the gate is highly unlikely. However, as
we won't be running every check test again in the gate, there is a
possibility that a combination of patches might wedge the check
results for everyone (like the pg job getting wedged). So it moves
that issue around. Right now it's hard to say if that particular
issue will get
better or worse. However the Sherlock rule of gate blocks remains in
effect: once you've eliminated the impossible, any gate blocking
scenario, however improbable, will eventually happen.
It will mean that the human error of promoting non-passing code to
the gate gets stopped. That will help quite a bit. A few of us have
been manually pruning those changes out of the gate, and that helped
build merge velocity back up. The system will now enforce what we've
been doing by hand.
To summarize, the effects of these changes will be:
- 1) Decrease the impact of failures resetting the entire gate queue
by doing the heavy testing in the check queue where changes are not
dependent on each other.
- 2) Run a slimmer set of jobs in the gate queue to maintain sanity,
but not block as much on existing bugs in OpenStack.
- 3) As a result, this should increase our confidence that changes
put into the gate will pass. This will help prevent gate resets,
and the disruption they cause by needing to invalidate and restart
the whole gate queue.
And we'll be making getting this working a top priority, so we'll be
ready for Icehouse-3.
Samsung Research America
sean at dague.net / sean.dague at samsung.com