[openstack-dev] Changes coming in gate structure

John Griffith john.griffith at solidfire.com
Wed Jan 22 20:53:45 UTC 2014

On Wed, Jan 22, 2014 at 1:39 PM, Sean Dague <sean at dague.net> wrote:
> ================================
> Changes coming in gate structure
> ================================
> Unless you've been living under a rock, on the moon, around Saturn,
> you'll have noticed that the gate has been quite backed up the last 2
> weeks. Every time we get towards a milestone this gets measurably
> worse, and the expectation is that at i3 we're going to see at least 40%
> more load than we are dealing with now (if history is any indication),
> which doesn't bode well.
> It turns out, when you have a huge and rapidly growing Open Source
> project, you keep finding scaling limits in existing software, your
> software, and approaches in general. It also turns out that you find
> out that you need to act defensively on situations that you didn't
> think you'd have to worry about. Like code reviews with 3-month-old
> test results being put into the review queue. Or code that *can't*
> pass (which a look at the logs would show) being reverified in the
> gate.
> All of these things compound on the fact that there are real bugs in
> OpenStack, which end up having a non-linear failure effect. Once you
> get past a certain point the failure rates multiply to the point where
> everything stops (which happened Sunday, when we only merged 4 changes
> in 24 hrs).
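One rough way to see the non-linear effect described above: if each change in a dependent queue fails spuriously with some independent probability p, the chance that a queue of n changes gets through without a reset is (1 - p)^n. The sketch below is only an illustration under that independence assumption. The function name and the sample rates are hypothetical, not measured gate figures.

```python
def clean_queue_probability(spurious_failure_rate: float, queue_length: int) -> float:
    """Chance that every change in a dependent queue passes on the first try.

    Assumes failures are independent across changes, which is a
    simplification of how real gate failures behave.
    """
    if not 0.0 <= spurious_failure_rate <= 1.0:
        raise ValueError("spurious_failure_rate must be between 0 and 1")
    if queue_length < 0:
        raise ValueError("queue_length must not be negative")
    return (1.0 - spurious_failure_rate) ** queue_length


if __name__ == "__main__":
    # Illustrative rates only; these are not measurements from the gate.
    for rate in (0.01, 0.05, 0.10):
        for length in (1, 10, 20):
            prob = clean_queue_probability(rate, length)
            print(f"rate={rate:.2f} queue={length:2d} -> clean={prob:.3f}")
```

Even a small per-change failure rate shrinks the chance of a clean run quickly as the queue grows, which is the compounding effect the paragraph above describes.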
> The history of the gate structure is a long one. It was added in
> Diablo when there was a project which literally would not run with
> the other OpenStack components. The idea of gating merge of everything
> on everything else is to ensure we have some understanding that
> OpenStack actually works, all together, for some set of
> configurations.
> It wasn't until the Folsom cycle that we started running these tests
> before human review (kind of amazing).
> The gate is also based on an assumption that most of the bugs we are
> catching are outside the project, vs. bugs that are already in the
> project. However, in an asynchronous system, bugs can show up only
> very occasionally, and get past our best efforts to detect them, then
> pile up in the code base until we root them out.
> =========================================
> Towards a Svelter Gate - Leaning on Check
> =========================================
> We've got a current plan of attack to try to maintain nearly the same
> level of integration test guarantees, and hope to make it so on the
> merge side we're able to get more throughput. This is a set of things
> that all have to happen at once to not completely blow out the
> guarantees we've got in the source.
> Make a clean recent Check prereq for entering gate
> ==================================================
> A huge compounding problem has been patches that can't pass being
> promoted to the gate. So we're going to make Zuul able to enforce a
> recent clean check scorecard before going into the gate. Our working
> theory of recent is last 24hrs.
> If it doesn't have a recent set of check results on +A, we'll trigger
> a check rerun, and if clean, it gets sent to the gate.
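The decision on +A described above could be sketched roughly as follows. This is not Zuul's actual code or API. The function name, the return values, and the "hold" outcome for a recent but failing check are assumptions made for illustration. Only the 24-hour window comes from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

# "Recent" per the working theory in the text: the last 24 hours.
MAX_CHECK_AGE = timedelta(hours=24)


def gate_action(
    last_check_at: Optional[datetime],
    last_check_clean: bool,
    now: datetime,
    max_age: timedelta = MAX_CHECK_AGE,
) -> str:
    """Decide what to do with a change when it is approved (+A).

    Returns one of three hypothetical action names:
      "recheck" - no check results, or results older than max_age
      "enqueue" - recent, clean check results; send to the gate
      "hold"    - recent but failing results (assumed behaviour)
    """
    if last_check_at is None or now - last_check_at > max_age:
        return "recheck"
    if last_check_clean:
        return "enqueue"
    return "hold"


if __name__ == "__main__":
    now = datetime(2014, 1, 22, 20, 0, 0)
    print(gate_action(now - timedelta(hours=2), True, now))
    print(gate_action(now - timedelta(days=90), True, now))
    print(gate_action(None, False, now))
```

After a triggered recheck finishes, the same function can be called again with the fresh result, so a clean rerun yields "enqueue".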
> We'll also probably add a sweeper to zuul so it will refresh results
> on changes that are getting comments on them that are older than some
> number of days automatically.
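A sweeper along those lines might pick its targets like this. It is a minimal sketch, not the planned Zuul feature. The `Change` record and its fields are hypothetical, and the age threshold is left as a required parameter because the text only says "some number of days".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Change:
    """Hypothetical, minimal view of a review under discussion."""
    number: int
    last_check_at: datetime
    last_comment_at: datetime


def changes_to_refresh(
    changes: Iterable[Change],
    now: datetime,
    max_result_age: timedelta,
) -> List[int]:
    """Return the numbers of changes whose check results should be rerun.

    A change qualifies when its check results are older than
    max_result_age and someone has commented since those results
    were produced.
    """
    return [
        change.number
        for change in changes
        if now - change.last_check_at > max_result_age
        and change.last_comment_at > change.last_check_at
    ]


if __name__ == "__main__":
    now = datetime(2014, 1, 22)
    sample = [
        Change(1, now - timedelta(days=30), now - timedelta(days=1)),
        Change(2, now - timedelta(days=30), now - timedelta(days=40)),
        Change(3, now - timedelta(days=1), now),
    ]
    print(changes_to_refresh(sample, now, timedelta(days=7)))
```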
> Svelte Gate
> ===========
> The gate jobs will be trimmed down immensely. Nothing project
> specific, so pep8 / unit tests all ripped out, no functional test
> runs. Fewer overall configs. Exactly how minimal we'll figure out as we
> decide what we can live without. The floor for this would be
> devstack-tempest-full and grenade.
> This is basically a sanity check that the combination of patches in
> flight doesn't ruin the world for everyone.
> Idle Cloud for Elastic Recheck Bugs
> ===================================
> We have actually been using the gate for double duty, both to ensure
> integration and as a set of clean test results to figure out
> what bugs are in OpenStack that only show up from time to time. The
> check queue is way too noisy, as our system actually blocks tons of
> bad code from getting in.
> With the Svelte gate, we'll need a set of background nodes to build
> that dataset. But with Elasticsearch we now have the technology, so
> this is good.
> It will let us work these issues in parallel. These issues will still
> cause people pain in getting clean results in check.
> =====================================
> Timelines, Dangers, and Opportunities
> =====================================
> We need changes soon. Every past experience is that milestone 3 is 40%
> heavier than milestone 2, and nothing indicates that icehouse is going
> to be any different. So Jim's put getting these required bits into
> Zuul to the top of his list, and we're hoping we'll have them within a
> week.
> With this approach, wedging the gate is highly unlikely. However as we
> won't be testing every check test again in gate, it means there is a
> possibility that a combination of patches might make the check results
> wedge for everyone (like pg job gets wedged). So it moves that issue
> around. Right now it's hard to say if that particular issue will get
> better or worse. However the Sherlock rule of gate blocks remains in
> effect: once you've eliminated the impossible, any gate blocking
> scenario, however improbable, will eventually happen.
> It will mean that the human error of promoting non passing code to the
> gate will get stopped. That will help quite a bit. A few of us have
> been manually pruning those changes out of the gate, and that helped
> build up merge velocity again. The system will now work like we've
> seen it needs to.
> ==========================
> Executive Summary
> ==========================
> To summarize, the effects of these changes will be:
>  - 1) Decrease the impact of failures resetting the entire gate queue
>    by doing the heavy testing in the check queue where changes are not
>    dependent on each other.
>  - 2) Run a slimmer set of jobs in the gate queue to maintain sanity,
>    but not block as much on existing bugs in OpenStack.
>  - 3) As a result, this should increase our confidence that changes
>    put into the gate will pass. This will help prevent gate resets,
>    and the disruption they cause by needing to invalidate and restart
>    the whole gate queue.
> And we'll be making getting this working a top priority, so we'll be
> ready for Icehouse-3.
> --
> Sean Dague
> Samsung Research America
> sean at dague.net / sean.dague at samsung.com
> http://dague.net
Sounds like a great strategy, Sean. You and everyone involved should feel
free to grab me on IRC if there's anything I can help with.