[openstack-dev] [IMPORTANT] The Gate around Feature Freeze

James E. Blair jeblair at openstack.org
Fri Aug 23 00:48:33 UTC 2013

Monty Taylor <mordred at inaugust.com> writes:

> The infra team has done a lot of work in prep for our favorite time of
> year, and we've actually landed several upgrades to the gate without
> which we'd be in particularly bad shape right now. (I'll let Jim write
> about some of them later when he's not battling the current operational
> issues - they're pretty spectacular) As with many scaling issues, some
> of these upgrades have resulted in moving the point of pain further
> along the stack. We're working on solutions to the current pain points.
> (Or, I should say they are, because I'm on a plane headed to Burning Man
> and not useful for much other than writing emails.)


The good news is that a lot of the operational problems over the past
few days have been corrected.  We are now pretty close to the noise
floor of infrastructure issues in the gate, and over the next few days
we'll work to get rid of the remaining bugs.

As I'm sure everyone knows, we've seen huge growth in the project, the
number of changes, and the number of tests we run.  That is both
wonderful and a little terrifying!  But we haven't been idle: we have
made some significant improvements and innovations to the project
infrastructure to deal with our growing load, especially during these
peak times.

About a year ago, we realized that the growing number of jobs run (and
number of test machines on which we run those jobs) was going to cause
scaling issues with Jenkins.  So with the help of Khai Do, we created
the gearman-plugin[1] for Jenkins, and then we modified Zuul to use it.
That means that Zuul isn't directly tied to Jenkins anymore, and can
distribute the jobs it needs to run to anything that can run them via
Gearman.
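
To make the dispatch idea concrete, here is a minimal sketch in Python
of a job server handing named jobs to whichever registered worker can
run them.  It is not the real Gearman protocol or the gearman-plugin
API: ToyJobServer, its methods, and the job name are all made up for
illustration, and the round-robin choice is a simplification (a real
Gearman server gives work to workers as they ask for it).

```python
from collections import defaultdict, deque


class ToyJobServer:
    """In-process stand-in for a Gearman-style job server (hypothetical)."""

    def __init__(self):
        # Map each function (job) name to the workers that registered it.
        self.workers = defaultdict(deque)

    def register(self, worker_name, function_name):
        """A worker announces that it can run the named function."""
        self.workers[function_name].append(worker_name)

    def submit(self, function_name, payload):
        """Pick a worker for the job; the scheduler never names one itself."""
        candidates = self.workers.get(function_name)
        if not candidates:
            raise LookupError(f"no worker registered for {function_name!r}")
        worker = candidates[0]
        candidates.rotate(-1)  # simple round-robin over capable workers
        return worker, payload


server = ToyJobServer()
# Two masters both advertise the same (made-up) job name.
server.register("jenkins01", "build:example-job")
server.register("jenkins02", "build:example-job")

assignments = [
    server.submit("build:example-job", {"change": n})[0] for n in range(4)
]
print(assignments)
```

The point of the sketch is only that the submitter knows a job name, not
a particular master, so adding another worker is just one more
register() call.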

A few weeks ago we took advantage of that by adding two new Jenkins
masters to our system, giving us one of the first (if not the first)
multi-master Jenkins systems.  Since then, all of the test jobs have
been run on nodes attached to either jenkins01.openstack.org or
jenkins02.openstack.org (which you may have seen linked to from the Zuul
status page).  That has given us the ability to upgrade Jenkins and its
plugins with no interruption due to the active-active nature of the
system.  And we can add hundreds of test nodes to each of these systems
and continue to scale them horizontally as our load increases.

With Jenkins now able to scale, the next bottleneck was the number of
test nodes.  Until recently, we had a handful of special Jenkins jobs
which would launch and destroy the single-use nodes that are used for
devstack tests.  We were seeing issues with Jenkins running those jobs,
as well as with the jobs' ability to keep up with demand.  So we started the
Nodepool project[2] to create a daemon that could keep up with the
demand for test nodes, be much more responsive, and eliminate some of
the occasional errors that we would see in the old Rube-Goldberg system
we had for managing nodes.
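
As a rough sketch of the kind of bookkeeping such a daemon might do on
each pass (an assumption for illustration, not code taken from
Nodepool), the function below decides how many single-use nodes to boot
so that a target number are kept ready without exceeding capacity.  The
function name and all of its parameters are hypothetical.

```python
def nodes_to_launch(min_ready, ready, building, in_use, max_servers):
    """Return how many new nodes to boot in this pass (hypothetical helper).

    min_ready   -- how many idle nodes we want waiting for jobs
    ready       -- nodes booted and idle right now
    building    -- nodes currently booting
    in_use      -- nodes currently running a job
    max_servers -- total nodes the provider quota allows
    """
    # How far the idle pool (counting nodes on their way) is below target.
    shortfall = min_ready - (ready + building)
    # How many more nodes the quota allows in total.
    headroom = max_servers - (ready + building + in_use)
    # Never return a negative count, and never exceed the quota.
    return max(0, min(shortfall, headroom))


print(nodes_to_launch(min_ready=10, ready=3, building=2, in_use=20, max_servers=30))
```

Running a check like this in a loop, rather than launching nodes from
special Jenkins jobs, is one way a daemon can react to demand quickly.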

In anticipation of the rush of patches for the feature freeze, we rolled
that out over the weekend so it was ready to go Monday.  And it worked!

In fact, it's extremely responsive.  It immediately utilized our entire
capacity to supply test nodes.  Which was great, except that a lot of
our tests are configured to use the git repos from Gerrit, and all of
those nodes hitting it at once is why Gerrit was very slow early in the
week.  Fortunately, Elizabeth Krumbach
Joseph has been working on setting up a new Git server.  That alone is
pretty exciting, and she's going to send an announcement about it soon.
Since it was ready to go, we moved the test load from Gerrit to the new
git server, which has made Gerrit much more responsive again.
Unfortunately, the new git server still wasn't quite able to keep up
with the test load, so Clark Boylan, Elizabeth and I have spent some
time tuning it as well as load-balancing it across several hosts.

That is now in place, and the new system seems able to cope with the
load from the current rush of patches.

We're still seeing an occasional issue where a job is reported as LOST
because Jenkins is apparently unaware that it can't talk to the test
node.  We have some workarounds in progress that we hope to have in
place soon.

Our goal is to have the most robust and accurate test system possible,
one that can run all of the tests we can think to throw at it.  I think the
improvements we've made recently are going to help tremendously and I'm
pretty excited!  As always, if you'd like to pitch in, stop by
#openstack-infra on Freenode and see what we're up to.


[1] http://git.openstack.org/cgit/openstack-infra/gearman-plugin/
[2] http://git.openstack.org/cgit/openstack-infra/nodepool/

