[openstack-dev] [IMPORTANT] The Gate around Feature Freeze

Dolph Mathews dolph.mathews at gmail.com
Fri Aug 23 02:37:07 UTC 2013

On Thu, Aug 22, 2013 at 7:48 PM, James E. Blair <jeblair at openstack.org> wrote:

> Monty Taylor <mordred at inaugust.com> writes:
> > The infra team has done a lot of work in prep for our favorite time of
> > year, and we've actually landed several upgrades to the gate without
> > which we'd be in particularly bad shape right now. (I'll let Jim write
> > about some of them later when he's not battling the current operational
> > issues - they're pretty spectacular) As with many scaling issues, some
> > of these upgrades have resulted in moving the point of pain further
> > along the stack. We're working on solutions to the current pain points.
> > (Or, I should say they are, because I'm on a plane headed to Burning Man
> > and not useful for much other than writing emails.)
>
> Hi!
>
> The good news is that a lot of the operational problems over the past
> few days have been corrected; we are now pretty close to the noise floor
> of infrastructure issues in the gate, and over the next few days we'll
> work to get rid of the remaining bugs.
>
> As I'm sure everyone knows, we've seen a huge growth in the project, the
> number of changes, and the number of tests we run.  That is both
> wonderful, and a little terrifying!  But we haven't been idle: we have
> made some significant improvements and innovations to the project
> infrastructure to deal with our growing load, especially during these
> peak times.
>
> About a year ago, we realized that the growing number of jobs run (and
> number of test machines on which we run those jobs) was going to cause
> scaling issues with Jenkins.  So with the help of Khai Do, we created
> the gearman-plugin[1] for Jenkins, and then we modified Zuul to use it.
> That means that Zuul isn't directly tied to Jenkins anymore, and can
> distribute the jobs it needs to run to anything that can run them via
> Gearman.
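
To make the decoupling described above a bit more concrete, here is a
minimal in-process sketch of the idea: a broker hands a named job to
whichever registered worker can run it, so the submitter never needs to
know which master does the work. This is only an illustration and does
not use Gearman or the gearman-plugin; JobBroker, register, submit and
the job name "gate-unit-tests" are hypothetical names made up for this
example.

```python
from collections import defaultdict
from itertools import cycle


class JobBroker:
    """Toy stand-in for a job server (hypothetical, not the Gearman API)."""

    def __init__(self):
        # job name -> list of (worker name, handler) pairs
        self._workers = defaultdict(list)
        # job name -> round-robin iterator over that job's workers
        self._rotation = {}

    def register(self, job_name, worker_name, handler):
        """Record that worker_name can run job_name using handler."""
        self._workers[job_name].append((worker_name, handler))
        # Rebuild the rotation so the new worker is included.
        self._rotation[job_name] = cycle(list(self._workers[job_name]))

    def submit(self, job_name, payload):
        """Run the job on the next worker in rotation; return (worker, result)."""
        if job_name not in self._rotation:
            raise LookupError("no worker registered for " + job_name)
        worker_name, handler = next(self._rotation[job_name])
        return worker_name, handler(payload)


broker = JobBroker()
# Two workers advertise the same (hypothetical) job name.
broker.register("gate-unit-tests", "jenkins01", lambda change: "ran " + change)
broker.register("gate-unit-tests", "jenkins02", lambda change: "ran " + change)

first = broker.submit("gate-unit-tests", "change-1")
second = broker.submit("gate-unit-tests", "change-2")
third = broker.submit("gate-unit-tests", "change-3")
```

The submitter only names the job; adding another worker is a single
register call and needs no change on the submitting side.
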
>
> A few weeks ago we took advantage of that by adding two new Jenkins
> masters to our system, giving us one of the first (if not the first)
> multi-master Jenkins systems.  Since then, all of the test jobs have
> been run on nodes attached to either jenkins01.openstack.org or
> jenkins02.openstack.org (which you may have seen linked to from the Zuul
> status page).  That has given us the ability to upgrade Jenkins and its
> plugins with no interruption due to the active-active nature of the
> system.  And we can add hundreds of test nodes to each of these systems
> and continue to scale them horizontally as our load increases.
>
> With Jenkins now able to scale, the next bottleneck was the number of
> test nodes.  Until recently, we had a handful of special Jenkins jobs
> which would launch and destroy the single-use nodes that are used for
> devstack tests.  We were seeing issues with Jenkins running those jobs,
> as well as their ability to keep up with demand.  So we started the
> Nodepool project[2] to create a daemon that could keep up with the
> demand for test nodes, be much more responsive, and eliminate some of
> the occasional errors that we would see in the old Rube Goldberg system
> we had for managing nodes.
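
As a rough illustration of the pooling idea described above, the sketch
below keeps a fixed number of single-use nodes ready and launches a
replacement each time one is handed out. It is not Nodepool's actual
code or API: NodePool, min_ready, launch, acquire and ready_count are
hypothetical names, and a real daemon would replenish in the background
rather than synchronously as this toy version does.

```python
from collections import deque
from itertools import count


class NodePool:
    """Toy pool of single-use test nodes (hypothetical, not Nodepool itself)."""

    def __init__(self, min_ready, launch):
        self.min_ready = min_ready  # how many idle nodes to keep on hand
        self._launch = launch       # callable that creates one new node
        self._ready = deque()
        self.replenish()

    def replenish(self):
        """Launch nodes until min_ready of them are waiting."""
        while len(self._ready) < self.min_ready:
            self._ready.append(self._launch())

    def acquire(self):
        """Hand out one node, then top the pool back up.

        Nodes are single-use, so they are never returned to the pool.
        """
        node = self._ready.popleft() if self._ready else self._launch()
        self.replenish()
        return node

    def ready_count(self):
        return len(self._ready)


ids = count(1)
pool = NodePool(min_ready=3, launch=lambda: "node-%d" % next(ids))

a = pool.acquire()
b = pool.acquire()
```

Because the pool is topped up on every acquire, a job asking for a node
gets one that has already been launched instead of waiting for a boot.
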
>
> In anticipation of the rush of patches for the feature freeze, we rolled
> that out over the weekend so it was ready to go Monday.  And it worked!
> In fact, it's extremely responsive.  It immediately utilized our entire
> capacity to supply test nodes.  Which was great, except that a lot of
> our tests are configured to use the git repos from Gerrit, which is why
> Gerrit was very slow early in the week.  Fortunately, Elizabeth Krumbach
> Joseph has been working on setting up a new Git server.  That alone is
> pretty exciting, and she's going to send an announcement about it soon.
> Since it was ready to go, we moved the test load from Gerrit to the new
> git server, which has made Gerrit much more responsive again.
> Unfortunately, the new git server still wasn't quite able to keep up
> with the test load, so Clark Boylan, Elizabeth and I have spent some
> time tuning it as well as load-balancing it across several hosts.
> That is now in place, and the new system seems able to cope with the
> load from the current rush of patches.
>
> We're still seeing an occasional issue where a job is reported as LOST
> because Jenkins is apparently unaware that it can't talk to the test
> node.  We have some workarounds in progress that we hope to have in
> place soon.
>
> Our goal is to have the most robust and accurate test system possible,
> one that can run all of the tests we can think to throw at it.  I think the
> improvements we've made recently are going to help tremendously and I'm
> pretty excited!  As always, if you'd like to pitch in, stop by
> #openstack-infra on Freenode and see what we're up to.

Wow, nice work! Thank you, infra!

> -Jim
> [1] http://git.openstack.org/cgit/openstack-infra/gearman-plugin/
> [2] http://git.openstack.org/cgit/openstack-infra/nodepool/

