[openstack-dev] [IMPORTANT] The Gate around Feature Freeze

Russell Bryant rbryant at redhat.com
Fri Aug 23 19:24:53 UTC 2013


On 08/22/2013 10:37 PM, Dolph Mathews wrote:
> 
> On Thu, Aug 22, 2013 at 7:48 PM, James E. Blair <jeblair at openstack.org> wrote:
> 
>     Monty Taylor <mordred at inaugust.com> writes:
> 
>     > The infra team has done a lot of work in prep for our favorite time of
>     > year, and we've actually landed several upgrades to the gate without
>     > which we'd be in particularly bad shape right now. (I'll let Jim write
>     > about some of them later when he's not battling the current operational
>     > issues - they're pretty spectacular) As with many scaling issues, some
>     > of these upgrades have resulted in moving the point of pain further
>     > along the stack. We're working on solutions to the current pain points.
>     > (Or, I should say they are, because I'm on a plane headed to Burning Man
>     > and not useful for much other than writing emails.)
> 
>     Hi!
> 
>     The good news is that a lot of the operational problems over the past
>     few days have been corrected; we are now pretty close to the noise floor
>     of infrastructure issues in the gate, and over the next few days we'll
>     work to get rid of the remaining bugs.
> 
>     As I'm sure everyone knows, we've seen a huge growth in the project, the
>     number of changes, and the number of tests we run.  That is both
>     wonderful, and a little terrifying!  But we haven't been idle: we have
>     made some significant improvements and innovations to the project
>     infrastructure to deal with our growing load, especially during these
>     peak times.
> 
>     About a year ago, we realized that the growing number of jobs run (and
>     number of test machines on which we run those jobs) was going to cause
>     scaling issues with Jenkins.  So with the help of Khai Do, we created
>     the gearman-plugin[1] for Jenkins, and then we modified Zuul to use it.
>     That means that Zuul isn't directly tied to Jenkins anymore, and can
>     distribute the jobs it needs to run to anything that can run them via
>     Gearman.
> 
>     A few weeks ago we took advantage of that by adding two new Jenkins
>     masters to our system, giving us one of the first (if not the first)
>     multi-master Jenkins systems.  Since then, all of the test jobs have
>     been run on nodes attached to either jenkins01.openstack.org or
>     jenkins02.openstack.org (which you may have seen linked to from the Zuul
>     status page).  That has given us the ability to upgrade Jenkins and its
>     plugins with no interruption due to the active-active nature of the
>     system.  And we can add hundreds of test nodes to each of these systems
>     and continue to scale them horizontally as our load increases.
> 
>     With Jenkins now able to scale, the next bottleneck was the number of
>     test nodes.  Until recently, we had a handful of special Jenkins jobs
>     which would launch and destroy the single-use nodes that are used for
>     devstack tests.  We were seeing issues with Jenkins running those jobs,
>     as well as their ability to keep up with demand.  So we started the
>     Nodepool project[2] to create a daemon that could keep up with the
>     demand for test nodes, be much more responsive, and eliminate some of
>     the occasional errors that we would see in the old Rube-Goldberg system
>     we had for managing nodes.
> 
>     In anticipation of the rush of patches for the feature freeze, we rolled
>     that out over the weekend so it was ready to go Monday.  And it worked!
> 
>     In fact, it's extremely responsive.  It immediately utilized our entire
>     capacity to supply test nodes.  Which was great, except that a lot of
>     our tests are configured to use the git repos from Gerrit, which is why
>     Gerrit was very slow early in the week.  Fortunately, Elizabeth Krumbach
>     Joseph has been working on setting up a new Git server.  That alone is
>     pretty exciting, and she's going to send an announcement about it soon.
>     Since it was ready to go, we moved the test load from Gerrit to the new
>     git server, which has made Gerrit much more responsive again.
>     Unfortunately, the new git server still wasn't quite able to keep up
>     with the test load, so Clark Boylan, Elizabeth and I have spent some
>     time tuning it as well as load-balancing it across several hosts.
> 
>     That is now in place, and the new system seems able to cope with the
>     load from the current rush of patches.
> 
>     We're still seeing an occasional issue where a job is reported as LOST
>     because Jenkins is apparently unaware that it can't talk to the test
>     node.  We have some workarounds in progress that we hope to have in
>     place soon.
> 
>     Our goal is to have the most robust and accurate test system possible,
>     that can run all of the tests we can think to throw at it.  I think the
>     improvements we've made recently are going to help tremendously and I'm
>     pretty excited!  As always, if you'd like to pitch in, stop by
>     #openstack-infra on Freenode and see what we're up to.
> 
> 
> Wow, nice work! Thank you, infra!

+1000

I am continually amazed by the work you guys do.  It has been a key
factor in our ability to move so fast.  Thanks for everything!

-- 
Russell Bryant


