[openstack-dev] [IMPORTANT] The Gate around Feature Freeze
Russell Bryant
rbryant at redhat.com
Fri Aug 23 19:24:53 UTC 2013
On 08/22/2013 10:37 PM, Dolph Mathews wrote:
>
> On Thu, Aug 22, 2013 at 7:48 PM, James E. Blair <jeblair at openstack.org> wrote:
>
> Monty Taylor <mordred at inaugust.com> writes:
>
> > The infra team has done a lot of work in prep for our favorite time of
> > year, and we've actually landed several upgrades to the gate without
> > which we'd be in particularly bad shape right now. (I'll let Jim write
> > about some of them later when he's not battling the current operational
> > issues - they're pretty spectacular) As with many scaling issues, some
> > of these upgrades have resulted in moving the point of pain further
> > along the stack. We're working on solutions to the current pain points.
> > (Or, I should say they are, because I'm on a plane headed to Burning Man
> > and not useful for much other than writing emails.)
>
> Hi!
>
> The good news is that a lot of the operational problems over the past
> few days have been corrected, we are now pretty close to the noise floor
> of infrastructure issues in the gate, and over the next few days we'll
> work to get rid of the remaining bugs.
>
> As I'm sure everyone knows, we've seen a huge growth in the project, the
> number of changes, and the number of tests we run. That is both
> wonderful, and a little terrifying! But we haven't been idle: we have
> made some significant improvements and innovations to the project
> infrastructure to deal with our growing load, especially during these
> peak times.
>
> About a year ago, we realized that the growing number of jobs run (and
> number of test machines on which we run those jobs) was going to cause
> scaling issues with Jenkins. So with the help of Khai Do, we created
> the gearman-plugin[1] for Jenkins, and then we modified Zuul to use it.
> That means that Zuul isn't directly tied to Jenkins anymore, and can
> distribute the jobs it needs to run to anything that can run them via
> Gearman.
>
> A few weeks ago we took advantage of that by adding two new Jenkins
> masters to our system, giving us one of the first (if not the first)
> multi-master Jenkins systems. Since then, all of the test jobs have
> been run on nodes attached to either jenkins01.openstack.org or
> jenkins02.openstack.org (which you may have seen linked to from the Zuul
> status page). That has given us the ability to upgrade Jenkins and its
> plugins with no interruption due to the active-active nature of the
> system. And we can add hundreds of test nodes to each of these systems
> and continue to scale them horizontally as our load increases.
>
> With Jenkins now able to scale, the next bottleneck was the number of
> test nodes. Until recently, we had a handful of special Jenkins jobs
> which would launch and destroy the single-use nodes that are used for
> devstack tests. We were seeing issues with Jenkins running those jobs,
> as well as their ability to keep up with demand. So we started the
> Nodepool project[2] to create a daemon that could keep up with the
> demand for test nodes, be much more responsive, and eliminate some of
> the occasional errors that we would see in the old Rube-Goldberg system
> we had for managing nodes.
>
> In anticipation of the rush of patches for the feature freeze, we rolled
> that out over the weekend so it was ready to go Monday. And it worked!
>
> In fact, it's extremely responsive. It immediately utilized our entire
> capacity to supply test nodes. Which was great, except that a lot of
> our tests are configured to use the git repos from Gerrit, which is why
> Gerrit was very slow early in the week. Fortunately, Elizabeth Krumbach
> Joseph has been working on setting up a new Git server. That alone is
> pretty exciting, and she's going to send an announcement about it soon.
> Since it was ready to go, we moved the test load from Gerrit to the new
> git server, which has made Gerrit much more responsive again.
> Unfortunately, the new git server still wasn't quite able to keep up
> with the test load, so Clark Boylan, Elizabeth and I have spent some
> time tuning it as well as load-balancing it across several hosts.
>
> That is now in place, and the new system seems able to cope with the
> load from the current rush of patches.
>
> We're still seeing an occasional issue where a job is reported as LOST
> because Jenkins is apparently unaware that it can't talk to the test
> node. We have some workarounds in progress that we hope to have in
> place soon.
>
> Our goal is to have the most robust and accurate test system possible,
> that can run all of the tests we can think to throw at it. I think the
> improvements we've made recently are going to help tremendously and I'm
> pretty excited! As always, if you'd like to pitch in, stop by
> #openstack-infra on Freenode and see what we're up to.
>
>
> Wow, nice work! Thank you, infra!
+1000
I am continually amazed by the work you guys do. It has been a key
factor in our ability to move so fast. Thanks for everything!
--
Russell Bryant