[Openstack] a week of shaky dev infrastructure

Jay Pipes jaypipes at gmail.com
Sat Feb 11 16:47:28 UTC 2012


Monty, this was a great description of the issues that plagued the CI 
system this week; thank you!

I'd like to point out that I very much admire the work you, Jim and 
Andrew Hutchings have been doing on the Gerrit and Jenkins tooling. I 
think the way that you work with upstream projects and make contributing 
fixes upstream a priority is something that we can all use as an example 
of excellent open source community work.

Cheers,
-jay

On 02/10/2012 07:10 PM, Monty Taylor wrote:
> Hi all!
>
> It's been a rough week from a CI and dev infrastructure perspective. I
> wanted to let people know what happened, what we did about it, and where
> it's going.
>
> First of all, we had 4 different things happen this week (when it rains,
> it pours):
>
> Rackspace Cloud changed the compute service name
> github https connections started hanging
> Ubuntu Oneiric updated their kernel
> Rackspace Cloud started handing us servers without networking
>
> I'll cover them one by one, but first, for those of you who don't know -
> we spin up cloud servers on demand on an account that Rackspace has
> provided. Normally, this is a great thing. Sometimes, being a cloud,
> it's shaky, and we do a number of things to guard against that,
> including pre-creating a pool of servers. Sometimes, the world conspires
> against us and all of that is for naught.
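>
> To make the pool bit concrete, here's an illustrative sketch - the names
> and the pool size here are made up, it's not our actual provisioning code:
>
>   # A periodic job keeps a minimum number of ready-to-use devstack nodes
>   # around, so test runs don't have to wait on node creation.
>   POOL_MIN = 10  # illustrative target size
>
>   def replenish(list_ready_nodes, create_node):
>       """Top the pool back up to POOL_MIN ready nodes."""
>       shortfall = POOL_MIN - len(list_ready_nodes())
>       for _ in range(max(shortfall, 0)):
>           create_node()  # provision one more node via the cloud API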
>
> As part of a longer term solution, we have an engineer working on
> completing a plugin for Jenkins that will handle all of the provisioning
> BEFORE test time - so that if we can't spin up nodes to run tests on,
> we'll simply queue up tests rather than cause failures. I'm mentioning
> that because you are all an audience of cloud engineers, so I don't want
> you to think we're not working on the real solution. However, that's
> still probably 3 or 4 weeks out from being finished, so in the meantime,
> we have to do this.
>
> Now, for the details:
>
> 1) Rackspace Cloud changed the compute service name
>
> The Cloud Servers API changed the service name this week from
> cloudServers to cloudServersLegacy. This caused libcloud, which is the
> basis of the scripts that we use to provision nodes for devstack
> integration tests, to fail, which meant that the job that spins up our
> pool of available servers wasn't able to replenish the pool.
>
> Once we identified the problem (with the help of the libcloud folks), we
> put in a local patch that uses both names until Rackspace rolled back
> the service name change.  But there were several hours in there where we
> simply couldn't spin up servers. This was the basis of a large portion
> of yesterday's problems.
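>
> For the curious, the local patch boils down to accepting either name
> when picking the compute endpoint out of the auth service catalog -
> roughly like this (a hypothetical sketch, not the actual libcloud code):
>
>   # Accept either service name until the rename settles down.
>   ACCEPTED_NAMES = ('cloudServers', 'cloudServersLegacy')
>
>   def find_compute_endpoint(service_catalog):
>       """service_catalog: a list of dicts shaped like
>       {'name': 'cloudServers', 'endpoints': [{'publicURL': ...}]}."""
>       for entry in service_catalog:
>           if entry.get('name') in ACCEPTED_NAMES:
>               return entry['endpoints'][0]['publicURL']
>       raise LookupError('no compute service found in catalog')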
>
> 2) github https connections started hanging
>
> We had a few intermittent github outages this week. Normally this
> shouldn't be too much of a problem, but (lucky us) we uncovered a bug
> with the URL Change Trigger plugin for Jenkins that we were using. The
> bug was that it wasn't setting a TCP connect timeout, so if the remote
> end stopped responding, the connect call would hang. Still not a huge deal,
> right? WELL - we use that plugin as a part of a scheduled job, which
> runs inside of the Cron thread inside of Jenkins ... so what happened
> was that the TCP hang caused the thread to jam, which caused ALL jobs
> that ran off of a scheduled timer to just stop running. This is the
> reason for the exhaustion of devstack nodes Tuesday, Wednesday and Thursday.
>
> Once we figured out what was going on, we patched the problem, submitted
> it upstream and they made a new release, which we upgraded to
> yesterday... so we should not suffer from this problem again.
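>
> The plugin itself is Java, but the gist of the fix is just "never do a
> network call without a connect timeout". In Python terms (purely an
> illustrative sketch - the URL is only an example):
>
>   import socket
>   from urllib.request import urlopen
>
>   FEED_URL = 'https://github.com/openstack/nova/commits/master.atom'
>
>   try:
>       # Without a timeout, a connect to a hung host can block forever,
>       # which is exactly what wedged the Jenkins cron thread. With one,
>       # the call raises and the poll just comes up empty this time.
>       body = urlopen(FEED_URL, timeout=10).read()
>   except (socket.timeout, OSError):
>       body = None  # treat a slow/hung endpoint as "no change this poll"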
>
> Longer term, we're finishing work on a patch to the gerrit trigger
> plugin so that we can stop polling github for post-merge changes and
> instead just respond to merge events in gerrit. (ping me if you want to
> know why we need to write a patch for that)
>
> 3) Ubuntu Oneiric updated their kernel
>
> We're still working on the why of the breakage here. We update the base
> image we use for launching devstack nodes nightly, so that the spin up
> time is lower, but due to intermittent cloud issues, that hasn't been
> working properly for a few days. Last night it started working again,
> and the base image updated. Unfortunately, an update to Ubuntu itself
> left us without some headers that we need for iscsi to work properly.
> This borked up the stable/diablo branch testing for nova pretty hard.
> We've fixed this moving forward by explicitly adding the package that
> has the headers... the question of why it worked before the update is
> still under investigation.
>
> Longer term we're going to construct these test nodes in a different
> way, and we've discussed applying gating logic to them so that we don't
> add a new node base as a usable one until it's passed the trunk
> tests. (I'd personally like to do the same when updating many of our
> depends, but there is some structure we need to chat about there,
> probably at the next ODS)
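>
> The gating idea itself is simple enough to sketch (hypothetical Python,
> the real wiring is still up for discussion):
>
>   def promote_if_green(build_image, run_trunk_tests, mark_as_current):
>       """Only start launching nodes from a new base image once the
>       trunk tests have passed against it."""
>       candidate = build_image()        # e.g. the nightly image build
>       if run_trunk_tests(candidate):   # full devstack run against trunk
>           mark_as_current(candidate)   # new nodes use it from now on
>           return True
>       return False                     # stick with the known-good image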
>
> As if that wasn't enough fun for one week:
>
> 4) Rackspace Cloud started handing us servers without networking
>
> Certainly not pointing fingers here - again, we're not really sure
> what's up with this one, but we started getting servers without working
> networking. The fix for this one is ostensibly simple: test that the
> node we've spun up can actually accept an ssh connection before we add
> it to the pool of available nodes. Again, once we're running with the
> jclouds plugin, Jenkins will just keep trying to make nodes
> until it can ssh in to one, so this problem will also cease to be.
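>
> The check itself is nothing fancy - something along these lines (an
> illustrative sketch, not the exact code we'll land):
>
>   import socket
>   import time
>
>   def wait_for_ssh(host, port=22, timeout=300, interval=10):
>       """Return True once the node accepts a TCP connection on the ssh
>       port, or False if it never does within `timeout` seconds."""
>       deadline = time.time() + timeout
>       while time.time() < deadline:
>           try:
>               with socket.create_connection((host, port), interval):
>                   return True
>           except OSError:
>               time.sleep(interval)
>       return False
>
>   # Only nodes that pass this check get added to the available pool.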
>
> Anyhow - sorry for the hiccups this week. We're trying to balance
> dealing with the fires as they happen with solving their root causes -
> so sometimes there's a little bit more of a lag before a fix than we'd
> like. Here's hoping that next week doesn't bring us quite as much fun.
>
> Have a great weekend!
>
> Monty
>
>
>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack at lists.launchpad.net
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp



