[Openstack] a week of shaky dev infrastructure

Monty Taylor mordred at inaugust.com
Sat Feb 11 00:10:24 UTC 2012


Hi all!

It's been a rough week from a CI and dev infrastructure perspective. I
wanted to let people know what happened, what we did about it, and where
it's going.

First of all, we had 4 different things happen, all in the same week
(when it rains, it pours):

1) Rackspace Cloud changed the compute service name
2) github https connections started hanging
3) Ubuntu Oneiric updated their kernel
4) Rackspace Cloud started handing us servers without networking

I'll cover them one by one, but first, for those of you who don't know -
we spin up cloud servers on demand on an account that Rackspace has
provided. Normally, this is a great thing. Sometimes, being a cloud,
it's shaky, and we do a number of things to guard against that,
including pre-creating a pool of servers. Sometimes, the world conspires
against us and all of that is for naught.

As part of a longer term solution, we have an engineer working on
completing a plugin for Jenkins that will handle all of the provisioning
BEFORE test time - so that if we can't spin up nodes to run tests on,
we'll simply queue up tests rather than cause failures. I'm mentioning
that because you are all an audience of cloud engineers, so I don't want
you to think we're not working on the real solution. However, that's
still probably 3 or 4 weeks out from being finished, so in the meantime,
we have to do this.

Now, for the details:

1) Rackspace Cloud changed the compute service name

The Cloud Servers API changed the service name this week from
cloudServers to cloudServersLegacy. This caused libcloud, which is the
basis of the scripts that we use to provision nodes for devstack
integration tests, to fail, which meant that the job that spins up our
pool of available servers wasn't able to replenish the pool.

Once we identified the problem (with the help of the libcloud folks), we
put in a local patch that accepted both names until Rackspace rolled back
the service name change. But there were several hours in there where we
simply couldn't spin up servers. This was the cause of a large portion
of yesterday's problems.

2) github https connections started hanging

We had a few intermittent github outages this week. Normally this
shouldn't be too much of a problem, but (lucky us) we uncovered a bug
with the URL Change Trigger plugin for Jenkins that we were using. The
bug was that it wasn't setting a TCP connect timeout, so if the remote
end never completed the TCP handshake, the connect call would just block
indefinitely. Still not a huge deal, right? WELL - we use that plugin as
part of a scheduled job, which runs inside of the Cron thread inside of
Jenkins ... so the hung connect jammed that thread, which caused ALL jobs
that ran off of a scheduled timer to just stop running. This is the
reason for the exhaustion of devstack nodes on Tuesday, Wednesday and
Thursday.

Once we figured out what was going on, we patched the problem, submitted
it upstream and they made a new release, which we upgraded to
yesterday... so we should not suffer from this problem again.

Longer term, we're finishing work on a patch to the gerrit trigger
plugin so that we can stop polling github for post-merge changes and
instead just respond to merge events in gerrit. (ping me if you want to
know why we need to write a patch for that)

3) Ubuntu Oneiric updated their kernel

We're still working on the why of the breakage here. We update the base
image we use for launching devstack nodes nightly, so that the spin up
time is lower, but due to intermittent cloud issues, that hasn't been
working properly for a few days. Last night it started working again,
and the base image updated. Unfortunately, an update to Ubuntu itself
left us without some headers that we need for iscsi to work properly.
This borked up the stable/diablo branch testing for nova pretty hard.
We've fixed this moving forward by explicitly adding the package that
has the headers... the question of why it worked before the update is
still under investigation.

Longer term we're going to construct these test nodes in a different
way, and we've discussed applying gating logic to them so that we don't
add a new node base image as a usable base until it has passed the trunk
tests. (I'd personally like to do the same thing when updating many of
our dependencies, but there is some structure we need to chat about
there, probably at the next ODS.)
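
For concreteness, that gating idea looks roughly like this - a
hypothetical sketch where every name is made up, and the test step is
stubbed out so the example runs:

    CURRENT_IMAGE_FILE = 'current-base-image.txt'

    def run_trunk_tests(image_id):
        # Placeholder for "boot a node from image_id and run the trunk
        # gate jobs against it".
        return True

    def promote(image_id):
        # Record the image that new devstack nodes should launch from.
        with open(CURRENT_IMAGE_FILE, 'w') as f:
            f.write(image_id + '\n')

    def refresh_base_image(candidate_image_id):
        if run_trunk_tests(candidate_image_id):
            promote(candidate_image_id)
        else:
            # Keep launching from yesterday's known-good image.
            print('%s failed trunk tests; not promoting' % candidate_image_id)

    if __name__ == '__main__':
        refresh_base_image('oneiric-20120211')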

As if that wasn't enough fun for one week:

4) Rackspace Cloud started handing us servers without networking

Certainly not pointing fingers here - again, we're not really sure
what's up with this one, but we started getting servers without working
networking. The fix for this one is ostensibly simple, which is to test
that the node we've spun up can actually take an ssh connection before
we add it to the pool of available nodes. Again, once we're running with
the jclouds plugin, Jenkins will just keep trying to make nodes until it
can ssh into one, so this problem will also cease to be.
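
The check is about as simple as it sounds; something along these lines
(a sketch, not our actual provisioning script, with a TCP check on port
22 standing in for a full ssh handshake) runs before a node goes into
the pool:

    import socket

    def ssh_reachable(host, port=22, timeout=30):
        # Return True only if we can complete a TCP connection to the
        # node's ssh port within `timeout` seconds.
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            return True
        except socket.error:
            return False

    # Only nodes that answer on port 22 get added to the available pool.
    candidate_ips = ['192.0.2.10', '192.0.2.11']   # example addresses
    pool = [ip for ip in candidate_ips if ssh_reachable(ip)]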

Anyhow - sorry for the hiccups this week. We're trying to balance
dealing with the fires as they happen with solving their root causes -
so sometimes there's a little bit more of a lag before a fix than we'd
like. Here's hoping that next week doesn't bring us quite as much fun.

Have a great weekend!

Monty






