[OpenStack-Infra] gate outage Friday night

Monty Taylor mordred at inaugust.com
Sat Jul 13 04:17:25 UTC 2013


Hey all,

Quick note about Friday night's gate outage so that we can post-mortem
it later.

Best we can tell, there was a network incident at HP where things went
horribly wrong. Our hypothesis is that, during that period, we interpreted
failure responses from our slaves as "slave is gone, delete from db" when
the slaves were in fact still there. That then led to overrunning our
quota, because there were slaves that needed deleting but that we'd
stopped knowing about. We do not have proof of this - it's a hypothesis.
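
To make the hypothesis concrete, here is a rough sketch of the distinction
that may have been missed. It is not our actual code: ProbeResult,
should_forget_slave and reconcile are made-up names, and the idea that the
provider can tell us "definitely gone" separately from "request failed" is
an assumption.

```python
from enum import Enum


class ProbeResult(Enum):
    """Hypothetical outcomes of asking the provider about one slave."""
    ALIVE = "alive"
    NOT_FOUND = "not_found"  # provider definitively says the server is gone
    ERROR = "error"          # timeout, connection reset, 5xx, and so on


def should_forget_slave(result):
    """Only a definitive NOT_FOUND makes it safe to drop the db record.

    A generic ERROR (for example during a network incident) says nothing
    about whether the slave still exists, so the record is kept.
    """
    return result is ProbeResult.NOT_FOUND


def reconcile(records, probe):
    """Return the records to keep, given probe(slave_id) -> ProbeResult."""
    kept = {}
    for slave_id, record in records.items():
        if should_forget_slave(probe(slave_id)):
            continue  # definitively gone: safe to forget
        kept[slave_id] = record  # alive or unknown: keep tracking it
    return kept
```

With something like this, a slave whose status check merely failed stays in
the db, so it can still be found and deleted later instead of silently
eating quota.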

We (and by we I mean fungi) manually deleted all of the slaves.

The problem was noticed because of "lots of lost jobs showing up on
status page". That makes me think that perhaps that's a metric that would
be useful to track. Perhaps we track both "number of lost jobs" and
"number of jobs", so that 'lost-jobs-per-X' could be a thing we care
about, but also '%-jobs-lost-per-X'.
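
A minimal sketch of what those two metrics could look like, taking X to be
a time bucket. The input shape (a list of (timestamp, status) pairs), the
"LOST" status string and the function name are all assumptions for
illustration, not what the status page actually exposes.

```python
from collections import Counter


def lost_job_metrics(jobs, bucket_seconds=3600):
    """Compute lost-jobs-per-X and %-jobs-lost-per-X per time bucket.

    jobs: iterable of (timestamp_seconds, status) pairs; a status of
    "LOST" (an assumed label) marks a lost job.
    Returns {bucket_start: {"jobs": n, "lost": n, "pct_lost": float}}.
    """
    total = Counter()
    lost = Counter()
    for timestamp, status in jobs:
        # Round the timestamp down to the start of its bucket.
        bucket = int(timestamp // bucket_seconds) * bucket_seconds
        total[bucket] += 1
        if status == "LOST":
            lost[bucket] += 1
    return {
        bucket: {
            "jobs": total[bucket],
            "lost": lost[bucket],
            "pct_lost": 100.0 * lost[bucket] / total[bucket],
        }
        for bucket in sorted(total)
    }
```

Tracking both numbers matters: the raw count shows how bad things are in
absolute terms, while the percentage stays meaningful when overall job
volume is low.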

THEN - once things started coming back up, the slaves were unable to
properly connect to jenkins. Again, the hypothesis is that the rampant
slave failures put jenkins into a bad state. We restarted it.

After deleting all of the slaves and restarting jenkins, all appears to
be good now.

Monty