[OpenStack-Infra] Suggestion for helping gate users deal with crisis...

Clint Byrum clint at fewbar.com
Tue Sep 24 16:42:19 UTC 2013


Hello infra rockstars. First and foremost, thank you for keeping the
well-oiled machinery of the OpenStack infrastructure running. It is a
marvel of modern engineering, and I am not just saying that because I
am prone to hyperbole.

Last night, while the gate was exploding, a few of us noticed but
weren't really sure what to do. ttx was whitelisted in the statusbot
but untrained in how to handle it. I dug through the Jenkins configs and
logs, but I am completely ignorant of Zuul and thus would have done
more harm than good had I been able to coax anything out of the system
(luckily I am also completely unprivileged... good job :).

I'd like to suggest that infra develop a playbook for dealing
with crises. This is not just for those of you with the power to fix
things. It would be a public document that helps people understand what to do,
who to wait for, who to contact, and how to do so when things are broken.

Statusbot works well as an "Officer Barbrady" style "nothing to see here,
move along", but it is not so useful in helping to get the ball rolling
on a solution after hours in the US. Had there been a playbook with
roles listed, the statusbot would have been put to use. As in:

"In the event of any failure, the statusbot should be updated by someone
who is whitelisted in this file [link to the file] in git." Those
individuals can send a message in this format in #openstack-infra to
update the status:

   #status PyPI mirror problems causing gate failures. Please stand by...

This should be in a wiki page or published document somewhere that is
linked "basically everywhere". This allows those who see a failure as
a crisis to click through and find a warm, fuzzy set of options to take. It
also helps take the burden off the infra team of educating everyone on
how to deal with a crisis. It is especially helpful in scaling the team
out, as new members can learn how the team operates in general via the
playbook rather than having to wait for a crisis to happen.

Anyway, just a suggestion. As I don't know the plays, I cannot write
this page, but I would have been able to share the link with the few
others who were affected by the outage last night, and that might have
reduced their stress level a bit.


