[OpenStack-Infra] Suggestion for helping gate users deal with crisis...

Clint Byrum clint at fewbar.com
Tue Sep 24 18:34:49 UTC 2013


Excerpts from jeblair's message of 2013-09-24 10:15:10 -0700:
> Clint Byrum <clint at fewbar.com> writes:
> > I'd like to suggest that infra develop a playbook for dealing
> > with crises. This is not just for those of you with the power to fix
> > things. This is a public document that helps people understand what to do,
> > who to wait for, who to contact, and how to do so, when things are broken.
> >
> > Statusbot works well as an "Officer Barbrady" style "nothing to see here,
> > move along", but it is not so useful in helping to get the ball rolling
> > on a solution after hours in the US. Had there been a playbook with
> > roles listed, the statusbot would have been used. As in:
> >
> > "In the event of any failure, the statusbot should be updated by someone
> > who is whitelisted in this file [link to the file] in git." Those
> > individuals can send a message in this format, in #openstack-infra to
> > update the status:
> >
> >    #status PyPI mirror problems causing gate failures. Please stand by...
> >
> > This should be in a wiki page or published document somewhere that is
> > linked "basically everywhere". This allows those who see a failure as
> > a crisis to click through, and find a warm fuzzy of options to take. It
> > also helps take the burden off the infra team for educating everyone on
> > how to deal with a crisis. It is especially helpful in scaling the team
> > out, as new members can learn how the team operates in general via the
> > playbook rather than having to wait for a crisis to happen.
> >
> > Anyway, just a suggestion. As I don't know the plays, I cannot write
> > this page, but I would have been able to share the link with the few
> > others who were affected by the outage last night, and that might have
> > reduced their stress level a bit.
> 
> Now you know where that documentation would live if it existed.  We try
> to document everything about the system on ci.openstack.org.  If you are
> at all curious about the project infrastructure, I highly recommend it.
> 
> Our goal is not to be project gatekeepers.  We have robots for that.
> Our goal is to facilitate everyone's participation in the project
> infrastructure.  In most cases, privileged access is not required to
> triage or solve problems.  I believe some was used in this case (to
> manually add a package to the pypi mirror in order to speed up the
> solution), but generally when something breaks, anyone in the project
> has the power to fix it.  The best way to get the ball rolling on a
> solution is for someone to start working on it.  And yes, then someone
> should post a status update so people know it's being worked on.
> 
> I'm wary of writing a playbook that says "if something breaks, contact
> so-and-so to fix it".  That's not how this thing works.  It's more like
> "if something breaks, start trying to fix it".  And while I understand
> that not everyone is capable of diagnosing and fixing every problem,
> quite a number of people have managed to track down, diagnose, and fix
> problems in this system without being infra rockstars.
> 

Indeed, the playbook I have in mind is not "this is how things are,
and these are the people who do the things". Ideally each playbook entry
is tied to a bug that has long-term ordering problems or just a ton of
work pending.

> Keep in mind, this failure, and indeed most failures, are not
> infrastructure failures.  They are actually the gate working as
> designed.  It is _supposed_ to 'break' when it is not possible to test
> changes under the constraints we have set.  So a more traditional
> "enterprise service" playbook doesn't help -- there are no simple levers
> to pull, every such problem is different and requires a unique solution,
> and everyone is empowered to create that solution.
> 

Indeed, I have noticed that there are rarely easy answers, as all of
the easy answers are automated properly. :)

I think what I'm looking for is something along the lines of "this
is what is supposed to be happening right now", for those who are not
involved day to day in that decision stream.


