<div dir="ltr">Hi<div class="gmail_extra"><br><div class="gmail_quote">On 25 February 2014 14:30, Robert Collins <span dir="ltr">&lt;<a href="mailto:robertc@robertcollins.net" target="_blank">robertc@robertcollins.net</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">So - I think we need to define two things:<br>
- a stock way for $randoms to ask for support w/ these clouds that<br>
will be fairly low latency and reliable.<br>
- a way for us to escalate to each other *even if folk happen to be<br>
away from the keyboard at the time*.<br>
And possibly a third:<br>
- a way for openstack-infra admins to escalate to us in the event of<br>
OMG things happening. Like, we send 1000 VMs all at once at their git<br>
mirrors or something.<br></blockquote><div><br></div><div>I think action zero is to define an SLA, so everyone has a very clear picture of what to expect from us, and we have a clear picture of what we're signing up to provide.</div>
<div><br></div><div>Also, I'd note that talking about non-IRC escalation methods, coverage of weekends, etc. is moving us into a pretty different realm than we have been in, so it might be worth checking that all the current people (who might not all have been in the meeting) are ok with fixing a cloud on a Sunday :)</div>
<div><br></div><div>Then we need to map out who can be contacted at any given time of week, and how they can be contacted. Hopefully follow-the-sun covers us with normal working hours, apart from the gap between US/Pacific finishing their week and New Zealand starting the next one. Since we're essentially relying on volunteer efforts to service these production clouds, we would need to let people be pretty flexible about when they can be contacted.</div>
<div><br></div><div>Then we need to publish that information somewhere the relevant folk can see it, and set up some kind of monitoring that can escalate beyond IRC if a request isn't getting a response. James mentioned PagerDuty, and I've had good experiences with it in previous operational roles.</div>
</div><div><br></div><div>Then we need to write a playbook so each outage isn't a voyage of discovery unless it's something completely new, and commit to updating the playbook after each outage, with what we learned that time.</div>
<div><br></div><div>Have we considered reaching out to OpenStack sponsors who have operational folk, to see if they would be interested in contributing human resources to this?</div><div><br></div>-- <br><div dir="ltr">Cheers,<div>
<br></div><div>Chris</div></div>
</div></div>