<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div>I'm not sure how well it would work here, but I've used <a href="http://Pagerduty.com">Pagerduty.com</a> for something similar before.</div><div><br></div><div>The big upside of PagerDuty is that it is pretty good at contacting people who aren't at their computers. </div><div><br></div><div>It supports email notifications and webhooks for people who want lots of control over what to do with alerts; push notifications to iOS or Android; and SMS or a phone call as a last resort. Each person can configure their own alerts to suit them.</div><div><br></div><div>It handles escalating unhandled alerts, including looping back to the start if it can't reach anyone.</div><div><br></div><div>It allows incidents to be handed over to arbitrary people regardless of who the roster says is on call, and it supports schedule overrides when a shift has to be reassigned.</div><div><br></div><div>Incidents can be created via REST, email, or (I think) webhooks, so it's easy for users or for automated systems to raise an alarm.</div><div><br></div><div>It has some drawbacks: it would force us to define a rotation (or several rotations, one for each region, if we want to follow the sun), and someone needs to pay for it.</div><div><br></div><div>I think it handles most of what we want though. It gives infra admins a bat-signal to request urgent help, and it gives us a way to ping other team members when we need to hand over. 
It isn't very good for $randoms raising low-priority issues though - it treats every incident as equally urgent.</div><div><br></div><div>I haven't used it, but OpsGenie seems to have a similar set of features (more, if <a href="https://www.opsgenie.com/pagerduty">https://www.opsgenie.com/pagerduty</a> is to be believed).</div><div><br></div><div><br>On 26 Feb 2014, at 9:30 am, Robert Collins &lt;<a href="mailto:robertc@robertcollins.net">robertc@robertcollins.net</a>&gt; wrote:<br><br></div><blockquote type="cite"><div><span>In the tripleo meeting today we re-affirmed that the tripleo-cd-admins</span><br><span>team is aimed at delivering production-availability clouds - thats how</span><br><span>we know the the tripleo program is succeeding (or not !).</span><br><span></span><br><span>So if you're a member of that team, you're on the hook - effectively</span><br><span>on call, where production issues will take precedence over development</span><br><span>/ bug fixing etc.</span><br><span></span><br><span>We have the following clouds today:</span><br><span>cd-undercloud (baremetal, one per region)</span><br><span>cd-overcloud (KVM in the HP region, not sure yet for the RH region) -</span><br><span>multi region.</span><br><span>ci-overcloud (same as cd-overcloud, and will go away when cd-overcloud</span><br><span>is robust enough).</span><br><span></span><br><span>And we have two users:</span><br><span> - TripleO ATCs, all of whom are eligible for accounts on *-overcloud</span><br><span> - TripleO reviewers, indirectly via openstack-infra who provide 99%</span><br><span>of the load on the clouds</span><br><span></span><br><span>Right now when there is a problem, there's no clearly defined 'get</span><br><span>hold of someone' mechanism other than IRC in #tripleo.</span><br><span></span><br><span>And thats pretty good since most of the admins are on IRC most of the time.</span><br><span></span><br><span>But.</span><br><span></span><br><span>There are two holes - a) what 
if its sunday evening :) and b) what if</span><br><span>someone (for instance Derek) has been troubleshooting a problem, but</span><br><span>needs to go do personal stuff, or you know, sleep. There's no reliable</span><br><span>defined handoff mechanism.</span><br><span></span><br><span>So - I think we need to define two things:</span><br><span> - a stock way for $randoms to ask for support w/ these clouds that</span><br><span>will be fairly low latency and reliable.</span><br><span> - a way for us to escalate to each other *even if folk happen to be</span><br><span>away from the keyboard at the time*.</span><br><span>And possibly a third:</span><br><span> - a way for openstack-infra admins to escalate to us in the event of</span><br><span>OMG things happening. Like, we send 1000 VMs all at once at their git</span><br><span>mirrors or something.</span><br><span></span><br><span>And with that lets open the door for ideas!</span><br><span></span><br><span>-Rob</span><br><span>-- </span><br><span>Robert Collins &lt;<a href="mailto:rbtcollins@hp.com">rbtcollins@hp.com</a>&gt;</span><br><span>Distinguished Technologist</span><br><span>HP Converged Cloud</span><br><span></span><br><span>_______________________________________________</span><br><span>OpenStack-dev mailing list</span><br><span><a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a></span><br><span><a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a></span><br></div></blockquote></body></html>