[openstack-dev] [Tripleo] tripleo-cd-admins team update / contact info question

James Polley jp at jamezpolley.com
Wed Feb 26 08:04:15 UTC 2014


I'm not sure how well it would work here, but I've used Pagerduty.com for something similar before.

The big upside of PagerDuty is that it is pretty good at contacting people who aren't at their computers.

It supports email notifications and webhooks for people who want lots of control over what to do with alerts; push notifications to iOS or Android; and SMS or phone calls as a last resort. Each person can configure their own alerts to suit them.

It handles escalating unhandled alerts, including looping back to the start if it can't reach anyone.
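To make the escalation behaviour concrete, here's a rough sketch of the idea in Python. This is not PagerDuty's implementation or API, just my illustration of "page each level in turn, and loop back to the start if nobody answers". The names `escalate`, `try_contact` and `max_rounds` are made up for the example, and the cap on the number of loops is my assumption so that the sketch terminates.

```python
def escalate(levels, try_contact, max_rounds=3):
    """Page each escalation level in order, looping back to the start.

    levels      -- list of lists of people; levels[0] is paged first
    try_contact -- hypothetical callable(person) -> bool, True if the
                   person acknowledged the alert
    max_rounds  -- assumed cap on full passes through the levels
    Returns the person who acknowledged, or None if nobody did.
    """
    for _ in range(max_rounds):
        for level in levels:
            for person in level:
                if try_contact(person):
                    return person
    return None


if __name__ == "__main__":
    # Hypothetical roster: one primary, two secondaries.
    roster = [["primary"], ["secondary-1", "secondary-2"]]
    paged = []

    def nobody_home_until_fifth_page(person):
        paged.append(person)
        return len(paged) == 5

    print(escalate(roster, nobody_home_until_fifth_page))
    print(paged)
```

In the usage example the first pass through the roster gets no answer, so the alert loops back to the primary and is finally acknowledged on the second pass.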

It allows incidents to be handed over to arbitrary people regardless of who the roster says is on call, and for schedule overrides when a shift has to be reassigned.

Incidents can be created via REST, email, or (I think) webhooks, so it's easy for users or automated systems to raise an alarm.
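As a sketch of what raising an alarm from an automated system might look like, here's a small Python example that builds (but does not send) an HTTP request. The endpoint URL and the JSON field names are from my memory of PagerDuty's Events API, so treat them as assumptions and check them against the current documentation; `build_trigger_request` and the example values are hypothetical.

```python
import json
import urllib.request

# Assumption: endpoint as I remember it; verify against PagerDuty's docs.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_request(routing_key, summary, source, severity="critical"):
    """Build a POST request that would open an incident.

    routing_key identifies the service being alerted; summary, source
    and severity describe the problem. Field names are assumptions.
    """
    body = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_trigger_request(
        "EXAMPLE-ROUTING-KEY",            # placeholder, not a real key
        "ci-overcloud is not scheduling instances",
        "example-monitor",                # hypothetical source name
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually send it; I've left that out so the sketch can be run without an account.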

It has some drawbacks: it would force us to define a rotation (or several rotations, one for each region, if we want to follow the sun), and someone needs to pay for it.

I think it handles most of what we want though. It gives infra admins a bat-signal to request urgent help, and it gives us a way to ping other team members when we need to hand over. It isn't very good for $randoms raising low-priority issues, though: it treats every incident as equally urgent.

I haven't used it, but OpsGenie seems to have a similar set of features (more, if https://www.opsgenie.com/pagerduty is to be believed).


> On 26 Feb 2014, at 9:30 am, Robert Collins <robertc at robertcollins.net> wrote:
> 
> In the tripleo meeting today we re-affirmed that the tripleo-cd-admins
> team is aimed at delivering production-availability clouds - that's how
> we know the tripleo program is succeeding (or not!).
> 
> So if you're a member of that team, you're on the hook - effectively
> on call, where production issues will take precedence over development
> / bug fixing etc.
> 
> We have the following clouds today:
> cd-undercloud (baremetal, one per region)
> cd-overcloud (KVM in the HP region, not sure yet for the RH region) -
> multi region.
> ci-overcloud (same as cd-overcloud, and will go away when cd-overcloud
> is robust enough).
> 
> And we have two users:
> - TripleO ATCs, all of whom are eligible for accounts on *-overcloud
> - TripleO reviewers, indirectly via openstack-infra who provide 99%
> of the load on the clouds
> 
> Right now when there is a problem, there's no clearly defined 'get
> hold of someone' mechanism other than IRC in #tripleo.
> 
> And that's pretty good since most of the admins are on IRC most of the time.
> 
> But.
> 
> There are two holes - a) what if it's Sunday evening :) and b) what if
> someone (for instance Derek) has been troubleshooting a problem, but
> needs to go do personal stuff, or you know, sleep. There's no reliable
> defined handoff mechanism.
> 
> So - I think we need to define two things:
>  - a stock way for $randoms to ask for support w/ these clouds that
> will be fairly low latency and reliable.
>  - a way for us to escalate to each other *even if folk happen to be
> away from the keyboard at the time*.
> And possibly a third:
>  - a way for openstack-infra admins to escalate to us in the event of
> OMG things happening. Like, we send 1000 VMs all at once at their git
> mirrors or something.
> 
> And with that let's open the door for ideas!
> 
> -Rob
> -- 
> Robert Collins <rbtcollins at hp.com>
> Distinguished Technologist
> HP Converged Cloud