[openstack-dev] [Tripleo] tripleo-cd-admins team update / contact info question

Derek Higgins derekh at redhat.com
Thu Feb 27 10:18:29 UTC 2014


On 25/02/14 22:30, Robert Collins wrote:
> In the tripleo meeting today we re-affirmed that the tripleo-cd-admins
> team is aimed at delivering production-availability clouds - that's how
> we know the tripleo program is succeeding (or not!).
> 
> So if you're a member of that team, you're on the hook - effectively
> on call, where production issues will take precedence over development
> / bug fixing etc.
> 
> We have the following clouds today:
> cd-undercloud (baremetal, one per region)
> cd-overcloud (KVM in the HP region, not sure yet for the RH region) -
> multi region.
> ci-overcloud (same as cd-overcloud, and will go away when cd-overcloud
> is robust enough).
> 
> And we have two users:
>  - TripleO ATCs, all of whom are eligible for accounts on *-overcloud
>  - TripleO reviewers, indirectly via openstack-infra who provide 99%
> of the load on the clouds
> 
> Right now when there is a problem, there's no clearly defined 'get
> hold of someone' mechanism other than IRC in #tripleo.
> 
> And that's pretty good since most of the admins are on IRC most of the time.
> 
> But.
> 
> There are two holes - a) what if it's Sunday evening :) and b) what if
> someone (for instance Derek) has been troubleshooting a problem, but
> needs to go do personal stuff, or you know, sleep. There's no reliable
> defined handoff mechanism.
> 
> So - I think we need to define two things:
>   - a stock way for $randoms to ask for support w/ these clouds that
> will be fairly low latency and reliable.
>   - a way for us to escalate to each other *even if folk happen to be
> away from the keyboard at the time*.
> And possibly a third:
>   - a way for openstack-infra admins to escalate to us in the event of
> OMG things happening. Like, we send 1000 VMs all at once at their git
> mirrors or something.
> 
> And with that let's open the door for ideas!
> 
> -Rob
> 


I agree this is something that needs to happen, at the very least to aid
handover between team members. I wonder if we should start simple, give
it a little time, and improve from there. Here is what I would suggest:

o Handling outages would be done here
https://etherpad.openstack.org/p/cloud-outage
o Once an outage occurs we jump on there to store relevant debug info
o Everybody on the admin list gets spammed on IRC by a bot (we could put
instructions for the bot into the etherpad) e.g.
ircmessage: ci-overcloud down, component XXX not starting
o Communicate on #tripleo as we are working on the problem
o For handover, we should keep enough in the etherpad to allow anybody
to get up to speed on the problem and the history of what's been done so far
o We delete irrelevant garbage from the etherpad once it becomes clear it's
irrelevant

Once finished, file bugs, write up a summary, clear the etherpad, etc.

infra (or any other user) could escalate to us by getting somebody on
#tripleo or adding something to the etherpad e.g.
ircmessage: ci-overcloud down since 12:30UTC
the irc bot would take it from there
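
To make the bot idea a bit more concrete, here is a rough sketch of the part
that scans the etherpad text for "ircmessage:" lines and turns each new one
into an IRC PRIVMSG per admin. Nothing here exists yet: the function names,
the admin nicks and the "remember what was already sent" set are all
assumptions for illustration, and fetching the pad and holding the IRC
connection are left out entirely.

```python
PREFIX = "ircmessage:"


def extract_irc_messages(pad_text):
    """Return the text of every 'ircmessage:' line found in the pad."""
    messages = []
    for line in pad_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(PREFIX):
            body = stripped[len(PREFIX):].strip()
            if body:
                messages.append(body)
    return messages


def new_notifications(pad_text, already_sent, admins):
    """Build raw IRC PRIVMSG lines for messages not announced before.

    already_sent is a set that is updated in place, so polling the same
    pad again does not spam the admins a second time.
    """
    lines = []
    for message in extract_irc_messages(pad_text):
        if message in already_sent:
            continue
        already_sent.add(message)
        for nick in admins:
            lines.append("PRIVMSG {} :{}".format(nick, message))
    return lines


if __name__ == "__main__":
    # Hypothetical pad contents and admin nicks, for illustration only.
    pad = "some debug notes\nircmessage: ci-overcloud down since 12:30UTC\n"
    sent = set()
    for line in new_notifications(pad, sent, ["admin1", "admin2"]):
        print(line)
```

Polling the pad on a timer and calling new_notifications() each time would
be enough for a first version; only lines that have not been seen before
produce messages.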

If it becomes obvious that we are not reacting quickly enough (or just
don't have enough redundancy of people in enough timezones to keep
somebody working on a problem) I think expanding the team a little might
be in order.

The disadvantage of using an etherpad is that anybody could come along and
delete details without us noticing ...

thoughts?


