[Openstack-sigs] [meta] Proposal for self-healing SIG
Adam Spiers
aspiers at suse.com
Sun Sep 17 22:35:02 UTC 2017
Hi all,
[TL;DR: we want to set up a "self-healing infrastructure" SIG.]
One of the biggest promises of the cloud vision was the idea that all
the infrastructure could be managed in a policy-driven fashion,
reacting to failures and other events by automatically healing and
optimising services. Most of the components required to implement
such an architecture already exist, e.g.
- Monasca: Monitoring
- Aodh: Alarming
- Congress: Policy-based governance
- Mistral: Workflow
- Senlin: Clustering
- Vitrage: Root Cause Analysis
- Watcher: Optimization
- Masakari: Compute plane HA
- Freezer-dr: DR and compute plane HA
However, there is not yet a clear strategy within the community for
how these should all tie together.
So at the PTG last week in Denver, we held an initial cross-project
meeting to discuss this topic.[0] It was well-attended, with
representation from almost all of the relevant projects, and it felt
like a very productive session to me. I shall do my best to summarise
whilst trying to avoid any misrepresentation ...
There was general agreement that the following actions would be
worthwhile:
- Document reference stacks describing what use cases can already be
addressed with the existing projects. (Even better if some of
these stacks have already been tested in the wild.)
- Document what integrations between the projects already exist at a
technical level. (We actually began this during the meeting, by
placing the projects into phases of a high-level flow, and then
collaboratively building a Google Drawing to show that.[1])
- Collect real-world use cases from operators, including ones which
they would like to accomplish but cannot yet.
- From the above, perform gap analysis to help shape the future
direction of these projects, e.g. through specs targeting those
gaps.
- Perform overlap analysis to help ensure that the projects are
correctly scoped and integrate well without duplicating any
significant effort.[2]
- Set up a SIG[3] to promote further discussion across the projects
and with operators. I talked to Thierry afterwards, and
consequently this email is the first step on that path :-)
- Allocate the SIG a mailing list prefix - "[self-healing]" or
similar.
- Set up a bi-weekly IRC meeting for the SIG.
- Continue the discussion at the Sydney Forum, since it's an ideal
opportunity to get developers and operators together and decide
what the next steps should be.
- Continue the discussion at the next Ops meetup in Tokyo.
I got coerced^Wvolunteered to drive the next steps ;-) So far I
have created an etherpad proposing the Forum session[4], and added it
to the Forum wiki page[5]. I'll also add it to the SIG wiki page[6].
There were things we did not reach a concrete conclusion on:
- What should the SIG be called? We felt that "self-healing" was
pretty darn close to capturing the intent of the topic. However,
as a natural pedant, I couldn't help but notice that technically
speaking, that would most undesirably exclude Watcher, because the
optimization it provides isn't *quite* "healing" - the word
"healing" implies that something is sick, and optimization can be
applied even when the cloud is perfectly healthy. Any suggestions
for a name with a marginally wider scope would be gratefully
received.
- Should the SIG be scoped to only focus on self-healing (and
self-optimization) of OpenStack infrastructure, or should it also
include self-healing of workloads? My feeling is that we should
keep it scoped to the infrastructure which falls under the
responsibility of the cloud operators; anything user-facing would
be very different from a process perspective.
- How should the SIG's governance be set up? Unfortunately it
didn't occur to me to raise this question during the discussion,
but I've since seen that the k8s SIG managed to make some
decisions in this regard[7], and stealing their idea of a PTL-type
model with a minimum of 2 chairs sounds good to me.
- Which timezone should the IRC meeting be in? As usual, there were
interested parties from all the usual continents, so no one time
would suit everyone. I guess I can just submit a review to the
irc-meetings repo and we can have a voting war in Gerrit ;-/
Another option would be to alternate timezones every week or two.
Feedback on any of this is of course most welcome! After sending
this, I'll forward it to openstack-{dev,operators} and ask for any
feedback to be submitted here.
Thanks,
Adam
[0] https://etherpad.openstack.org/p/self-healing-queens-ptg
[1] https://goo.gl/Pf2KgJ
[2] Sampath (Masakari PTL), Saad (Freezer PTL), and I had a productive
follow-up discussion on how we could aim to re-scope these two
projects to avoid unnecessary duplication of effort.
[3] https://ttx.re/introducing-sigs.html
[4] https://etherpad.openstack.org/p/self-healing-rocky-forum
[5] https://wiki.openstack.org/wiki/Forum/Sydney2017
[6] https://wiki.openstack.org/wiki/OpenStack_SIGs
[7] https://etherpad.openstack.org/p/queens-ptg-sig-k8s