[openstack-dev] [Openstack-sigs] [meta] Proposal for self-healing SIG (fwd)

Adam Spiers aspiers at suse.com
Sun Sep 17 22:58:06 UTC 2017

Hi everyone,

As per below, I've just proposed the creation of a new SIG.  Feedback
is very welcome - ideally it would all be collected in the same thread
I started over on the openstack-sigs list, but feedback in two places
is more useful than nowhere, so I'll keep an eye out here too ;-)


----- Forwarded message from Adam Spiers <aspiers at suse.com> -----

Date: Sun, 17 Sep 2017 23:35:02 +0100
From: Adam Spiers <aspiers at suse.com>
To: OpenStack SIGs list <openstack-sigs at lists.openstack.org>
Subject: [Openstack-sigs] [meta] Proposal for self-healing SIG

Hi all, 

[TL;DR: we want to set up a "self-healing infrastructure" SIG.] 

One of the biggest promises of the cloud vision was the idea that all 
the infrastructure could be managed in a policy-driven fashion, 
reacting to failures and other events by automatically healing and 
optimising services.  Most of the components required to implement 
such an architecture already exist, e.g. 

  - Monasca: Monitoring
  - Aodh: Alarming
  - Congress: Policy-based governance
  - Mistral: Workflow
  - Senlin: Clustering
  - Vitrage: Root Cause Analysis
  - Watcher: Optimization
  - Masakari: Compute plane HA
  - Freezer-dr: DR and compute plane HA

However, there is not yet a clear strategy within the community for 
how these should all tie together. 

So at the PTG last week in Denver, we held an initial cross-project 
meeting to discuss this topic.[0]  It was well-attended, with 
representation from almost all of the relevant projects, and it felt 
like a very productive session to me.  I shall do my best to summarise 
whilst trying to avoid any misrepresentation ...

There was general agreement that the following actions would be
worthwhile:

  - Document reference stacks describing what use cases can already be
    addressed with the existing projects.  (Even better if some of
    these stacks have already been tested in the wild.)

  - Document what integrations between the projects already exist at a
    technical level.  (We actually began this during the meeting, by
    placing the projects into phases of a high-level flow, and then
    collaboratively building a Google Drawing to show that.[1])

  - Collect real-world use cases from operators, including ones which
    they would like to accomplish but cannot yet.

  - From the above, perform gap analysis to help shape the future
    direction of these projects, e.g. through specs targeting those
    gaps.

  - Perform overlap analysis to help ensure that the projects are
    correctly scoped and integrate well without duplicating any
    significant effort.[2]

  - Set up a SIG[3] to promote further discussion across the projects
    and with operators.  I talked to Thierry afterwards, and
    consequently this email is the first step on that path :-)

  - Allocate the SIG a mailing list prefix - "[self-healing]" or
    similar.

  - Set up a bi-weekly IRC meeting for the SIG.

  - Continue the discussion at the Sydney Forum, since it's an ideal
    opportunity to get developers and operators together and decide
    what the next steps should be.

  - Continue the discussion at the next Ops meetup in Tokyo.

I got coerced^Wvolunteered to drive the next steps ;-)  So far I 
have created an etherpad proposing the Forum session[4], and added it 
to the Forum wiki page[5].  I'll also add it to the SIG wiki page[6]. 

There were things we did not reach a concrete conclusion on: 

  - What should the SIG be called?  We felt that "self-healing" was
    pretty darn close to capturing the intent of the topic.  However
    as a natural pedant, I couldn't help but notice that technically
    speaking, that would most undesirably exclude Watcher, because the
    optimization it provides isn't *quite* "healing" - the word
    "healing" implies that something is sick, and optimization can be
    applied even when the cloud is perfectly healthy.  Any suggestions
    for a name with a marginally wider scope would be gratefully
    received.

  - Should the SIG be scoped to only focus on self-healing (and
    self-optimization) of OpenStack infrastructure, or should it also
    include self-healing of workloads?  My feeling is that we should
    keep it scoped to the infrastructure which falls under the
    responsibility of the cloud operators; anything user-facing would
    be very different from a process perspective.

  - How should the SIG's governance be set up?  Unfortunately it
    didn't occur to me to raise this question during the discussion,
    but I've since seen that the k8s SIG managed to make some
    decisions in this regard[7], and stealing their idea of a PTL-type
    model with a minimum of 2 chairs sounds good to me.

  - Which timezone should the IRC meeting be in?  As usual, there were
    interested parties from all the usual continents, so no one time
    would suit everyone.  I guess I can just submit a review to the
    irc-meetings repo and we can have a voting war in Gerrit ;-/
    Another option would be to alternate timezones every week or two.

Feedback on any of this is of course most welcome!  After sending
this, I'll forward it to openstack-{dev,operators} and ask for any
feedback to be submitted here.


  [0] https://etherpad.openstack.org/p/self-healing-queens-ptg

  [1] https://goo.gl/Pf2KgJ

  [2] Sampath (Masakari PTL), Saad (Freezer PTL), and I had a productive
      follow-up discussion on how we could aim to re-scope these two
      projects to avoid unnecessary duplication of effort.

  [3] https://ttx.re/introducing-sigs.html

  [4] https://etherpad.openstack.org/p/self-healing-rocky-forum

  [5] https://wiki.openstack.org/wiki/Forum/Sydney2017

  [6] https://wiki.openstack.org/wiki/OpenStack_SIGs

  [7] https://etherpad.openstack.org/p/queens-ptg-sig-k8s


----- End forwarded message -----
