<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Sun, Sep 17, 2017 at 11:34 PM Adam Spiers <<a href="mailto:aspiers@suse.com">aspiers@suse.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
<br>
[TL;DR: we want to set up a "self-healing infrastructure" SIG.]<br></blockquote><div>Nice!</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
One of the biggest promises of the cloud vision was the idea that all<br>
> the infrastructure could be managed in a policy-driven fashion,
> reacting to failures and other events by automatically healing and
> optimising services. Most of the components required to implement
> such an architecture already exist, e.g.
>
> - Monasca: Monitoring
> - Aodh: Alarming
> - Congress: Policy-based governance
> - Mistral: Workflow
> - Senlin: Clustering
> - Vitrage: Root Cause Analysis
> - Watcher: Optimization
> - Masakari: Compute plane HA
> - Freezer-dr: DR and compute plane HA
>
> However, there is not yet a clear strategy within the community for
> how these should all tie together.
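For anyone less familiar with how those pieces are meant to fit together,
here is a minimal, purely illustrative sketch in plain Python of the
monitor -> alarm -> policy -> workflow control loop that the projects
above implement between them. None of the names below are real project
APIs; they are hypothetical stand-ins for the roles each project plays.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Alarm:
    """A hypothetical event from the monitoring/alarming layer
    (the role Monasca and Aodh play); not a real API."""
    resource: str     # e.g. a compute host name
    alarm_type: str   # e.g. "host_down", "high_cpu"

def evacuate_host(alarm: Alarm) -> str:
    # In a real cloud this step would be e.g. Masakari, or a Mistral
    # workflow that evacuates instances off the failed host.
    return "evacuating instances from %s" % alarm.resource

def rebalance_cluster(alarm: Alarm) -> str:
    # Optimisation rather than healing: the role Watcher plays.
    return "rebalancing load away from %s" % alarm.resource

# A policy maps alarm types to remediation workflows (roughly the
# role Congress and Mistral share between them).
POLICY: Dict[str, Callable[[Alarm], str]] = {
    "host_down": evacuate_host,
    "high_cpu": rebalance_cluster,
}

def handle(alarms: List[Alarm]) -> List[str]:
    """Apply the policy to each incoming alarm; return the actions taken."""
    return [POLICY[a.alarm_type](a) for a in alarms if a.alarm_type in POLICY]

if __name__ == "__main__":
    incoming = [Alarm("compute-03", "host_down"),
                Alarm("compute-07", "high_cpu")]
    for action in handle(incoming):
        print(action)

The real integrations are of course far richer than this; the point is
only to make the intended control loop explicit.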
>
> So at the PTG last week in Denver, we held an initial cross-project
> meeting to discuss this topic.[0] It was well-attended, with
> representation from almost all of the relevant projects, and it felt
> like a very productive session to me. I shall do my best to summarise
> whilst trying to avoid any misrepresentation ...

I'm sorry that I missed the session at the PTG :)

Do you have any plans or ideas yet about what verification might look
like for the integration between all the projects in your list, and for
self-healing-specific scenarios?

During the QA sessions at the PTG we discussed HA / fault tolerance
testing. There is a proposal for a community framework for that, but we
have no plan yet for where to run such tests or how to maintain them for
OpenStack. It might be a fitting use case for this nascent SIG.

Andrea Frittoli (andreaf)
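To make the kind of scenario Andrea mentions concrete, a self-healing
verification case could be expressed as a fault-injection test roughly
like the sketch below. Every helper name here is a hypothetical
placeholder, not the API of any existing framework.

import time

def kill_service(host: str, service: str) -> None:
    """Placeholder: inject a fault, e.g. stop nova-compute on one host."""
    raise NotImplementedError("wire this up to your fault-injection tooling")

def service_is_healthy(host: str, service: str) -> bool:
    """Placeholder: ask the monitoring layer whether the service is up."""
    raise NotImplementedError("wire this up to your monitoring API")

def wait_until(predicate, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll a condition until it holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

def test_compute_service_self_heals():
    # 1. Inject the fault.
    kill_service("compute-03", "nova-compute")
    # 2. Assert that the self-healing pipeline restores the service within
    #    the recovery-time objective, without operator intervention.
    assert wait_until(lambda: service_is_healthy("compute-03", "nova-compute"))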
>
> There was general agreement that the following actions would be
> worthwhile:
>
> - Document reference stacks describing what use cases can already be
>   addressed with the existing projects. (Even better if some of
>   these stacks have already been tested in the wild.)
>
> - Document what integrations between the projects already exist at a
>   technical level. (We actually began this during the meeting, by
>   placing the projects into phases of a high-level flow, and then
>   collaboratively building a Google Drawing to show that.[1])
>
> - Collect real-world use cases from operators, including ones which
>   they would like to accomplish but cannot yet.
>
> - From the above, perform gap analysis to help shape the future
>   direction of these projects, e.g. through specs targeting those
>   gaps.
>
> - Perform overlap analysis to help ensure that the projects are
>   correctly scoped and integrate well without duplicating any
>   significant effort.[2]
>
> - Set up a SIG[3] to promote further discussion across the projects
>   and with operators. I talked to Thierry afterwards, and
>   consequently this email is the first step on that path :-)
>
> - Allocate the SIG a mailing list prefix - "[self-healing]" or
>   similar.
>
> - Set up a bi-weekly IRC meeting for the SIG.
>
> - Continue the discussion at the Sydney Forum, since it's an ideal
>   opportunity to get developers and operators together and decide
>   what the next steps should be.
>
> - Continue the discussion at the next Ops meetup in Tokyo.
>
> I got coerced^Wvolunteered to drive the next steps ;-) So far I
> have created an etherpad proposing the Forum session[4], and added it
> to the Forum wiki page[5]. I'll also add it to the SIG wiki page[6].
>
> There were a few things on which we did not reach a concrete conclusion:
>
> - What should the SIG be called? We felt that "self-healing" was
>   pretty darn close to capturing the intent of the topic. However,
>   as a natural pedant, I couldn't help but notice that technically
>   speaking, that would most undesirably exclude Watcher, because the
>   optimization it provides isn't *quite* "healing" - the word
>   "healing" implies that something is sick, and optimization can be
>   applied even when the cloud is perfectly healthy. Any suggestions
>   for a name with a marginally wider scope would be gratefully
>   received.
>
> - Should the SIG be scoped to only focus on self-healing (and
>   self-optimization) of OpenStack infrastructure, or should it also
>   include self-healing of workloads? My feeling is that we should
>   keep it scoped to the infrastructure which falls under the
>   responsibility of the cloud operators; anything user-facing would
>   be very different from a process perspective.
>
> - How should the SIG's governance be set up? Unfortunately it
>   didn't occur to me to raise this question during the discussion,
>   but I've since seen that the k8s SIG managed to make some
>   decisions in this regard[7], and stealing their idea of a PTL-type
>   model with a minimum of 2 chairs sounds good to me.
>
> - Which timezone should the IRC meeting be in? There were interested
>   parties from all the usual continents, so no one time would suit
>   everyone. I guess I can just submit a review to the irc-meetings
>   repo and we can have a voting war in Gerrit ;-/ Another option
>   would be to alternate timezones every week or two.
>
> Feedback on any of this is of course most welcome! After sending
> this, I'll forward it to openstack-{dev,operators} and ask for any
> feedback to be submitted here.
>
> Thanks,
> Adam
>
> [0] https://etherpad.openstack.org/p/self-healing-queens-ptg
>
> [1] https://goo.gl/Pf2KgJ
>
> [2] Sampath (Masakari PTL), Saad (Freezer PTL), and I had a productive
>     follow-up discussion on how we could aim to re-scope these two
>     projects to avoid unnecessary duplication of effort.
>
> [3] https://ttx.re/introducing-sigs.html
>
> [4] https://etherpad.openstack.org/p/self-healing-rocky-forum
>
> [5] https://wiki.openstack.org/wiki/Forum/Sydney2017
>
> [6] https://wiki.openstack.org/wiki/OpenStack_SIGs
>
> [7] https://etherpad.openstack.org/p/queens-ptg-sig-k8s