<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Sun, Sep 17, 2017 at 11:34 PM Adam Spiers <<a href="mailto:aspiers@suse.com">aspiers@suse.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
<br>
[TL;DR: we want to set up a "self-healing infrastructure" SIG.]<br></blockquote><div>Nice!</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
One of the biggest promises of the cloud vision was the idea that all<br>
> the infrastructure could be managed in a policy-driven fashion,
> reacting to failures and other events by automatically healing and
> optimising services. Most of the components required to implement
> such an architecture already exist, e.g.
>
> - Monasca: Monitoring
> - Aodh: Alarming
> - Congress: Policy-based governance
> - Mistral: Workflow
> - Senlin: Clustering
> - Vitrage: Root Cause Analysis
> - Watcher: Optimization
> - Masakari: Compute plane HA
> - Freezer-dr: DR and compute plane HA
>
> However, there is not yet a clear strategy within the community for
> how these should all tie together.
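For anyone less familiar with how those pieces are meant to fit together,
here is a minimal, purely illustrative sketch in plain Python of the
monitor -> alarm -> policy -> workflow control loop that the projects
above implement between them. None of the names below are real project
APIs; they are hypothetical stand-ins for the roles each project plays.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Alarm:
    """A hypothetical event from the monitoring/alarming layer
    (the role Monasca and Aodh play); not a real API."""
    resource: str     # e.g. a compute host name
    alarm_type: str   # e.g. "host_down", "high_cpu"

def evacuate_host(alarm: Alarm) -> str:
    # In a real cloud this step would be e.g. Masakari, or a Mistral
    # workflow that evacuates instances off the failed host.
    return "evacuating instances from %s" % alarm.resource

def rebalance_cluster(alarm: Alarm) -> str:
    # Optimisation rather than healing: the role Watcher plays.
    return "rebalancing load away from %s" % alarm.resource

# A policy maps alarm types to remediation workflows (roughly the
# role Congress and Mistral share between them).
POLICY: Dict[str, Callable[[Alarm], str]] = {
    "host_down": evacuate_host,
    "high_cpu": rebalance_cluster,
}

def handle(alarms: List[Alarm]) -> List[str]:
    """Apply the policy to each incoming alarm; return the actions taken."""
    return [POLICY[a.alarm_type](a) for a in alarms if a.alarm_type in POLICY]

if __name__ == "__main__":
    incoming = [Alarm("compute-03", "host_down"),
                Alarm("compute-07", "high_cpu")]
    for action in handle(incoming):
        print(action)

The real integrations are of course far richer than this; the point is
only to make the intended control loop explicit.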
>
> So at the PTG last week in Denver, we held an initial cross-project
> meeting to discuss this topic.[0] It was well-attended, with
> representation from almost all of the relevant projects, and it felt
> like a very productive session to me. I shall do my best to summarise
> whilst trying to avoid any misrepresentation ...

I'm sorry that I missed the session at the PTG :)

Do you have any plans or ideas yet about what verification might look
like for the integration between all the projects in your list, and for
self-healing-specific scenarios?

During the QA sessions at the PTG we discussed HA / fault tolerance
testing. There is a proposal for a community framework for that, but we
have no plan yet for where to run such tests or how to maintain them for
OpenStack. It might be a fitting use case for this nascent SIG.

Andrea Frittoli (andreaf)
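To make the kind of scenario Andrea mentions concrete, a self-healing
verification case could be expressed as a fault-injection test roughly
like the sketch below. Every helper name here is a hypothetical
placeholder, not the API of any existing framework.

import time

def kill_service(host: str, service: str) -> None:
    """Placeholder: inject a fault, e.g. stop nova-compute on one host."""
    raise NotImplementedError("wire this up to your fault-injection tooling")

def service_is_healthy(host: str, service: str) -> bool:
    """Placeholder: ask the monitoring layer whether the service is up."""
    raise NotImplementedError("wire this up to your monitoring API")

def wait_until(predicate, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll a condition until it holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

def test_compute_service_self_heals():
    # 1. Inject the fault.
    kill_service("compute-03", "nova-compute")
    # 2. Assert that the self-healing pipeline restores the service within
    #    the recovery-time objective, without operator intervention.
    assert wait_until(lambda: service_is_healthy("compute-03", "nova-compute"))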
>
> There was general agreement that the following actions would be
> worthwhile:
>
> - Document reference stacks describing what use cases can already be
>   addressed with the existing projects. (Even better if some of
>   these stacks have already been tested in the wild.)
>
> - Document what integrations between the projects already exist at a
>   technical level. (We actually began this during the meeting, by
>   placing the projects into phases of a high-level flow, and then
>   collaboratively building a Google Drawing to show that.[1])
>
> - Collect real-world use cases from operators, including ones which
>   they would like to accomplish but cannot yet.
>
> - From the above, perform gap analysis to help shape the future
>   direction of these projects, e.g. through specs targeting those
>   gaps.
>
> - Perform overlap analysis to help ensure that the projects are
>   correctly scoped and integrate well without duplicating any
>   significant effort.[2]
>
> - Set up a SIG[3] to promote further discussion across the projects
>   and with operators. I talked to Thierry afterwards, and
>   consequently this email is the first step on that path :-)
>
> - Allocate the SIG a mailing list prefix - "[self-healing]" or
>   similar.
>
> - Set up a bi-weekly IRC meeting for the SIG.
>
> - Continue the discussion at the Sydney Forum, since it's an ideal
>   opportunity to get developers and operators together and decide
>   what the next steps should be.
>
> - Continue the discussion at the next Ops meetup in Tokyo.
>
> I got coerced^Wvolunteered to drive the next steps ;-) So far I
> have created an etherpad proposing the Forum session[4], and added it
> to the Forum wiki page[5]. I'll also add it to the SIG wiki page[6].
>
> There were a few things on which we did not reach a concrete conclusion:
>
> - What should the SIG be called? We felt that "self-healing" was
>   pretty darn close to capturing the intent of the topic. However,
>   as a natural pedant, I couldn't help but notice that technically
>   speaking, that would most undesirably exclude Watcher, because the
>   optimization it provides isn't *quite* "healing" - the word
>   "healing" implies that something is sick, and optimization can be
>   applied even when the cloud is perfectly healthy. Any suggestions
>   for a name with a marginally wider scope would be gratefully
>   received.
>
> - Should the SIG be scoped to only focus on self-healing (and
>   self-optimization) of OpenStack infrastructure, or should it also
>   include self-healing of workloads? My feeling is that we should
>   keep it scoped to the infrastructure which falls under the
>   responsibility of the cloud operators; anything user-facing would
>   be very different from a process perspective.
>
> - How should the SIG's governance be set up? Unfortunately it
>   didn't occur to me to raise this question during the discussion,
>   but I've since seen that the k8s SIG managed to make some
>   decisions in this regard[7], and stealing their idea of a PTL-type
>   model with a minimum of 2 chairs sounds good to me.
>
> - Which timezone should the IRC meeting be in? There were interested
>   parties from all the usual continents, so no one time would suit
>   everyone. I guess I can just submit a review to the irc-meetings
>   repo and we can have a voting war in Gerrit ;-/ Another option
>   would be to alternate timezones every week or two.
>
> Feedback on any of this is of course most welcome! After sending
> this, I'll forward it to openstack-{dev,operators} and ask for any
> feedback to be submitted here.
>
> Thanks,
> Adam
>
> [0] https://etherpad.openstack.org/p/self-healing-queens-ptg
>
> [1] https://goo.gl/Pf2KgJ
>
> [2] Sampath (Masakari PTL), Saad (Freezer PTL), and I had a productive
>     follow-up discussion on how we could aim to re-scope these two
>     projects to avoid unnecessary duplication of effort.
>
> [3] https://ttx.re/introducing-sigs.html
>
> [4] https://etherpad.openstack.org/p/self-healing-rocky-forum
>
> [5] https://wiki.openstack.org/wiki/Forum/Sydney2017
>
> [6] https://wiki.openstack.org/wiki/OpenStack_SIGs
>
> [7] https://etherpad.openstack.org/p/queens-ptg-sig-k8s