[Openstack-sigs] [self-healing][PTG][congress][monasca] etherpad for PTG session on self-healing

Adam Spiers aspiers at suse.com
Tue Mar 13 17:51:35 UTC 2018


Hi Nemat,

Sorry for the slow reply, it's been busy with the PTG and the
resulting backlog :-)

Nematollah Bidokhti <Nematollah.Bidokhti at huawei.com> wrote:
>Hi Adam,
>
>I divide the self-healing issues into two categories (maybe there are more):
>
>1- Known issues (these are the error codes that have been identified by individual projects)
>2- Unknown issues (things that we find out based on real-time monitoring and anomaly detection)

Yes, that's one reasonable way to divide them.

>What will be the initial focus of the self-healing SIG?

The first, since the "self-" part of self-healing requires that we can
automate actions to address issues which need fixing, and that in turn
requires a strong understanding of those issues.

>The 2nd item is a lot more complicated and will take time to define
>and implement.

Absolutely.  I see that as a much longer-term goal.

> The 1st one is feasible and can be achieved in reasonable time. This
> also depends on the type of issues.

Exactly, we have to walk before we can run :-)

>For example, networking issues could be difficult to monitor and recover from.

It's interesting that you picked networking issues as an example,
since NIC failure is one of the first use cases which has already been
tackled:

    https://www.openstack.org/videos/sydney-2017/advanced-fault-management-with-vitrage-and-mistral

and I also mentioned this last night in a presentation to the London
OpenStack Meetup:

    https://aspiers.github.io/openstack-meetup-london-march-2018-self-healing/#/use-case-1

But of course there are many different types of networking issues, and
tackling them as a whole is much harder than tackling an individual
failure case :-)

Cheers,
Adam


