Open Stack

Mon Jan 16 14:39:30 UTC 2017

From: Yujun Zhang <zhangyujun+zte at gmail.com>
Date: Sunday, 15 January 2017 at 17:53

About fault and alarm: what I was thinking about is the causal/deduction chain in root cause analysis.

Fault state means the resource is not fully functional; it is evaluated from related indicators. There are alarms on events such as power loss and on measurands such as high CPU, low memory, or high temperature. There are also alarms based on deduced state, such as "host fault" and "instance fault".

So an example chain would be:
· "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss" =(inspect)=> "FAULT: host is unavailable" =(action)=> "ALARM: host fault"
· "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss" =(inspect)=> "FAULT: host is unavailable" =(inspect)=> "FAULT: instance is unavailable" =(action)=> "ALARM: instance fault"
If we omit the resource, then we get the causal chain as it is in Vitrage:
· "ALARM: host power loss" =(causes)=> "ALARM: host fault"
· "ALARM: host power loss" =(causes)=> "ALARM: instance fault"
But what the user cares about might be that "FAULT: power line cut off" causes all these alarms. What I haven't made clear yet is the equivalence between fault and alarm.
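
To make the idea concrete, here is a rough Python sketch of the two chains above. It is only an illustration: the `CHAINS` data and the helper names `alarm_edges` and `root_fault` are hypothetical and do not come from Vitrage. It shows how dropping the FAULT nodes yields the alarm-to-alarm edges, and how the underlying fault could still be recovered if the full chain were kept.

```python
# Hypothetical model of the two example chains; not Vitrage code or its API.
CHAINS = [
    [("FAULT", "power line cut off"),
     ("ALARM", "host power loss"),
     ("FAULT", "host is unavailable"),
     ("ALARM", "host fault")],
    [("FAULT", "power line cut off"),
     ("ALARM", "host power loss"),
     ("FAULT", "host is unavailable"),
     ("FAULT", "instance is unavailable"),
     ("ALARM", "instance fault")],
]


def alarm_edges(chain):
    """Drop FAULT nodes and link each remaining alarm to the next one."""
    alarms = [name for kind, name in chain if kind == "ALARM"]
    return list(zip(alarms, alarms[1:]))


def root_fault(chain):
    """Return the first FAULT in the chain, i.e. the underlying cause."""
    return next(name for kind, name in chain if kind == "FAULT")


for chain in CHAINS:
    for cause, effect in alarm_edges(chain):
        print(f'"ALARM: {cause}" =(causes)=> "ALARM: {effect}"')
    print(f'  root cause: "FAULT: {root_fault(chain)}"')
```

With the full chain kept, both alarm edges trace back to the same root fault, which is the information lost when only alarms are linked.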

I may have made it more complex with my immature thoughts. It could be even more complex if we consider multiple upstream causes and downstream outcomes. It may be an interesting topic to discuss in a design session.

[Ifat] I agree. Let’s discuss this in our next design session.



[openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator
