[openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

Yujun Zhang zhangyujun+zte at gmail.com
Sun Jan 15 15:53:14 UTC 2017


Regarding faults and alarms, here is what I was thinking about the
causal/deduction chain in root cause analysis.

A fault state means the resource is not fully functional, and it is evaluated
from related indicators. There are alarms on events such as power loss, and on
measurands such as high CPU, low memory, or high temperature. There are also
alarms based on deduced state, such as "host fault" or "instance fault".

So an example chain would be

   - "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss"
   =(inspect)=> "FAULT: host is unavailable" =(action)=> "ALARM: host fault"
   - "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss"
   =(inspect)=> "FAULT: host is unavailable" =(inspect)=> "FAULT: instance is
   unavailable" =(action)=> "ALARM: instance fault"

If we omit the resource, then we get the causal chain as it is in Vitrage

   - "ALARM: host power loss" =(causes)=> "ALARM: host fault"
   - "ALARM: host power loss" =(causes)=> "ALARM: instance fault"
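The collapse from the full fault/alarm chain to the alarm-only "causes" view
can be sketched as a small graph walk. This is a hypothetical illustration,
not Vitrage code; the node names are taken from the chains above:

```python
# The fault/alarm chains above as a directed graph. The alarm-only
# "causes" view is derived by skipping through the intermediate FAULT nodes.

CHAIN = {
    "FAULT: power line cut off": ["ALARM: host power loss"],
    "ALARM: host power loss": ["FAULT: host is unavailable"],
    "FAULT: host is unavailable": ["ALARM: host fault",
                                   "FAULT: instance is unavailable"],
    "FAULT: instance is unavailable": ["ALARM: instance fault"],
}

def alarm_causes(graph):
    """Collapse FAULT nodes: yield (alarm, caused_alarm) pairs."""
    def reachable_alarms(node):
        for succ in graph.get(node, []):
            if succ.startswith("ALARM"):
                yield succ
            else:  # skip through an intermediate fault node
                yield from reachable_alarms(succ)
    for node in graph:
        if node.startswith("ALARM"):
            for caused in reachable_alarms(node):
                yield (node, caused)

print(sorted(alarm_causes(CHAIN)))
```

Running this reproduces exactly the two "causes" relations listed above,
which is the information Vitrage keeps once the resources are omitted.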

But what the user cares about might be that "FAULT: power line cut off"
causes all these alarms. What I haven't made clear yet is the equivalence
between faults and alarms.

I may have made things more complex with my *immature* thoughts. It could be
even more complex if we consider multiple upstream causes and downstream
outcomes. It may be an interesting topic to discuss in a design session.

On Sun, Jan 15, 2017 at 9:21 PM Afek, Ifat (Nokia - IL) <ifat.afek at nokia.com>
wrote:

> Hi Yinliyin,
>
>
>
> There are two use cases:
>
> One is yours, where you have a single monitor that generates “real”
> alarms, and Vitrage that generates deduced alarms.
>
> Another is where someone has a few monitors, and there might be a
> collision/equivalence between their alarms.
>
>
>
> The solution that you suggested might solve the first use case, but I
> wouldn’t want to ignore the second one, which is also valid.
>
>
>
> Regarding some of your specific suggestions:
>
> 1. In templates, we only define the alarm entity for the datasource
> that reports the alarm, such as Nagios.
>
> [Ifat] This will only work for a single monitor.
>
>        2. When the evaluator deduces an alarm, it raises the alarm with
> the type set to the datasource that would have reported the alarm, not to
> vitrage.
>
> [Ifat] I don’t think this is right. In the Vitrage Alarm view in the UI,
> displaying the deduced alarm as “Nagios” is misleading, since Nagios did
> not report this alarm.
>
>
>
> I can think of a solution that is specific to the deduced alarms case,
> where we will replace a Vitrage alarm with a “real” alarm whenever there is
> a collision. This solution is easier, but we should carefully examine all
> use cases to make sure there is no ambiguity. However, for the more general
> use case I would prefer the option that we discussed in a previous mail, of
> having two (or more) alarms connected with an ‘equivalent’ relationship.
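To make the 'equivalent'-relationship option concrete, here is a minimal
sketch (illustrative names only, not the Vitrage API or data model): both the
monitor-reported alarm and the deduced alarm stay in the graph as separate
vertices linked by an 'equivalent' edge, and one representative is chosen for
display, which sidesteps the timestamp/severity conflicts:

```python
# Hypothetical sketch: two equivalent alarms from different sources kept
# side by side, with a display-time choice of representative.

alarms = {
    "a1": {"name": "host power loss", "source": "nagios",
           "severity": "critical", "timestamp": "2017-01-15T12:00:00"},
    "a2": {"name": "host power loss", "source": "vitrage",
           "severity": "warning", "timestamp": "2017-01-15T12:00:05"},
}
edges = [("a1", "a2", "equivalent")]  # both vertices remain in the graph

SEVERITY_RANK = {"critical": 3, "warning": 2, "info": 1}

def representative(alarm_ids):
    """Pick the alarm to display for an equivalence group: the most
    severe one, preferring a monitor-reported alarm over a deduced one."""
    return max(alarm_ids,
               key=lambda a: (SEVERITY_RANK[alarms[a]["severity"]],
                              alarms[a]["source"] != "vitrage"))

print(representative(["a1", "a2"]))
```

The point of this design is that no information is discarded: each monitor's
timestamp and severity survive on its own vertex, and only the UI has to
resolve which one to show.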
>
>
>
> What do you think?
>
> Ifat.
>
>
>
>
>
> *From: *"yinliyin at zte.com.cn" <yinliyin at zte.com.cn>
> *Date: *Saturday, 14 January 2017 at 09:57
>
> ·         It won’t solve the general problem of two different monitors
> that raise the same alarm
>
> ·           [yinliyin] Generally, we would only deploy one monitor for
> the same alarm.
>
> ·         It won’t solve possible conflicts of timestamp and severity
> between different monitors
>
> ·          [yinliyin] Please see the details below.
>
> ·         It will make the decision of when to delete the alarm more
> complex (delete it when the deduced alarm is deleted? When Nagios alarm is
> deleted? both? And how to change the timestamp and severity in these cases?)
>
> ·          [yinliyin] Please see the details below.
>
>    The following is the basic idea for solving the problem in this
> situation:
>
>        1. In templates, we only define the alarm entity for the
> datasource that reports the alarm, such as Nagios.
>
>        2. When the evaluator deduces an alarm, it raises the alarm with
> the type set to the datasource that would have reported the alarm, not to
> vitrage.
>
>        3. When entity_graph gets an event from the "evaluator_queue" (all
> the alarms in the "evaluator_queue" are deduced alarms), it queries the
> graph to find out whether the same alarm was already reported by a
> datasource. If so, it discards the alarm.
>
>       4. When entity_graph gets an event from "queue", it queries the
> graph to find out whether the same alarm was already deduced by the
> evaluator. If so, it replaces the alarm in the graph with the newly
> arrived alarm reported by the datasource.
>
>      5. When the evaluator deduces that an alarm should be deleted, it
> deletes the alarm regardless of how it was generated (reported by a
> datasource or deduced by the evaluator).
>
>      6. When a datasource reports a recovery event for an alarm,
> entity_graph queries the graph to find out whether the alarm exists. If
> it does not, entity_graph discards the event.
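The six steps above can be sketched as a pair of queue handlers over a graph
keyed by alarm identity. This is a hypothetical illustration of the proposed
behaviour, not actual Vitrage code; the function and field names are made up:

```python
# Hypothetical sketch of steps 3-6: the datasource-reported alarm wins a
# collision, and deletes/recoveries are handled regardless of origin.

graph = {}  # (alarm_name, resource) -> alarm dict

def on_evaluator_event(alarm):
    """Step 3: discard a deduced alarm if a datasource already reported it."""
    key = (alarm["name"], alarm["resource"])
    if key in graph and graph[key]["source"] != "vitrage":
        return  # a 'real' alarm is already in the graph
    graph[key] = alarm

def on_datasource_event(alarm):
    """Step 4: a reported alarm replaces an equivalent deduced alarm."""
    graph[(alarm["name"], alarm["resource"])] = alarm

def on_delete(name, resource):
    """Steps 5-6: delete whatever the origin was; silently ignore a
    recovery event for an alarm that is not in the graph."""
    graph.pop((name, resource), None)
```

For example, if the evaluator deduces "host fault" on host-1 and Nagios later
reports the same alarm, the Nagios alarm replaces the deduced one; a
subsequent deduced event for the same alarm is then discarded.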
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

