[openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

Yujun Zhang zhangyujun+zte at gmail.com
Thu Jan 12 15:37:37 UTC 2017


Hi, Ifat

Your comments are quite right. See my additional explanation inline.

On Thu, Jan 12, 2017 at 5:12 PM Afek, Ifat (Nokia - IL) <ifat.afek at nokia.com>
wrote:

>
>
> One possible solution would be introducing a high-level (abstract)
> template from the user's point of view, then converting it to Vitrage
> scenario templates (or directly to the graph). The *more sources* (Nagios,
> Vitrage deduction) we get from the system for an abstract alarm, the *more
> confidence* we have in a real fault. And the confidence of an alarm could
> be included in the scenario condition.
>
>
>
> [Ifat] I understand your idea, not sure yet if it helps with the use case.
>
> How would you imagine the ‘confidence’ property? As Boolean or a counter?
> One option is ‘deduced’ vs. ‘monitored’.
>
> Another option is to count the number of monitors that reported it.
>

'deduced' vs 'monitored' would be good enough for most cases. Unless we
identify some real use case, I also think there is no need to bring in a
quantitative indicator like a counter or a probability.


> Personally, I don’t think this is needed. I think that if Nagios reports
> an error, then it is confident enough without getting it from another
> monitor.
>

You are right. We would consider a reported alarm a reliable indicator of
a fault. What I was thinking about is: when the alarm is not seen, can we
be sure there is no fault?

Another situation is a slow upstream alarm combined with fast downstream
alarms. I don't have an actual example at the moment, so please allow me to
imagine an extreme case.

Suppose a host fault causes instance faults, but due to some restriction
the host is scanned only once an hour, while the instances can be scanned
every second. Now we get alarms from 10 instances on the same host. Can we
deduce that the host is likely in a fault state? We could raise a "deduced"
alarm on the host and trigger an immediate scan, which may result in a
"monitored" alarm. In this way, we reduce the time needed to detect the
root cause, i.e. the host fault.
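
Roughly, I imagine something like the following (a self-contained Python
sketch; the threshold and function names are made up, not the existing
evaluator API):

ALARMED_INSTANCE_THRESHOLD = 10


def suspect_host_faults(alarmed_instances, instance_to_host):
    """Return hosts whose alarmed-instance count crosses the threshold.

    alarmed_instances: iterable of instance ids that currently have an alarm.
    instance_to_host: mapping of instance id -> host id.
    """
    counts = {}
    for instance in alarmed_instances:
        host = instance_to_host.get(instance)
        if host is not None:
            counts[host] = counts.get(host, 0) + 1
    return [host for host, count in counts.items()
            if count >= ALARMED_INSTANCE_THRESHOLD]


if __name__ == '__main__':
    # 10 instances on host-1 are alarming, one instance on host-2 is alarming.
    alarmed = ['vm-%d' % i for i in range(11)]
    placement = {'vm-%d' % i: ('host-1' if i < 10 else 'host-2')
                 for i in range(11)}
    for host in suspect_host_faults(alarmed, placement):
        # Here we would raise a "deduced" alarm on the host and trigger an
        # immediate host scan, hoping to confirm it with a "monitored" alarm.
        print('suspected fault on', host)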

An alternative solution is to distinguish fault from alarm. An alarm is
actually a reflection of a fault status. Besides the directly linked alarm,
the fault status can also be deduced from downstream alarms. I haven't
thought this model through yet; it just flashed across my mind. Any
comments are welcome.
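
As a very rough illustration of what I mean (the names are hypothetical,
just one possible shape for such a model):

from dataclasses import dataclass, field
from typing import List


@dataclass
class Alarm:
    name: str
    source: str       # e.g. 'nagios', 'zabbix', or 'vitrage' for deduced alarms
    resource_id: str  # the resource the alarm is attached to


@dataclass
class Fault:
    resource_id: str
    # alarms raised directly on the faulty resource
    direct_alarms: List[Alarm] = field(default_factory=list)
    # alarms on downstream resources that point back to this fault
    downstream_alarms: List[Alarm] = field(default_factory=list)

    @property
    def confirmed(self):
        # A fault reported directly by a monitor is confirmed; one inferred
        # only from downstream alarms would stay "deduced".
        return len(self.direct_alarms) > 0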