[openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

Yujun Zhang zhangyujun+zte at gmail.com
Wed Jan 11 10:12:41 UTC 2017


I have just realized that "abstract alarm" is not a good term. What I was
really talking about is the distinction between a *fault* and an *alarm*.

A fault is what actually happens; an alarm is how it is detected (or
deduced).
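
To make the distinction concrete, here is a minimal Python sketch. It is only
an illustration: the class and field names (Fault, Alarm, source, deduced) are
my own assumptions and are not part of the Vitrage data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Alarm:
    """One detection (or deduction) of a fault. Hypothetical, not a Vitrage class."""
    name: str
    source: str    # e.g. "nagios" or "vitrage"
    deduced: bool  # True if raised by an evaluator rather than reported by a monitor


@dataclass
class Fault:
    """What actually happened on a resource; alarms are only evidence for it."""
    name: str
    resource: str
    alarms: List[Alarm] = field(default_factory=list)

    def sources(self) -> List[str]:
        # Distinct sources that detected or deduced this fault, sorted by name.
        return sorted({alarm.source for alarm in self.alarms})


fault = Fault(name="host_down", resource="compute-1")
fault.alarms.append(Alarm("host_down", source="vitrage", deduced=True))
fault.alarms.append(Alarm("host_down", source="nagios", deduced=False))
print(fault.sources())
```

In this model the end user looks at one Fault per resource, while each Alarm
records how that fault was noticed.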

On Wed, Jan 11, 2017 at 5:13 PM Yujun Zhang <zhangyujun+zte at gmail.com>
wrote:

> Yes, if we consider the Vitrage scenario evaluator as a pseudo monitor.
>
> I think YinLiYin's idea is a reasonable requirement from the end user's
> point of view. End users care more about the *real faults* in the system
> than about how they are detected. Although it will pose significant design
> and engineering challenges, it creates value for customers. I'm quite
> positive about this evolution.
>
> One possible solution would be to introduce a high-level (abstract)
> template written from the user's view, and then convert it to Vitrage
> scenario templates (or directly to the graph). The *more sources* (nagios,
> vitrage deduction) we get from the system for an abstract alarm, the *more
> confidence* we have that there is a real fault. The confidence of an alarm
> could then be included in the scenario condition.
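
A rough Python sketch of the confidence idea follows. It is an assumption on
my side, not an existing Vitrage feature: the per-source reliability numbers,
the function names and the noisy-OR combination rule are all hypothetical and
only show how more sources could translate into more confidence.

```python
from typing import Dict, Iterable

# Hypothetical per-source reliabilities; these numbers are illustrative only.
SOURCE_RELIABILITY: Dict[str, float] = {"nagios": 0.9, "vitrage": 0.6}


def fault_confidence(sources: Iterable[str], default: float = 0.5) -> float:
    """Combine independent sources with a noisy-OR: 1 - prod(1 - r_i)."""
    miss = 1.0
    for source in set(sources):  # count each source only once
        miss *= 1.0 - SOURCE_RELIABILITY.get(source, default)
    return 1.0 - miss


def condition_met(sources: Iterable[str], threshold: float) -> bool:
    # A scenario condition could compare the confidence against a threshold.
    return fault_confidence(sources) >= threshold
```

With these made-up numbers, a deduction alone stays below a 0.9 threshold,
while a deduction confirmed by Nagios passes it.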
>
> On Wed, Jan 11, 2017 at 4:08 PM Afek, Ifat (Nokia - IL) <
> ifat.afek at nokia.com> wrote:
>
> You are right. But as I see it, the case of Vitrage suspect vs. the real
> Nagios alarm is just one example of the more general case of two monitors
> reporting the same alarm.
>
> Don’t you think so?
>
>
>
> *From: *Yujun Zhang <zhangyujun+zte at gmail.com>
>
>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev at lists.openstack.org>
>
> *Date: *Wednesday, 11 January 2017 at 09:46
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev at lists.openstack.org>, "yinliyin at zte.com.cn" <
> yinliyin at zte.com.cn>
> *Cc: *"han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "
> wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "zhang.yujunz at zte.com.cn"
> <zhang.yujunz at zte.com.cn>, "jia.peiyuan at zte.com.cn" <
> jia.peiyuan at zte.com.cn>, "gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>
>
>
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
>
> Hi, Ifat
>
>
>
> If I understand correctly, your concern is mainly about the same alarm
> being reported by different monitors, not about the "suspect" status
> discussed in another thread.
>
>
>
> On Tue, Jan 10, 2017 at 10:21 PM Afek, Ifat (Nokia - IL) <
> ifat.afek at nokia.com> wrote:
>
> Hi Yinliyin,
>
>
>
> At first I thought that making "deduced" a property of the alarm might
> help in solving your use case. But now I think most of the problems
> will remain the same:
>
>
>
> ·  It won’t solve the general problem of two different monitors that
> raise the same alarm
>
> ·  It won’t solve possible conflicts of timestamp and severity between
> different monitors
>
> ·  It will make the decision of when to delete the alarm more complex
> (delete it when the deduced alarm is deleted? When the Nagios alarm is
> deleted? Both? And how should the timestamp and severity change in these
> cases?)
>
>
>
> So I don’t think that making this change is beneficial.
>
> What do you think?
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinliyin at zte.com.cn" <yinliyin at zte.com.cn>
> *Date: *Monday, 9 January 2017 at 05:29
> *To: *"Afek, Ifat (Nokia - IL)" <ifat.afek at nokia.com>
> *Cc: *"openstack-dev at lists.openstack.org" <
> openstack-dev at lists.openstack.org>, "han.jing28 at zte.com.cn" <
> han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "
> zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>, "
> jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>, "gong.yahui5 at zte.com.cn"
> <gong.yahui5 at zte.com.cn>
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
>
> Hi Ifat,
>
>          I think there is a situation in which all the alarms are reported by
> the monitored system. We use Vitrage to:
>
>             1.  Find the relationships between the alarms, and find the root
> cause.
>
>             2.  Deduce an alarm before it has actually occurred. This comprises
> two aspects:
>
>                  1) A causes B:  when A occurs, we deduce that B will
> occur
>
>                  2) B is caused by A:  when B occurs, we deduce that A
> must have occurred
>
>             In "2", we do expect Vitrage to raise the alarm before the
> alarm is reported, because the report could be lost or delayed for some
> reason.  So we would write "raise alarm" actions in the scenarios of the
> template.  I think that whether the alarm is reported or deduced should be a
> state property of the alarm. The reported vertex and the deduced vertex of
> the same alarm should be merged into one vertex.
>
>
>
>      Best Regards,
>
>      Yinliyin.
>
> Original Mail
>
> *From:* <ifat.afek at nokia.com>;
>
> *To:* <openstack-dev at lists.openstack.org>;
>
> *Cc:* 韩静00006838;王维雅00042110;章宇军10200531;贾培源10101785;龚亚辉6092001895;
>
> *Date:* 2017-01-07 02:18
>
> *Subject:* *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator*
>
>
>
> Hi YinLiYin,
>
>
>
> This is an interesting question. Let me divide my answer into two parts.
>
>
>
> First, the case that you described with Nagios and Vitrage. This problem
> depends on the specific Nagios tests that you configure in your system, as
> well as on the Vitrage templates that you use. For example, you can use
> Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced
> alarms on the virtual and application layers. This way you will never have
> duplicated alarms. If you want to use Nagios to monitor the other layers
> as well, you can simply modify Vitrage templates so they don’t raise the
> deduced alarms that Nagios may generate, and use the templates to show RCA
> between different Nagios alarms.
>
>
>
> Now let’s talk about the more general case. Vitrage can receive alarms
> from different monitors, including Nagios, Zabbix, collectd and Aodh. If
> you are using more than one monitor, it is possible that the same alarm
> (maybe with a different name) will be raised twice. We need to create a
> mechanism to identify such cases and create a single alarm with the
> properties of both monitors. This has not been designed in detail yet, so
> if you have any suggestions we will be happy to hear them.
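
One possible shape for such a mechanism, sketched in Python. Nothing here is
an existing Vitrage API: the alias table, the alarm names in it, the severity
ordering and the merge policy (worst severity, earliest timestamp) are all
assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical alias table: (monitor, monitor-specific name) -> canonical name.
ALIASES: Dict[Tuple[str, str], str] = {
    ("nagios", "CPU load"): "cpu_high",
    ("zabbix", "Processor load is too high"): "cpu_high",
}
SEVERITY_ORDER = ["OK", "WARNING", "CRITICAL"]  # assumed ordering, lowest first


@dataclass
class MergedAlarm:
    name: str
    resource: str
    severity: str
    first_seen: float
    per_monitor: Dict[str, dict] = field(default_factory=dict)


def merge(store: Dict[Tuple[str, str], MergedAlarm], monitor: str, name: str,
          resource: str, severity: str, timestamp: float) -> MergedAlarm:
    """Fold one monitor report into the single alarm for (canonical name, resource)."""
    canonical = ALIASES.get((monitor, name), name)
    key = (canonical, resource)
    alarm = store.get(key)
    if alarm is None:
        alarm = store[key] = MergedAlarm(canonical, resource, severity, timestamp)
    else:
        # One possible policy: keep the worst severity and the earliest timestamp.
        alarm.severity = max(alarm.severity, severity, key=SEVERITY_ORDER.index)
        alarm.first_seen = min(alarm.first_seen, timestamp)
    # Keep each monitor's original properties so nothing is lost in the merge.
    alarm.per_monitor[monitor] = {"name": name, "severity": severity,
                                  "timestamp": timestamp}
    return alarm
```

Keeping the per-monitor properties next to the merged values leaves room to
decide later how conflicts of timestamp and severity should be resolved.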
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinliyin at zte.com.cn" <yinliyin at zte.com.cn>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev at lists.openstack.org>
> *Date: *Friday, 6 January 2017 at 03:27
> *To: *"openstack-dev at lists.openstack.org" <
> openstack-dev at lists.openstack.org>
> *Cc: *"gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>, "
> han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <
> wang.weiya at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>,
> "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
> *Subject: *[openstack-dev] [Vitrage] About alarms reported by datasource
> and the alarms generated by vitrage evaluator
>
>
>
> Hi all,
>
>    Vitrage generates alarms according to the templates. All the alarms
> raised by Vitrage have the type "vitrage". Suppose Nagios has an alarm A.
> If alarm A is first raised by the Vitrage evaluator according to the action
> part of a scenario, the type of alarm A is "vitrage". If Nagios reports
> alarm A later, a new alarm A with type "Nagios" is generated in the entity
> graph. There are then two vertices for the same alarm in the graph, and we
> have to define two alarm entities, two relationships, and two scenarios in
> the template file to make the alarm propagation procedure work.
>
>    This makes it inconvenient to describe the fault model of a system with
> a lot of alarms. How can we solve this problem?
>
>
>
> 殷力殷 YinLiYin
>
>
>