[openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator
Yujun Zhang
zhangyujun+zte at gmail.com
Wed Jan 11 09:13:04 UTC 2017
Yes, if we consider the Vitrage scenario evaluator as a pseudo monitor.
I think YinLiYin's idea is a reasonable requirement from the end user's
point of view. End users care more about the *real faults* in the system
than about how they are detected. Though it will be challenging in design
and engineering, it creates value for customers. I'm quite positive about
this evolution.
One possible solution would be to introduce a high-level (abstract) template
from the user's point of view, and then convert it to Vitrage scenario
templates (or directly to the graph). The *more sources* (Nagios, Vitrage
deduction) report an abstract alarm, the *more confidence* we have that it
is a real fault. The confidence of an alarm could also be included in the
scenario condition.
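
To make the confidence idea a bit more concrete, here is a minimal sketch in Python. It assumes that each source reports independently and has a reliability weight; the function name, the source keys and the weights are all hypothetical and not part of Vitrage.

```python
def alarm_confidence(reporting_sources, reliability):
    """Combine independent sources with a noisy-OR rule.

    reporting_sources: names of the sources currently reporting the alarm.
    reliability: hypothetical per-source probability that a report
                 corresponds to a real fault.
    """
    miss = 1.0
    for source in set(reporting_sources):
        # Probability that every reporting source is wrong at the same time.
        miss *= 1.0 - reliability[source]
    return 1.0 - miss


# Hypothetical weights, chosen only for illustration.
RELIABILITY = {"nagios": 0.9, "vitrage": 0.6}

one = alarm_confidence(["vitrage"], RELIABILITY)
both = alarm_confidence(["vitrage", "nagios"], RELIABILITY)
assert both > one  # more sources, more confidence
```

A scenario condition could then compare this value against a threshold, but how such a threshold would be expressed in a template is left open here.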
On Wed, Jan 11, 2017 at 4:08 PM Afek, Ifat (Nokia - IL) <ifat.afek at nokia.com>
wrote:
> You are right. But as I see it, the case of Vitrage suspect vs. the real
> Nagios alarm is just one example of the more general case of two monitors
> reporting the same alarm.
>
> Don’t you think so?
>
> *From: *Yujun Zhang <zhangyujun+zte at gmail.com>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev at lists.openstack.org>
> *Date: *Wednesday, 11 January 2017 at 09:46
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev at lists.openstack.org>, "yinliyin at zte.com.cn" <
> yinliyin at zte.com.cn>
> *Cc: *"han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "
> wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "zhang.yujunz at zte.com.cn"
> <zhang.yujunz at zte.com.cn>, "jia.peiyuan at zte.com.cn" <
> jia.peiyuan at zte.com.cn>, "gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
> Hi, Ifat
>
>
>
> If I understand correctly, your concern is mainly about the same alarm
> coming from different monitors, not about the "suspect" status discussed
> in another thread.
>
>
>
> On Tue, Jan 10, 2017 at 10:21 PM Afek, Ifat (Nokia - IL) <
> ifat.afek at nokia.com> wrote:
>
> Hi Yinliyin,
>
>
>
> At first I thought that making "deduced" a property of the alarm
> might help in solving your use case. But now I think most of the problems
> will remain the same:
>
>
>
> · It won’t solve the general problem of two different monitors that
> raise the same alarm
>
> · It won’t solve possible conflicts of timestamp and severity between
> different monitors
>
> · It will make the decision of when to delete the alarm more complex
> (delete it when the deduced alarm is deleted? When the Nagios alarm is
> deleted? Both? And how to change the timestamp and severity in these cases?)
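
The questions in the list above do not have a settled answer in this thread. Purely as an illustration, the following Python sketch shows one possible policy: keep one report per source, expose the highest severity and the earliest timestamp, and delete the alarm only when the last source has cleared it. The class, the severity scale and the integer timestamps are hypothetical and do not describe Vitrage's actual behaviour.

```python
from dataclasses import dataclass, field

SEVERITY_ORDER = ["OK", "WARNING", "CRITICAL"]  # hypothetical scale


@dataclass
class MergedAlarm:
    name: str
    # source name -> (severity, timestamp) as last reported by that source
    reports: dict = field(default_factory=dict)

    def raise_from(self, source, severity, timestamp):
        self.reports[source] = (severity, timestamp)

    def clear_from(self, source):
        """Drop one source; return True when no source is left,
        i.e. when the merged alarm can be deleted."""
        self.reports.pop(source, None)
        return not self.reports

    @property
    def severity(self):
        # Highest severity among the remaining sources (needs at least one).
        return max((s for s, _ in self.reports.values()),
                   key=SEVERITY_ORDER.index)

    @property
    def timestamp(self):
        # Earliest time at which any remaining source saw the alarm.
        return min(t for _, t in self.reports.values())


alarm = MergedAlarm("host_down")
alarm.raise_from("vitrage", "WARNING", 100)   # deduced first
alarm.raise_from("nagios", "CRITICAL", 130)   # reported later
```

Other policies (for example, always preferring the real monitor's severity) are equally conceivable; the point is only that the policy has to be chosen explicitly.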
>
>
>
> So I don’t think that making this change is beneficial.
>
> What do you think?
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinliyin at zte.com.cn" <yinliyin at zte.com.cn>
> *Date: *Monday, 9 January 2017 at 05:29
> *To: *"Afek, Ifat (Nokia - IL)" <ifat.afek at nokia.com>
> *Cc: *"openstack-dev at lists.openstack.org" <
> openstack-dev at lists.openstack.org>, "han.jing28 at zte.com.cn" <
> han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "
> zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>, "
> jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>, "gong.yahui5 at zte.com.cn"
> <gong.yahui5 at zte.com.cn>
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
>
> Hi Ifat,
>
> I think there is a situation in which all the alarms are reported by
> the monitored system. We use Vitrage to:
>
> 1. Find the relationships between the alarms, and find the root
> cause.
>
> 2. Deduce an alarm before it has really occurred. This comprises
> two aspects:
>
> 1) A causes B: when A occurs, we deduce that B will
> occur
>
> 2) B is caused by A: when B occurs, we deduce that A
> must have occurred
>
> In "2", we do expect Vitrage to raise the alarm before the
> alarm is reported, because the reported alarm may be lost or delayed for
> some reason. So we would write "raise alarm" actions in the scenarios of
> the template. I think that whether the alarm is reported or deduced should
> be a state property of the alarm. The reported vertex and the deduced
> vertex of the same alarm should be merged into one vertex.
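
The merge proposed above can be sketched in a few lines of Python. This is only an illustration that uses a plain dictionary as the graph; the function name, the `state` values and the key layout are assumptions and not Vitrage's entity graph API.

```python
def upsert_alarm(graph, resource, name, origin):
    """Add or update the single vertex for alarm `name` on `resource`.

    origin is "deduced" or "reported". A vertex that has already been
    reported by a monitor is never downgraded back to "deduced".
    """
    key = (resource, name)
    vertex = graph.setdefault(
        key, {"name": name, "resource": resource, "state": origin}
    )
    if origin == "reported":
        vertex["state"] = "reported"
    return vertex


graph = {}
upsert_alarm(graph, "host-1", "A", "deduced")   # raised by the evaluator first
upsert_alarm(graph, "host-1", "A", "reported")  # the monitor confirms it later
```

With a single vertex per alarm, a template would need only one alarm entity and one scenario for alarm A, whichever side raised it first.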
>
>
>
> Best Regards,
>
> Yinliyin.
>
> Original Message
>
> *From:* <ifat.afek at nokia.com>;
>
> *To:* <openstack-dev at lists.openstack.org>;
>
> *Cc:* 韩静00006838;王维雅00042110;章宇军10200531;贾培源10101785;龚亚辉6092001895;
>
> *Date:* 2017-01-07 02:18
>
> *Subject:* *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator*
>
>
>
> Hi YinLiYin,
>
>
>
> This is an interesting question. Let me divide my answer into two parts.
>
>
>
> First, the case that you described with Nagios and Vitrage. This problem
> depends on the specific Nagios tests that you configure in your system, as
> well as on the Vitrage templates that you use. For example, you can use
> Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced
> alarms on the virtual and application layers. This way you will never have
> duplicated alarms. If you want to use Nagios to monitor the other layers
> as well, you can simply modify Vitrage templates so they don’t raise the
> deduced alarms that Nagios may generate, and use the templates to show RCA
> between different Nagios alarms.
>
>
>
> Now let’s talk about the more general case. Vitrage can receive alarms
> from different monitors, including Nagios, Zabbix, collectd and Aodh. If
> you are using more than one monitor, it is possible that the same alarm
> (maybe with a different name) will be raised twice. We need to create a
> mechanism to identify such cases and create a single alarm with the
> properties of both monitors. This has not been designed in detail yet, so
> if you have any suggestions we will be happy to hear them.
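
As the paragraph above says, no such mechanism has been designed yet. One conceivable starting point, sketched here in Python, is an alias table that maps each monitor-specific alarm name to a canonical name, so that reports about the same fault on the same resource end up under one key. The table contents, alarm names and function names are hypothetical.

```python
# Hypothetical alias table:
# (monitor, monitor-specific alarm name) -> canonical alarm name.
ALIASES = {
    ("nagios", "CPU load"): "high_cpu_load",
    ("zabbix", "Processor load is too high"): "high_cpu_load",
}


def canonical_key(monitor, alarm_name, resource):
    """Return the identity under which reports of the same fault are grouped."""
    name = ALIASES.get((monitor, alarm_name), alarm_name)
    return (resource, name)


def group_reports(reports):
    """Group (monitor, alarm_name, resource, properties) reports by key."""
    merged = {}
    for monitor, alarm_name, resource, properties in reports:
        key = canonical_key(monitor, alarm_name, resource)
        entry = merged.setdefault(key, {})
        entry[monitor] = properties  # keep each monitor's properties side by side
    return merged


reports = [
    ("nagios", "CPU load", "host-1", {"severity": "WARNING"}),
    ("zabbix", "Processor load is too high", "host-1", {"severity": "HIGH"}),
]
merged = group_reports(reports)
```

Here both reports collapse into a single entry that still carries the properties of each monitor; an unknown alarm name simply falls back to itself as the canonical name.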
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinliyin at zte.com.cn" <yinliyin at zte.com.cn>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev at lists.openstack.org>
> *Date: *Friday, 6 January 2017 at 03:27
> *To: *"openstack-dev at lists.openstack.org" <
> openstack-dev at lists.openstack.org>
> *Cc: *"gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>, "
> han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <
> wang.weiya at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>,
> "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
> *Subject: *[openstack-dev] [Vitrage] About alarms reported by datasource
> and the alarms generated by vitrage evaluator
>
>
>
> Hi all,
>
> Vitrage generates alarms according to the templates. All the alarms
> raised by Vitrage have the type "vitrage". Suppose Nagios has an alarm A.
> If alarm A is raised by the Vitrage evaluator according to the action part
> of a scenario, the type of alarm A is "vitrage". If Nagios reports alarm A
> later, a new alarm A with type "Nagios" would be generated in the entity
> graph. There would be two vertices for the same alarm in the graph, and we
> would have to define two alarm entities, two relationships, and two
> scenarios in the template file to make the alarm propagation procedure work.
>
> This makes it inconvenient to describe the fault model of a system with a
> lot of alarms. How can we solve this problem?
>
>
>
> 殷力殷 YinLiYin
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>