[openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

Afek, Ifat (Nokia - IL) ifat.afek at nokia.com
Mon Jan 9 17:47:44 UTC 2017


Hi Yujun,

I understand the use case now, thanks for the detailed explanation.

Supporting this use case will require some development in Vitrage. Let me try to list the requirements and the options that we have.


1. Requirement: Raise ‘suspect’ deduced alarms in Vitrage.

Implementation: Quite straightforward. There is currently no way to set a ‘suspect’ property in Vitrage, but it should be easy to add this option.



2. Requirement: Change a ‘suspect’ alarm of type ‘vitrage’ to a ‘real’ alarm of type ‘nagios’.

Implementation: There are a few alternatives for achieving this goal:



a. Delete the ‘suspect’ alarm and create the ‘real’ alarm. This will require supporting a ‘not’ condition in the templates. An example scenario:

condition: vm_alarm and not nagios_alarm

   (action: create vitrage_alarm)

condition: nagios_alarm and vitrage_alarm

   (action: delete vitrage_alarm)



b. Keep both the ‘suspect’ alarm and the ‘real’ alarm, and create an ‘equivalent’ relationship between them. Configuring the template should be easy; however, it won’t look nice in the UI. In past discussions we mentioned an option to group some vertices together in the UI. If we have this option, we might want to group these two alarms together.



c. Merge the two alarms. This solution seems the most reasonable one at first, but it is not trivial. For example: suppose one alarm is defined as ‘critical’ and was raised at 10:01, and the other alarm is defined as ‘warning’ and was raised at 10:02. How will you combine the two? And if the ‘critical’ alarm is then cleared, will you know how to change the severity back to ‘warning’? In the case of Vitrage vs. Nagios we would prefer Nagios, but let’s think of the more general case of two different monitors.


3. In one of your emails you mentioned an option of having two ‘suspects’. Suppose vm_alarm is raised, will you raise two suspect vitrage alarms, e.g. host_alarm and switch_alarm? And if you then receive host_alarm from nagios, would you like to delete the suspect switch_alarm, or keep it? If you would like to delete it, it will require supporting ‘not’ in the template condition.

Personally I would go for option 2b, but I will be happy to hear your thoughts about it.
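
As a rough illustration of options 2a and 2b above, here is a toy Python model. The ToyGraph class, its (type, name) keys, and both function names are assumptions made for this sketch; Vitrage's actual template language and entity graph work differently.

```python
# Toy model only: not Vitrage's real template language or entity graph.
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    # Alarms on one resource, keyed by (alarm_type, name).
    alarms: set = field(default_factory=set)
    # 'equivalent' edges between alarm keys (option 2b).
    equivalent: set = field(default_factory=set)

    def has(self, alarm_type, name):
        return (alarm_type, name) in self.alarms


def evaluate_option_2a(graph, vm_alarm_active, name="host_alarm"):
    """Apply the two scenarios from option 2a once."""
    nagios = graph.has("nagios", name)
    vitrage = graph.has("vitrage", name)
    # condition: vm_alarm and not nagios_alarm -> create vitrage alarm
    if vm_alarm_active and not nagios:
        graph.alarms.add(("vitrage", name))
    # condition: nagios_alarm and vitrage_alarm -> delete vitrage alarm
    if nagios and vitrage:
        graph.alarms.discard(("vitrage", name))


def evaluate_option_2b(graph, name="host_alarm"):
    """Option 2b: keep both alarms and link them as 'equivalent'."""
    if graph.has("nagios", name) and graph.has("vitrage", name):
        graph.equivalent.add(frozenset({("nagios", name), ("vitrage", name)}))


# Option 2a: the suspect alarm is created, then replaced once Nagios reports.
g = ToyGraph()
evaluate_option_2a(g, vm_alarm_active=True)
g.alarms.add(("nagios", "host_alarm"))
evaluate_option_2a(g, vm_alarm_active=True)

# Option 2b: both alarms stay in the graph, joined by one edge.
g2 = ToyGraph(alarms={("vitrage", "host_alarm"), ("nagios", "host_alarm")})
evaluate_option_2b(g2)
```

In this sketch option 2a ends with a single Nagios vertex, while option 2b keeps two vertices plus an edge, which is the UI grouping concern mentioned in 2b.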

Hope I helped, but I suspect I just made things more complicated ;-)
Ifat.


From: Yujun Zhang <zhangyujun+zte at gmail.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Date: Sunday, 8 January 2017 at 17:38
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Cc: "han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>, "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

Maybe I have missed something in the scenario template, but it seems you have understood my idea quite correctly :-)

See further explanation inline
On Sun, Jan 8, 2017 at 3:06 PM Afek, Ifat (Nokia - IL) <ifat.afek at nokia.com> wrote:
Hi Yujun,

Thanks for the explanation, but I still don’t fully understand.

Let me start with the current state:
1. Introduce a flexible `metadata` dict into the ALARM entity
[Ifat] Already exists. An alarm is represented as a vertex in the entity graph, with a dictionary of properties.

[yujunz] Can the alarm vertex be updated by a scenario action? E.g. raise an alarm and set the property `suspect` to true.

2. Allow generating update event[1] on metadata change
3. Allow using ALARM metadata in scenario condition
[Ifat] Already exists. You can define properties in the ‘entities’ section in Vitrage templates

[yujunz] How do I specify a condition that checks whether a given alarm is 'suspicious', e.g. condition: host_alarm.suspect?

4. Allow setting ALARM metadata in scenario action

If I understand correctly, you are suggesting that one scenario will add metadata to an existing alarm, which will trigger an event, and as a result another scenario might be executed?

[yujunz] Exactly

Can you describe a use case where this behavior will help in calculating the root cause?

[yujunz] Here's the simplified case derived from YinLiYin's example. Suppose we add a causal relationship from `host_alarm` to `instance_alarm`, i.e. a host alarm will cause an instance alarm. If an instance alarm is detected but no host alarm is, it is "suspicious": it may have been caused by a host alarm whose event was delayed or lost. Instead of waiting for the snapshot service to update the host status, we want to run a diagnostic action to check it proactively.

In this case, we want to set the upstream (host) of a confirmed alarm (instance) to "suspect" and trigger a diagnostic action on this change.

Hope that I have made the use case clear.
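
The use case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: mark_suspects, run_diagnostics, the causes mapping and the probe callback are all hypothetical names, not Vitrage APIs.

```python
# Hypothetical sketch; none of these names are Vitrage APIs.

def mark_suspects(active_alarms, causes):
    """Return upstream alarms that are suspected but not yet reported.

    active_alarms: set of alarm names currently raised.
    causes: dict mapping an effect alarm to the alarm that may cause it.
    """
    suspects = set()
    for effect, cause in causes.items():
        # The effect is confirmed but its possible cause is missing,
        # e.g. because the event was delayed or lost.
        if effect in active_alarms and cause not in active_alarms:
            suspects.add(cause)
    return suspects


def run_diagnostics(suspects, probe):
    """Actively check each suspect; probe(name) returns True if confirmed."""
    return {name: probe(name) for name in sorted(suspects)}


# A host alarm may cause an instance alarm.
causes = {"instance_alarm": "host_alarm"}

# Only the instance alarm has been reported so far.
suspects = mark_suspects({"instance_alarm"}, causes)

# Stand-in probe that always confirms; a real check would query the host.
results = run_diagnostics(suspects, probe=lambda name: True)
```

The point of the split is that marking a suspect is cheap and rule-driven, while the diagnostic is an active check that runs only for the suspects.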

Thanks,
Ifat.


From: Yujun Zhang <zhangyujun+zte at gmail.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Date: Saturday, 7 January 2017 at 09:27
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Cc: "han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>, "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

The two questions raised by YinLiYin are actually one: how to enrich the alarm properties so that they can be used as conditions in root cause deduction.

Both 'suspect' and 'datasource' are additional information that may be referenced as a condition in a general fault model, a.k.a. a scenario in Vitrage.

It seems this could be done by:

1. Introducing a flexible `metadata` dict into the ALARM entity
2. Allowing an update event[1] to be generated on metadata change
3. Allowing ALARM metadata to be used in a scenario condition
4. Allowing ALARM metadata to be set in a scenario action

This leaves room for continued development through more complex scenario templates, while keeping the Vitrage evaluator simple and generic.
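
The four points above could fit together roughly as in the following Python sketch. The Alarm, Scenario and Evaluator classes and the `diagnosed` key are invented for illustration and do not reflect how the Vitrage evaluator is implemented.

```python
# Illustrative only: class and function names are invented for this sketch.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alarm:
    name: str
    # Point 1: a free-form metadata dict on the alarm.
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Scenario:
    # Point 3: the condition may read alarm.metadata.
    condition: Callable[[Alarm], bool]
    # Point 4: the action may set metadata through the evaluator.
    action: Callable[["Evaluator", Alarm], None]


class Evaluator:
    def __init__(self, scenarios: List[Scenario]):
        self.scenarios = scenarios
        self.log: List[str] = []

    def set_metadata(self, alarm: Alarm, key: str, value: object) -> None:
        # Point 2: only an actual change produces an update event.
        if alarm.metadata.get(key) == value:
            return
        alarm.metadata[key] = value
        self.on_update(alarm)

    def on_update(self, alarm: Alarm) -> None:
        for scenario in self.scenarios:
            if scenario.condition(alarm):
                scenario.action(self, alarm)


def diagnose(evaluator: Evaluator, alarm: Alarm) -> None:
    evaluator.log.append(f"diagnose {alarm.name}")
    # Marking the alarm as diagnosed stops the scenario from re-firing.
    evaluator.set_metadata(alarm, "diagnosed", True)


evaluator = Evaluator([
    Scenario(
        condition=lambda a: a.metadata.get("suspect") is True
        and not a.metadata.get("diagnosed"),
        action=diagnose,
    )
])
host_alarm = Alarm("host_alarm")
evaluator.set_metadata(host_alarm, "suspect", True)
```

One design caveat this makes visible: because an action can itself change metadata and so trigger another evaluation, scenarios need a guard (here the `diagnosed` flag) to avoid looping.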

My two cents.

[1]: http://docs.openstack.org/developer/vitrage/scenario-evaluator.html#concepts-and-guidelines

On Sat, Jan 7, 2017 at 2:23 AM Afek, Ifat (Nokia - IL) <ifat.afek at nokia.com> wrote:
Hi YinLiYin,

This is an interesting question. Let me divide my answer into two parts.

First, the case that you described with Nagios and Vitrage. This problem depends on the specific Nagios tests that you configure in your system, as well as on the Vitrage templates that you use. For example, you can use Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced alarms on the virtual and application layers. This way you will never have duplicated alarms. If you want to use Nagios to monitor the other layers as well, you can simply modify Vitrage templates so they don’t raise the deduced alarms that Nagios may generate, and use the templates to show RCA between different Nagios alarms.

Now let’s talk about the more general case. Vitrage can receive alarms from different monitors, including Nagios, Zabbix, collectd and Aodh. If you are using more than one monitor, it is possible that the same alarm (maybe with a different name) will be raised twice. We need to create a mechanism to identify such cases and create a single alarm with the properties of both monitors. This has not been designed in detail yet, so if you have any suggestions we will be happy to hear them.

Best Regards,
Ifat.


From: "yinliyin at zte.com.cn" <yinliyin at zte.com.cn>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Date: Friday, 6 January 2017 at 03:27
To: "openstack-dev at lists.openstack.org" <openstack-dev at lists.openstack.org>
Cc: "gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>, "han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>, "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
Subject: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator


Hi all,

   Vitrage generates alarms according to the templates. All alarms raised by Vitrage have the type "vitrage". Suppose Nagios has an alarm A. If alarm A is first raised by the Vitrage evaluator according to the action part of a scenario, its type is "vitrage". If Nagios reports alarm A later, a new alarm A with type "Nagios" is generated in the entity graph. There are then two vertices for the same alarm in the graph, and we have to define two alarm entities, two relationships and two scenarios in the template file to make the alarm propagation procedure work.

   This makes it inconvenient to describe the fault model of a system with a lot of alarms. How can this problem be solved?
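
To make the duplication concrete, here is a toy Python sketch. The (type, name, resource) key layout and the add_alarm helper are assumptions chosen for illustration, not Vitrage's actual vertex schema.

```python
# Toy illustration; the key layout is an assumption, not Vitrage's schema.

def add_alarm(graph, alarm_type, name, resource):
    # The type is part of the key, so the same alarm name coming from
    # two sources produces two distinct vertices.
    key = (alarm_type, name, resource)
    graph[key] = {"type": alarm_type, "name": name, "resource": resource}
    return key


graph = {}
add_alarm(graph, "vitrage", "A", "host-1")  # deduced by the evaluator
add_alarm(graph, "nagios", "A", "host-1")   # reported later by Nagios

# Both vertices describe the same alarm A on the same resource.
duplicates = [key for key in graph if key[1:] == ("A", "host-1")]
```

Each of the two vertices would then need its own entity, relationship and scenario definitions in the template.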



殷力殷 YinLiYin




D502, ZTE Corporation R&D Center, 889# Bibo Road,
Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203
T: +86 21 68896229
M: +86 13641895907
E: yinliyin at zte.com.cn
www.zte.com.cn



__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev