[openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

Yujun Zhang zhangyujun+zte at gmail.com
Tue Jan 10 05:34:57 UTC 2017


My instinct is to prefer option 2.b.

I am not sure whether it could be linked to the vitrage_id[1] evolution, but
if a UUID is created for the alarm, the implementation could be quite
straightforward.

[1]: https://blueprints.launchpad.net/vitrage/+spec/standard-vitrage-id
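
To illustrate why a UUID per alarm would make 2.b simple, here is a minimal
Python sketch. It is not Vitrage code: `Alarm`, `AlarmGraph`, `add_alarm` and
`equivalents_of` are hypothetical names, a plain dict stands in for the entity
graph, and matching alarms by identical name and resource is an assumption
(different monitors may use different names for the same alarm).

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Alarm:
    name: str
    source: str          # e.g. "vitrage" or "nagios"
    resource: str        # the entity the alarm is raised on
    suspect: bool = False
    # Hypothetical: every alarm vertex gets its own UUID at creation time.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


class AlarmGraph:
    """Toy stand-in for the entity graph (an assumption, not the Vitrage API)."""

    def __init__(self):
        self.alarms = {}         # id -> Alarm
        self.equivalent = set()  # frozenset({id_a, id_b}) per 'equivalent' edge

    def add_alarm(self, alarm):
        self.alarms[alarm.id] = alarm
        # Link the new alarm to any alarm with the same name on the same
        # resource that was reported by a different source.
        for other in self.alarms.values():
            if (other.id != alarm.id
                    and other.name == alarm.name
                    and other.resource == alarm.resource
                    and other.source != alarm.source):
                self.equivalent.add(frozenset((alarm.id, other.id)))
        return alarm.id

    def equivalents_of(self, alarm_id):
        found = []
        for pair in self.equivalent:
            if alarm_id in pair:
                found.extend(i for i in pair if i != alarm_id)
        return found
```

Because both vertices keep their own id, neither alarm has to be deleted or
rewritten when the other one shows up; the relationship is just an extra edge.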

On Tue, Jan 10, 2017 at 1:55 AM Afek, Ifat (Nokia - IL) <ifat.afek at nokia.com>
wrote:

> Hi Yujun,
>
>
>
> I understand the use case now, thanks for the detailed explanation.
>
>
>
> Supporting this use case will require some development in Vitrage. Let me
> try to list the requirements and the options that we have.
>
>
>
> 1.       Requirement: Raise ‘suspect’ deduced alarms in Vitrage.
>
> Implementation: Quite straightforward. There is currently no way to set a
> ‘suspect’ property in Vitrage, but it should be easy to add this option.
>
>
>
> 2.       Requirement: Change a ‘suspect’ alarm of type ‘vitrage’ to a
> ‘real’ alarm of type ‘nagios’.
>
> Implementation: There are a few alternative ways to achieve this goal:
>
>
>
> a.       Delete the ‘suspect’ alarm and create the ‘real’ alarm. This
> will require supporting a ‘not’ condition in the templates. An example
> scenario:
>
> condition: vm_alarm and not nagios_alarm:
>
>    (action: create vitrage alarm)
>
> condition: nagios_alarm and vitrage_alarm:
>
>    (action: delete vitrage_alarm)
>
>
>
> b.       Keep both the ‘suspect’ alarm and the ‘real’ alarm, and create an
> ‘equivalent’ relationship between them. Configuring the template should be
> easy; however, it won’t look nice in the UI. In past discussions we
> mentioned an option to group some vertices together in the UI. If we have
> this option, we might want to group these two alarms together.
>
>
>
> c.       Merge the two alarms. This solution seems the most reasonable
> one at first, but it is not trivial. For example: suppose one alarm is
> defined as ‘critical’ and was raised at 10:01, and the other alarm is
> defined as ‘warning’ and was raised at 10:02. How will you combine the two?
> And if the ‘critical’ alarm is then cleared, will you know how to
> change the severity back to ‘warning’? In the case of Vitrage vs. Nagios we
> would prefer Nagios, but let’s think of the more general case of
> two different monitors.
>
>
>
> 3.       In one of your emails you mentioned an option of having two
> ‘suspects’. Suppose vm_alarm is raised, will you raise two suspect vitrage
> alarms, e.g. host_alarm and switch_alarm? And if you then receive
> host_alarm from nagios, would you like to delete the suspect switch_alarm,
> or keep it? If you would like to delete it, it will require supporting
> ‘not’ in the template condition.
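
To make the difficulty in option 2.c concrete, here is a small Python sketch
of one possible merge strategy. It is an assumption for discussion, not an
existing Vitrage feature: `MergedAlarm`, `raise_from`, `clear_from` and the
`SEVERITY_ORDER` ranking are hypothetical. The idea is to keep one report per
monitor and recompute the merged severity, so that clearing the 'critical'
report falls back to 'warning'.

```python
# Hypothetical severity ranking; higher means more severe.
SEVERITY_ORDER = {"warning": 1, "critical": 2}


class MergedAlarm:
    """One logical alarm that remembers what each monitor reported."""

    def __init__(self):
        self.reports = {}  # monitor name -> (severity, raised_at)

    def raise_from(self, monitor, severity, raised_at):
        self.reports[monitor] = (severity, raised_at)

    def clear_from(self, monitor):
        self.reports.pop(monitor, None)

    @property
    def severity(self):
        # Merged severity is the worst severity still being reported.
        if not self.reports:
            return None
        return max((sev for sev, _ in self.reports.values()),
                   key=SEVERITY_ORDER.__getitem__)

    @property
    def raised_at(self):
        # Merged start time is the earliest report still active.
        if not self.reports:
            return None
        return min(ts for _, ts in self.reports.values())
```

With the 10:01 'critical' and 10:02 'warning' reports from the example, the
merged alarm is 'critical' since 10:01; once the 'critical' report is cleared
it becomes 'warning' since 10:02. The cost is that the merged vertex has to
carry per-monitor state, which is part of why merging is not trivial.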
>
>
>
> Personally I would go for option 2b, but I will be happy to hear your
> thoughts about it.
>
>
>
> Hope I helped, but I suspect I just made things more complicated ;-)
>
> Ifat.
>
>
>
>
>
> *From: *Yujun Zhang <zhangyujun+zte at gmail.com>
>
>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev at lists.openstack.org>
>
> *Date: *Sunday, 8 January 2017 at 17:38
>
>
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev at lists.openstack.org>
> *Cc: *"han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "
> wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "gong.yahui5 at zte.com.cn" <
> gong.yahui5 at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>,
> "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
> Maybe I have missed something in the scenario template, but it seems you
> have understood my idea quite correctly :-)
>
>
>
> See further explanation inline
>
> On Sun, Jan 8, 2017 at 3:06 PM Afek, Ifat (Nokia - IL) <
> ifat.afek at nokia.com> wrote:
>
> Hi Yujun,
>
>
>
> Thanks for the explanation, but I still don’t fully understand.
>
>
>
> Let me start with the current state:
>
> 1.       Introduce a flexible `metadata` dict into the ALARM entity
>
> [Ifat] Already exists. An alarm is represented as a vertex in the entity
> graph, with a dictionary of properties.
>
>
>
>  [yujunz] Can the alarm vertex be updated by a scenario action? E.g. raise
> an alarm and set the property `suspect` to true.
>
>
>
> 2.       Allow generating update event[1] on metadata change
>
> 3.       Allow using ALARM metadata in scenario condition
>
> [Ifat] Already exists. You can define properties in the ‘entities’ section
> in Vitrage templates
>
>
>
> [yujunz] How do I write a condition that checks whether a given alarm is
> 'suspicious', e.g. condition: host_alarm.suspect ?
>
>
>
> 4.       Allow setting ALARM metadata in scenario action
>
>
>
> If I understand correctly, you are suggesting that one scenario will add
> metadata to an existing alarm, which will trigger an event, and as a result
> another scenario might be executed?
>
>
>
> [yujunz] Exactly
>
>
>
> Can you describe a use case where this behavior will help in calculating
> the root cause?
>
>
>
> [yujunz] Here's a simplified case derived from YinLiYin's example.
> Suppose we add a causal relationship from `host_alarm` to `instance_alarm`,
> i.e. a host alarm will cause an instance alarm. If an instance alarm is
> detected but no host alarm is present, we "suspect" that it may have been
> caused by a host alarm; the reason could be a delayed or lost event.
> Instead of waiting for the snapshot service to update the host status, we
> want to run a diagnostic action to check it proactively.
>
>
>
> In this case, we want to set the upstream (host) of a confirmed alarm
> (instance) to "suspect" and trigger a diagnostic action on this change.
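
The use case above could be sketched in Python roughly as follows. This is an
illustration only, not how the Vitrage evaluator works: `CAUSES`,
`find_suspects` and the `hosted_on` mapping are hypothetical, and alarms are
reduced to (name, resource) pairs.

```python
# Causal relationship from the example: effect alarm -> upstream cause alarm.
CAUSES = {"instance_alarm": "host_alarm"}


def find_suspects(active, hosted_on):
    """Return upstream alarms that are expected but have not been reported.

    active:    set of (alarm_name, resource) pairs that are confirmed
    hosted_on: dict mapping an instance to the host it runs on
    """
    suspects = set()
    for name, resource in active:
        cause = CAUSES.get(name)
        upstream = hosted_on.get(resource)
        # A confirmed downstream alarm without its upstream cause makes the
        # upstream alarm a suspect, e.g. because the event was delayed or lost.
        if cause and upstream and (cause, upstream) not in active:
            suspects.add((cause, upstream))
    return suspects
```

Each returned pair is a candidate for a diagnostic action; once the host alarm
is confirmed by its monitor it is no longer returned as a suspect.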
>
>
>
> Hope that I have made the use case clear.
>
>
>
> Thanks,
>
> Ifat.
>
>
>
>
>
> *From: *Yujun Zhang <zhangyujun+zte at gmail.com>
>
>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev at lists.openstack.org>
>
> *Date: *Saturday, 7 January 2017 at 09:27
>
>
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev at lists.openstack.org>
>
> *Cc: *"han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "
> wang.weiya at zte.com.cn" <wang.weiya at zte.com.cn>, "gong.yahui5 at zte.com.cn" <
> gong.yahui5 at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>,
> "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
>
> The two questions raised by YinLiYin are actually one, i.e. *how to enrich
> the alarm properties* that can be used as a condition in root cause
> deduction.
>
>
>
> Both 'suspect' and 'datasource' are additional information that may be
> referenced in a condition of a general fault model, a.k.a. a scenario in
> Vitrage.
>
>
>
> It seems this could be done by:
>
> 1.      Introduce a flexible `metadata` dict into the ALARM entity
>
> 2.      Allow generating update event[1] on metadata change
>
> 3.      Allow using ALARM metadata in scenario condition
>
> 4.      Allow setting ALARM metadata in scenario action
>
> This leaves room for continued development through more complex scenario
> templates, while keeping the Vitrage evaluator simple and generic.
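
The four steps listed above can be sketched as a toy event loop in Python.
Everything here is hypothetical and only meant to show the control flow:
`Alarm`, `Evaluator`, `add_scenario`, `set_metadata` and `run_diagnostic` are
illustrative names, not part of Vitrage.

```python
class Alarm:
    def __init__(self, name):
        self.name = name
        self.metadata = {}        # step 1: flexible metadata dict


class Evaluator:
    def __init__(self):
        self.scenarios = []       # list of (condition, action) pairs
        self.log = []

    def add_scenario(self, condition, action):
        self.scenarios.append((condition, action))

    def set_metadata(self, alarm, key, value):
        # step 4: an action (or a datasource) sets alarm metadata
        if alarm.metadata.get(key) == value:
            return                # unchanged value, no update event
        alarm.metadata[key] = value
        self.on_update(alarm)     # step 2: metadata change emits an update

    def on_update(self, alarm):
        for condition, action in self.scenarios:
            if condition(alarm):  # step 3: condition reads alarm metadata
                action(self, alarm)


def run_diagnostic(evaluator, alarm):
    evaluator.log.append("diagnose " + alarm.name)
    # Marking the alarm as diagnosed keeps the scenario from firing again.
    evaluator.set_metadata(alarm, "diagnosed", True)


evaluator = Evaluator()
evaluator.add_scenario(
    lambda a: a.metadata.get("suspect") and not a.metadata.get("diagnosed"),
    run_diagnostic,
)
host_alarm = Alarm("host_alarm")
evaluator.set_metadata(host_alarm, "suspect", True)
```

Note that an action which sets metadata re-enters the evaluator, so a real
design would need some guard against scenarios triggering each other in a
loop; the `diagnosed` flag plays that role in this sketch.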
>
>
>
> My two cents.
>
>
>
> [1]:
> http://docs.openstack.org/developer/vitrage/scenario-evaluator.html#concepts-and-guidelines
>
>
>
>
> On Sat, Jan 7, 2017 at 2:23 AM Afek, Ifat (Nokia - IL) <
> ifat.afek at nokia.com> wrote:
>
> Hi YinLiYin,
>
>
>
> This is an interesting question. Let me divide my answer into two parts.
>
>
>
> First, the case that you described with Nagios and Vitrage. This problem
> depends on the specific Nagios tests that you configure in your system, as
> well as on the Vitrage templates that you use. For example, you can use
> Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced
> alarms on the virtual and application layers. This way you will never have
> duplicated alarms. If you want to use Nagios to monitor the other layers as
> well, you can simply modify Vitrage templates so they don’t raise the
> deduced alarms that Nagios may generate, and use the templates to show RCA
> between different Nagios alarms.
>
>
>
> Now let’s talk about the more general case. Vitrage can receive alarms
> from different monitors, including Nagios, Zabbix, collectd and Aodh. If
> you are using more than one monitor, it is possible that the same alarm
> (maybe with a different name) will be raised twice. We need to create a
> mechanism to identify such cases and create a single alarm with the
> properties of both monitors. This has not been designed in detail yet, so
> if you have any suggestions we will be happy to hear them.
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinliyin at zte.com.cn" <yinliyin at zte.com.cn>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev at lists.openstack.org>
> *Date: *Friday, 6 January 2017 at 03:27
> *To: *"openstack-dev at lists.openstack.org" <
> openstack-dev at lists.openstack.org>
> *Cc: *"gong.yahui5 at zte.com.cn" <gong.yahui5 at zte.com.cn>, "
> han.jing28 at zte.com.cn" <han.jing28 at zte.com.cn>, "wang.weiya at zte.com.cn" <
> wang.weiya at zte.com.cn>, "jia.peiyuan at zte.com.cn" <jia.peiyuan at zte.com.cn>,
> "zhang.yujunz at zte.com.cn" <zhang.yujunz at zte.com.cn>
> *Subject: *[openstack-dev] [Vitrage] About alarms reported by datasource
> and the alarms generated by vitrage evaluator
>
> Hi all,
>
>    Vitrage generates alarms according to the templates. All the alarms
> raised by Vitrage have the type "vitrage". Suppose Nagios defines an alarm A.
> If alarm A is raised by the Vitrage evaluator according to the action part
> of a scenario, its type is "vitrage". If Nagios reports alarm A later,
> a new alarm A with type "Nagios" is generated in the entity graph.
>   There would then be two vertices for the same alarm in the graph, and we
> have to define two alarm entities, two relationships, and two scenarios in
> the template file to make the alarm propagation procedure work.
>
>    This makes it inconvenient to describe the fault model of a system with
> a lot of alarms. How can we solve this problem?
>
>
>
> 殷力殷 YinLiYin
>
>
>
>
>
>
> 上海市浦东新区碧波路889号中兴研发大楼D502
> D502, ZTE Corporation R&D Center, 889# Bibo Road,
> Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203
> T: +86 21 68896229
> M: +86 13641895907
> E: yinliyin at zte.com.cn
> www.zte.com.cn
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>