[openstack-dev] [ceilometer][aodh][vitrage] Raising custom alarms in AODH

Ryota Mibu r-mibu at cq.jp.nec.com
Fri Dec 4 07:42:20 UTC 2015


Hi Ifat,


> > > Let me see if I got this right: are you suggesting that we create
> > > on-the-fly alarm definitions with no alarm_actions, for every
> > > deduced alarm that we want to raise? And this will spare us the
> > > extra alarm evaluation in AODH?
> >
> > Yes. But please note that this could be only the first step. The next
> > step would be to make vitrage send out an alarm event to
> > ceilometer/aodh; the pre-configured event alarm will recognize the
> > event and fire the alarm notification to another service or an end
> > user. Eventually, we should have a relevant alarm type and evaluator
> > to proxy evaluation in vitrage, I think.
> 
> The next step can happen if and when Aodh supports alarm templates.
> If Vitrage can handle about 30 alarm types, and there are 100 instances, we don't want to pre-configure 3000 alarms,
> which most likely will never be triggered.


I understand your concern. Aodh is a user-facing service, so pre-configuring a large number of alarms doesn't make sense.

Can we clarify the use case again in terms of each service's role?

Aodh provides an alarming mechanism to *notify* users of events and situations calculated from various data sources. However, the original/master information about a resource, including its latest state, is owned by other services such as nova.

So, a user who wants to know the current resource state in order to find dead resources (instances) can simply query instances via the nova API. And a user who wants to know when a failure occurred and what it was can query events via the ceilometer API. Aodh does hold alarm state and history, though.



> > > Another question is our need to get alarms from other sources, like
> > > Nagios, zabbix, ganglia, etc. We thought that Vitrage would query
> > > these Alarms from each source directly, and then create alarms in
> > > AODH in the same way as our deduced alarms: for example create
> > > nagios_ovs_vswitchd alarm if nagios check_ovs_vswitchd test failed.
> > > An alternative could be to integrate nagios directly with AODH.
> > > What do you think?
> >
> > Hmm, I don't have a clear view on this. If the source can include
> > OpenStack IDs and can generate relevant meters/samples, it should be
> > useful to integrate it with ceilometer. But if you want to do some
> > operations (like correlation), then it is reasonable to integrate
> > with vitrage.
> 
> The source may include alarms on resources that are not defined in OpenStack, like switches or ports. And the alarms
> are not necessarily related to meters; they can be nagios test failures, for example.


Yes, so it depends on the type of resource and its parameters.



> > > > BTW, is it useful to have on-the-fly evaluation of combination
> > > > alarm with event alarms for alarm aggregation or other cases?
> > >
> > > I'm not sure I understand. Can you give a detailed example?
> >
> > OK. The 'combination' type alarm enables you to aggregate multiple
> > alarms into one alarm. This can be used when you want to receive an
> > alarm when both physical NIC ports are down, to recognize that the
> > logical connection is unavailable if the ports are teamed for
> > redundancy. Currently, combination alarms are evaluated periodically,
> > which means that you cannot receive a combination alarm on-the-fly
> > even though you are using event alarms as its sources.
> 
> I think I understand your point. It means that certain alarms will arrive at Vitrage with a delay, due to your evaluation
> policy. I think we will have to address this issue at some point, but it won't change our overall design.

Yes, I'm just curious whether any users would benefit from this improvement, so that we can set its priority.
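
To make the teamed-NIC example above concrete, here is a minimal sketch of the difference between periodic and on-the-fly evaluation of a combination alarm. It is only an illustration: the class, its methods and the alarm names (bond0_down, eth0_down, eth1_down) are hypothetical and are not Aodh code or Aodh APIs.

```python
# Hypothetical sketch; none of these names are Aodh APIs.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CombinationAlarm:
    """Fires only when every source alarm is in the 'alarm' state (an 'and' rule)."""
    name: str
    sources: Dict[str, str] = field(default_factory=dict)  # source alarm -> state
    state: str = "ok"

    def evaluate(self) -> str:
        """Recompute the combined state from the current source states."""
        all_down = bool(self.sources) and all(
            s == "alarm" for s in self.sources.values()
        )
        self.state = "alarm" if all_down else "ok"
        return self.state

    def on_source_change(self, source: str, new_state: str) -> str:
        """On-the-fly path: re-evaluate as soon as a source alarm changes."""
        self.sources[source] = new_state
        return self.evaluate()


# On-the-fly: the combined state follows each source transition immediately.
link = CombinationAlarm("bond0_down", {"eth0_down": "ok", "eth1_down": "ok"})
first = link.on_source_change("eth0_down", "alarm")   # one port down
second = link.on_source_change("eth1_down", "alarm")  # both ports down

# Periodic: the sources change, but the combined state stays stale
# until the next evaluation cycle calls evaluate().
periodic = CombinationAlarm("bond0_down", {"eth0_down": "ok", "eth1_down": "ok"})
periodic.sources["eth0_down"] = "alarm"
periodic.sources["eth1_down"] = "alarm"
stale = periodic.state      # not yet re-evaluated
fresh = periodic.evaluate()  # what the next periodic tick would compute
```

In the periodic case the gap between `stale` and `fresh` is the delay discussed above; how large it is depends on the evaluation interval.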



> > > In addition, in Vitrage we plan to handle alarm aggregation by
> > > creating aggregation rule templates, for example based on the RCA
> > > information.
> > > The user will be able to see only the root cause alarms, and then
> > > drill down to all specific alarms. But I doubt if this will be done
> > > for Mitaka.
> >
> > I think 'the RCA information' means the information used for RCA. I
> > mean that vitrage will use the resource topology or relationships in
> > aggregation, rather than the result of RCA. Am I right?
> 
> The term "aggregation" is used in different contexts, which may be confusing. Our plan is to examine the already-computed
> RCA information, and see, for example, that a switch failure alarm caused alarms on 100 related instances. In horizon,
> the result will be 101 alarms shown to the user in a flat list.
> By "alarm aggregation based on RCA" we mean that we will have an API to get root cause alarms, which will return only
> the switch alarm. The horizon user will see one alarm, and may then ask to expand the view and see all the other alarms
> that were caused by it.

I see. I was using the term "aggregation" to mean the aggregation step in alarm evaluation.
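
For reference, the root-cause view described in the quoted text (one switch alarm that caused alarms on 100 instances) could be sketched roughly as below. This is only an illustration under assumptions: the `caused_by` mapping, the alarm IDs and both function names are hypothetical and do not reflect the Vitrage API.

```python
# Hypothetical sketch; the data layout and function names are assumptions,
# not the Vitrage API.
from typing import Dict, List, Optional

# alarm id -> id of the alarm that caused it (None means it is a root cause)
caused_by: Dict[str, Optional[str]] = {"switch-1-failure": None}
caused_by.update(
    {f"instance-{i}-unreachable": "switch-1-failure" for i in range(100)}
)


def root_cause_alarms(causes: Dict[str, Optional[str]]) -> List[str]:
    """Return only the alarms that were not caused by another alarm."""
    return [alarm for alarm, parent in causes.items() if parent is None]


def expand(causes: Dict[str, Optional[str]], root: str) -> List[str]:
    """Return the alarms directly caused by the given root cause alarm."""
    return [alarm for alarm, parent in causes.items() if parent == root]


flat_view = list(caused_by)                    # all alarms in a flat list
roots = root_cause_alarms(caused_by)           # the aggregated view
details = expand(caused_by, "switch-1-failure")  # the drill-down view
```

Here `flat_view` corresponds to the flat list of 101 alarms, `roots` to the single switch alarm, and `details` to the expanded view of the alarms it caused.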



Thanks,
Ryota



