[openstack-dev] [vitrage] Feedback on ability to 'suppress' alarms by type and/or resource in Vitrage

Waines, Greg Greg.Waines at windriver.com
Fri Dec 1 13:45:50 UTC 2017


Hey,

I am interested in getting some feedback on a proposed blueprint for Vitrage.

BLUEPRINT:

TITLE: Add the ability to ‘suppress’ alarms by Alarm Type and/or Resource

When managing a cloud, there are situations where a particular alarm, or a set of alarms from a particular resource, occurs frequently but identifies issues that are not of concern, at least for the time being.  For example, new hardware that is in the process of being installed may raise alarms, or remote servers (e.g. NTP servers) may be unreliable and cause frequent connectivity alarms.  In these situations the irrelevant alarms clutter the alarm displays, and it would be helpful to be able to suppress them.

Suppressed alarms would not be shown in Active Alarm lists or Historical Alarm lists, and would not be included in alarm counts.
There would be a CLI option / Horizon button to enable viewing the alarms that are currently suppressed
(i.e. suppressed alarms would still be tracked; they just would not be displayed by default).
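
To make the intended behaviour concrete, here is a rough sketch of the filtering in Python. It is not Vitrage code: SuppressionRule, list_alarms, the dict keys "alarm_type" and "resource_id", and the sample values are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SuppressionRule:
    """Hypothetical rule. A field left as None matches any value,
    so a rule with both fields None would match every alarm."""
    alarm_type: Optional[str] = None
    resource_id: Optional[str] = None

    def matches(self, alarm: dict) -> bool:
        type_ok = self.alarm_type is None or alarm.get("alarm_type") == self.alarm_type
        resource_ok = self.resource_id is None or alarm.get("resource_id") == self.resource_id
        return type_ok and resource_ok


def is_suppressed(alarm: dict, rules: Iterable[SuppressionRule]) -> bool:
    # An alarm is suppressed as soon as any one rule matches it.
    return any(rule.matches(alarm) for rule in rules)


def list_alarms(alarms: Iterable[dict], rules: List[SuppressionRule],
                show_suppressed: bool = False) -> List[dict]:
    """Default view hides suppressed alarms; show_suppressed=True
    returns only the suppressed ones. Nothing is ever discarded."""
    return [a for a in alarms if is_suppressed(a, rules) == show_suppressed]


# Made-up sample data.
alarms = [
    {"id": "a1", "alarm_type": "ntp-connectivity", "resource_id": "host-1"},
    {"id": "a2", "alarm_type": "cpu-threshold", "resource_id": "host-2"},
    {"id": "a3", "alarm_type": "cpu-threshold", "resource_id": "host-1"},
]
rules = [
    SuppressionRule(alarm_type="ntp-connectivity"),  # by type, any resource
    SuppressionRule(resource_id="host-2"),           # by resource, any type
]

default_view = [a["id"] for a in list_alarms(alarms, rules)]
suppressed_view = [a["id"] for a in list_alarms(alarms, rules, show_suppressed=True)]
```

With these rules only a3 appears in the default list (and would be counted), while a1 and a2 show up only when suppressed alarms are explicitly requested.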

Thoughts on usefulness ?



Questions on how to implement this in Vitrage

·         from an end user’s point of view, alarms have the following fields

o    vitrage_id (uuid) - unique identifier of an instance of an alarm

o    vitrage_type (enum) - e.g. collectd, nagios, zabbix, vitrage, ... (really an identifier of the general entity reporting the alarm)

o    name (string) - the alarm description

o    vitrage_resource_type (enum) - e.g. nova.instance, nova.host, port, ...

o    vitrage_resource_id (uuid) - resource instance

o    vitrage_aggregated_severity

o    vitrage_operational_severity

o    update_timestamp

·         there definitely is a specific resource identifier (vitrage_resource_id), which makes it possible to suppress alarms from a particular resource

·         BUT there doesn’t seem to be a general alarm type field, i.e. one that would classify the type of problem that’s occurring, e.g.

o    communication failure with compute host

o    loss-of-signal on port of compute host

o    loss of connectivity with NTP Server

o    CPU Threshold exceeded on compute host

o    Memory Threshold exceeded on compute host

o    File System Threshold exceeded on compute host

o    etc.

·         ... which is the type/granularity of ‘Alarm Type’ that I would think users would want to suppress alarms on.

·         i.e. it seems like the ‘name’ field is a combination of this general Alarm Type and details of the particular alarm.

·         Any thoughts on adding a ‘vitrage_alarm_type’ field (enum or short string) to identify the general type of problem or alarm being reported, in order to address this?

o    could be an optional field

o    but we’d display in the alarm list

o    and we’d use it as the mechanism to suppress alarms by ‘type’
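
As a rough illustration of the proposal above, the sketch below models an alarm with the fields listed earlier plus the proposed optional vitrage_alarm_type. The Alarm class, the suppressed_by_type helper, and every sample value (ids, names, the "cpu-threshold" string) are assumptions for illustration only; vitrage_alarm_type does not exist in Vitrage today.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Alarm:
    # Fields listed in this mail (severities and timestamp omitted for brevity).
    vitrage_id: str
    vitrage_type: str
    name: str
    vitrage_resource_type: str
    vitrage_resource_id: str
    # Proposed field, NOT an existing Vitrage field; optional.
    vitrage_alarm_type: Optional[str] = None


def suppressed_by_type(alarm: Alarm, suppressed_types: Set[str]) -> bool:
    """An alarm that carries no alarm type is never suppressed by type."""
    return (alarm.vitrage_alarm_type is not None
            and alarm.vitrage_alarm_type in suppressed_types)


# Made-up alarms: a1 and a2 have different 'name' strings (type plus
# per-alarm details) but share one general alarm type.
alarms = [
    Alarm("a1", "zabbix", "CPU Threshold exceeded on compute-0: 97%",
          "nova.host", "h0", "cpu-threshold"),
    Alarm("a2", "zabbix", "CPU Threshold exceeded on compute-1: 91%",
          "nova.host", "h1", "cpu-threshold"),
    Alarm("a3", "nagios", "some alarm without a type",
          "nova.instance", "i0"),
]

hidden = [a.vitrage_id for a in alarms
          if suppressed_by_type(a, {"cpu-threshold"})]
```

Matching on the short type string rather than on ‘name’ means one suppression entry covers both CPU alarms, even though their names differ in the details.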

Let me know what you think.


Greg.






