Fw: [congress] Handling alarms that can be erroneous
AKHIL Jain
akhil.jain at india.nec.com
Mon Feb 25 03:09:36 UTC 2019
Hi all,
This discussion is about storing and managing old alarms, and about executing actions based on them.
In Congress, when a policy is created, its actions are executed based on both the data already present in the datasource tables and the data that arrives in those tables later.
Alarms raised by projects such as aodh and monasca are polled by Congress, and webhook notifications for alarms are also received and stored in Congress.
So there are two scenarios of policy execution in Congress. In the first, actions are executed on data that already existed before the policy was created. In the second, the policy is created first and actions are executed whenever data is received afterwards.
Executing on existing data can be harmful, because old alarms that are INVALID at present are still stored in Congress tables. A user can therefore trigger a FALSE action based on an invalid alarm, which can be very harmful to the environment.
There are several ways to tackle this, depending on how each OpenStack project handles alarms.
One possible solution: since an action needs to be taken immediately after an alarm is raised, store only those alarms that have a corresponding action or policy (one that will use the alarm). After the policy has executed on them, either discard those alarms or mark them with a field such as old, executed, etc. Or are there use cases that require old alarms?
Also, we need to give the operator the ability to delete rows in a Congress datasource table. This will not completely solve the issue, but it is still good functionality to have IMO.
The above solution, or any better one that comes out of the discussion, may change the current mechanism, which executes policy on both new and existing alarms, so that policy is executed only on new alarms.
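To make the "mark or discard after execution" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `AlarmTable`, the `executed` field, and `purge_executed` are hypothetical names, not existing Congress tables, columns, or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    alarm_id: str
    state: str               # e.g. "alarm" (active) or "ok"
    executed: bool = False   # hypothetical marker field, not a real Congress column


@dataclass
class AlarmTable:
    """Hypothetical in-memory stand-in for a datasource table."""
    rows: dict = field(default_factory=dict)

    def receive(self, alarm_id, state):
        # A newly received alarm always starts out unexecuted.
        self.rows[alarm_id] = Alarm(alarm_id, state)

    def pending(self):
        # Only active alarms that no policy has acted on yet.
        return [a for a in self.rows.values()
                if a.state == "alarm" and not a.executed]

    def run_policy(self, action):
        # Act on each pending alarm once, then mark it as executed.
        fired = []
        for alarm in self.pending():
            action(alarm)
            alarm.executed = True
            fired.append(alarm.alarm_id)
        return fired

    def purge_executed(self):
        # Operator-triggered cleanup of rows that were already acted on.
        before = len(self.rows)
        self.rows = {k: a for k, a in self.rows.items() if not a.executed}
        return before - len(self.rows)
```

With this shape, running the policy a second time does nothing for alarms that were already handled, and the operator can remove handled rows explicitly. Whether marking or deleting is preferable depends on whether any use case still needs the old rows.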
I have added the previous discussion below. The discussion from the Congress weekly IRC meeting can be found here:
http://eavesdrop.openstack.org/meetings/congressteammeeting/2019/congressteammeeting.2019-02-22-04.01.log.html
Thanks and regards,
Akhil
________________________________________
From: Eric K <ekcs.openstack at gmail.com>
Sent: Tuesday, February 19, 2019 11:04 AM
To: AKHIL Jain
Subject: Re: Congress Demo and Output
Thanks for the update!
Yes of course, if the created_at field is needed by an important use case then
please feel free to add it! A sample policy in the commit message would be
very helpful.
Regarding old alarms, I need a couple clarifications:
First, which categories of action execution are we concerned about?
1. Actions executed automatically by congress policy.
2. Actions executed automatically by another service getting data from
Congress.
3. Actions executed manually by operator based on data from Congress.
Second, let's clarify exactly what we mean by "old".
There are several categories I can think of:
1. Alarms which had been activated and then deactivated.
2. Alarms which had been activated and remain active, but it has been
some time since they first became active.
3. Alarms which had been activated and triggered some action, but the
alarm remains active because the action does not resolve the alarm.
4. Alarms which had been activated and triggered some action, and the
action is in the process of resolving the alarm, but in the meantime the
alarm remains active.
(1) should generally not show up in Congress as active in the push update
case, but there are failure scenarios in which an update to deactivate can
fail to reach Congress.
(2) seems to be the thing option 1.1 would get rid of. But I am not clear
what problems (2) causes. Why is it a bad idea to execute actions based on an
alarm that has been active for some time and remains active? An example
would help me =)
I can see (4) causing problems. But I'd like to work through an example to
understand more concretely. In simple cases, Congress policy action
execution behavior actually works well.
If we have a simple case like:
execute[action(1)] :- alarm(1)
Then action(1) is not going to be executed twice by congress because the
behavior is that Congress executes only the NEWLY COMPUTED actions.
If we have a more complex case like:
execute[action(1)] :- alarm(1)
execute[action(2)] :- alarm(1), alarm(2)
If alarm(1) activates first, triggering action(1), then alarm(2)
activates before alarm(1) deactivates, action(2) would be triggered
because it is newly computed. Whether we WANT it executed may depend on
the use case.
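The "newly computed" behavior described above can be modeled in a few lines of Python. This is only a sketch of the described behavior for the two example rules, not Congress's actual implementation; `derive_actions` and `step` are hypothetical helper names.

```python
def derive_actions(active_alarms):
    """Evaluate the two example rules against the set of active alarms."""
    actions = set()
    if 1 in active_alarms:            # execute[action(1)] :- alarm(1)
        actions.add("action(1)")
    if {1, 2} <= active_alarms:       # execute[action(2)] :- alarm(1), alarm(2)
        actions.add("action(2)")
    return actions


def step(previous_actions, active_alarms):
    """Return (actions to execute now, full set of derived actions)."""
    current = derive_actions(active_alarms)
    newly_computed = current - previous_actions   # only the new derivations run
    return newly_computed, current


derived = set()
first, derived = step(derived, {1})       # alarm(1) activates
second, derived = step(derived, {1})      # nothing changed since last step
third, derived = step(derived, {1, 2})    # alarm(2) activates, alarm(1) still active

print(first, second, third)
```

In this model action(1) runs once, is not repeated while alarm(1) stays active, and action(2) runs when alarm(2) arrives because it is a new derivation.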
And I'd also like to add option 1.3:
Add a new table in (say monasca) called latest_alarm, which is the same as
the current alarms table, except that it contains only the most recently
received active alarm. That way, the policies which must avoid using older
alarms can refer to the latest_alarm table. Whereas policies which would
consider all currently active alarms can refer to the alarms table.
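As a rough illustration of option 1.3, the sketch below derives a latest_alarm view from an alarms table in Python. The row keys `id`, `state` and `received_at`, and the `"alarm"` state value, are assumptions made for the example; the real monasca driver schema may differ.

```python
def latest_alarm(alarms):
    """Return a table holding only the most recently received active alarm.

    `alarms` is a list of dicts with hypothetical keys
    'id', 'state' and 'received_at' (any comparable timestamp).
    """
    active = [a for a in alarms if a["state"] == "alarm"]
    if not active:
        return []
    # Keep a single row: the active alarm with the newest timestamp.
    return [max(active, key=lambda a: a["received_at"])]


alarms = [
    {"id": "cpu-high", "state": "alarm", "received_at": 100},
    {"id": "disk-full", "state": "ok", "received_at": 150},
    {"id": "mem-high", "state": "alarm", "received_at": 120},
]
latest = latest_alarm(alarms)
```

Policies that must ignore older alarms would read `latest`, while policies that care about every active alarm would keep reading the full `alarms` table.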
Looking forward to more discussion!
On 2/17/19, 10:44 PM, "AKHIL Jain" <akhil.jain at india.nec.com> wrote:
>Hi Eric,
>
>Some questions came up while working on the FaultManagement use case,
>mainly these:
>1. Keeping old alarms can be very harmful: the operator can execute
>actions based on alarms that no longer exist or are no longer valid.
>2. Adding a created_at field in Nova servers table can be useful.
>
>So for the first question, there can be multiple options:
>1.1 Do not store alarms that have no policy created in Congress to
>execute on them.
>1.2 Add a field to the alarm that tells whether a policy has been
>executed using that row, and give the operator a command to delete such
>rows, or delete them automatically.
>
>For the 2nd question, please tell me that it's good to go and I will add it.
>
>Regards
>Akhil