Fw: [congress] Handling alarms that can be erroneous
Hi all,

This discussion is about keeping, managing, and executing actions based on old alarms.

In Congress, when a policy is created, the corresponding actions are executed based both on data already existing in the datasource tables and on data received later. Alarms raised by projects like aodh and monasca are polled by Congress, and webhook notifications for alarms are also received and stored in Congress. So there are two scenarios of policy execution: first, execution based on data that already existed before the policy was created; second, execution at any time after the policy is created, once the data is received.

This can be harmful, because old alarms that are INVALID at present may still be stored in Congress tables. A user can then trigger a FALSE action based on such an invalid alarm, which can be very harmful to the environment.

To tackle this, there can be multiple approaches from the perspective of every OpenStack project handling alarms. One solution: since the action needs to be taken immediately after the alarm is raised, store only those alarms that have corresponding actions or policies (that will use the alarm), and after the policy has executed on them, either discard those alarms or mark them with some field like old, executed, etc. Or are there use cases that require old alarms?

Also, we need to give the operator the ability to delete rows in a Congress datasource table. This will not completely solve the issue, but it is still useful functionality to have, IMO.

The above solution, or any better one that comes out of the discussion, may change the currently followed mechanism from policy execution on both new and existing alarms to execution on new alarms only.

I have added the previous discussion below, and the discussion in the Congress weekly IRC meeting can be found here: http://eavesdrop.openstack.org/meetings/congressteammeeting/2019/congresstea...
Thanks and regards,
Akhil

________________________________________
From: Eric K <ekcs.openstack@gmail.com>
Sent: Tuesday, February 19, 2019 11:04 AM
To: AKHIL Jain
Subject: Re: Congress Demo and Output

Thanks for the update!

Yes of course, if the created_at field is needed by an important use case then please feel free to add it! A sample policy in the commit message would be very helpful.

Regarding old alarms, I need a couple of clarifications.

First, which categories of action executions are we concerned about?
1. Actions executed automatically by Congress policy.
2. Actions executed automatically by another service getting data from Congress.
3. Actions executed manually by the operator based on data from Congress.

Second, let's clarify exactly what we mean by "old". There are several categories I can think of:
1. Alarms which had been activated and then deactivated.
2. Alarms which had been activated and remain active, but it has been some time since they first became active.
3. Alarms which had been activated and triggered some action, but the alarm remains active because the action does not resolve the alarm.
4. Alarms which had been activated and triggered some action, and the action is in the process of resolving the alarm, but in the meantime the alarm remains active.

(1) should generally not show up in Congress as active in the push update case, but there are failure scenarios in which an update to deactivate can fail to reach Congress.

(2) seems to be the thing option 1.1 would get rid of. But I am not clear what problems (2) causes. Why is it a bad idea to execute actions based on an alarm that has been active for some time and remains active? An example would help me =)

I can see (4) causing problems. But I'd like to work through an example to understand more concretely. In simple cases, Congress policy action execution behavior actually works well.

If we have a simple case like:
execute[action(1)] :- alarm(1)
then action(1) is not going to be executed twice by Congress, because the behavior is that Congress executes only the NEWLY COMPUTED actions.

If we have a more complex case like:
execute[action(1)] :- alarm(1)
execute[action(2)] :- alarm(1), alarm(2)
If alarm(1) activates first, triggering action(1), and then alarm(2) activates before alarm(1) deactivates, action(2) would be triggered because it is newly computed. Whether we WANT it executed may depend on the use case.

And I'd also like to add option 1.3: add a new table in (say) monasca called latest_alarm, which is the same as the current alarms table, except that it contains only the most recently received active alarm. That way, the policies which must avoid using older alarms can refer to the latest_alarm table, whereas policies which would consider all currently active alarms can refer to the alarms table.

Looking forward to more discussion!

On 2/17/19, 10:44 PM, "AKHIL Jain" <akhil.jain@india.nec.com> wrote:
Hi Eric,
There are some questions raised while working on the FaultManagement use case, mainly these:
1. Keeping old alarms can be very harmful: the operator can execute actions based on alarms that no longer exist or are no longer valid.
2. Adding a created_at field in the Nova servers table can be useful.

So for the first question, there can be multiple options:
1.1 Do not store alarms that have no policy created in Congress to execute on them.
1.2 Add a field to the alarm that tells whether a policy has been executed using that row or not, and give the operator a command to delete such rows, or delete them automatically.
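A minimal Python sketch of what option 1.2 could look like, only to make the idea concrete. Everything here is hypothetical: the `AlarmRow` and `AlarmTable` classes, the `executed` field, the `purge_executed` operator command, and the state strings are illustrative names, not part of any Congress datasource driver.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmRow:
    alarm_id: str
    state: str              # illustrative values: "alarm" or "ok"
    executed: bool = False  # hypothetical field from option 1.2


@dataclass
class AlarmTable:
    rows: dict = field(default_factory=dict)

    def upsert(self, alarm_id, state):
        """Store or update an alarm; a newly stored alarm starts as not executed."""
        row = self.rows.get(alarm_id)
        if row is None:
            self.rows[alarm_id] = AlarmRow(alarm_id, state)
        else:
            row.state = state

    def pending(self):
        """Active alarms that no policy has acted on yet."""
        return [r.alarm_id for r in self.rows.values()
                if r.state == "alarm" and not r.executed]

    def mark_executed(self, alarm_id):
        """Record that a policy action has been run for this row."""
        self.rows[alarm_id].executed = True

    def purge_executed(self):
        """Hypothetical operator command: delete rows already used by a policy."""
        stale = [k for k, r in self.rows.items() if r.executed]
        for k in stale:
            del self.rows[k]
        return stale


table = AlarmTable()
table.upsert("a1", "alarm")
table.upsert("a2", "alarm")
table.mark_executed("a1")
print(table.pending())         # ['a2']
print(table.purge_executed())  # ['a1']
```

Whether the purge is triggered by the operator or run automatically is a separate choice; the sketch only shows that the flag makes both possible.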
For the 2nd question, please tell me that it's good to go and I will add it.

Regards,
Akhil
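The "newly computed actions" behavior described in Eric's message above can be modeled with a small Python sketch. This is a toy under stated assumptions, not Congress code: it hard-codes the two example rules, and it assumes "newly computed" means "derivable now but not derivable from the previous state". The names `derive_actions` and `Executor` are made up for illustration.

```python
def derive_actions(active_alarms):
    """Evaluate the two example rules against the current set of active alarms."""
    actions = set()
    # execute[action(1)] :- alarm(1)
    if 1 in active_alarms:
        actions.add("action(1)")
    # execute[action(2)] :- alarm(1), alarm(2)
    if 1 in active_alarms and 2 in active_alarms:
        actions.add("action(2)")
    return actions


class Executor:
    """Runs only the actions that were not derivable from the previous state."""

    def __init__(self):
        self.previous = set()
        self.log = []  # every action "executed" so far, in order

    def update(self, active_alarms):
        current = derive_actions(active_alarms)
        newly_computed = sorted(current - self.previous)
        self.log.extend(newly_computed)
        self.previous = current
        return newly_computed


ex = Executor()
print(ex.update({1}))     # ['action(1)']
print(ex.update({1}))     # []  alarm(1) still active, nothing new
print(ex.update({1, 2}))  # ['action(2)']
print(ex.update(set()))   # []  both alarms deactivated
print(ex.update({1}))     # ['action(1)']  reactivation is newly computed again
```

Under this model an alarm that stays active never re-triggers its action, but an alarm that deactivates and reactivates does, which is the kind of case worth checking against the desired behavior.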
On Sun, Feb 24, 2019 at 7:14 PM AKHIL Jain <akhil.jain@india.nec.com> wrote:
Hi all,
This discussion is about keeping, managing and executing actions based on old alarms.
In Congress, when the policy is created the corresponding actions are executed based on data already existing in datasource tables and on the data that is received later in Congress datasource tables. So the alarms raised by projects like aodh, monasca are polled by congress and even the webhook notifications for alarm are received and stored in congress. In Congress, there are two scenarios of policy execution. One, execution based on data already existing before the policy is created and second, policy is created and action is executed at any time after the data is received
Fundamentally, the current policy formalism is based on state. Policy is evaluated on the latest state, whether that state is formed before or after a policy is created. Based on the emphasis on order, it feels like perhaps what you're looking for is a change-based formalism, where policy is evaluated on the change to state? For example, a state-based policy may say: if it *is* raining, make sure umbrella is used. A change-based policy may say: if it *starts* raining, deploy umbrella. Generally speaking, state-based formalism leads to simpler and more robust policies, but change-based formalism allows for greater control. But the use of one formalism does not necessarily preclude the other.
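The umbrella example can be written out as a small Python sketch to show the difference between the two formalisms. This is only an illustration of the distinction; the function and class names are made up and nothing here reflects how Congress evaluates policy internally.

```python
def state_based(is_raining, umbrella_open):
    """State-based: look at the latest state and act whenever it violates the policy."""
    return "use umbrella" if is_raining and not umbrella_open else None


class ChangeBased:
    """Change-based: act only on the transition from not raining to raining."""

    def __init__(self):
        self.was_raining = False

    def observe(self, is_raining):
        started = is_raining and not self.was_raining
        self.was_raining = is_raining
        return "deploy umbrella" if started else None


observations = [False, True, True, False, True]
watcher = ChangeBased()
edge_actions = [watcher.observe(r) for r in observations]
print(edge_actions)
# [None, 'deploy umbrella', None, None, 'deploy umbrella']

# The state-based check keeps firing for as long as the state is wrong,
# so it recovers even if the "started raining" moment was missed.
print(state_based(is_raining=True, umbrella_open=False))  # use umbrella
```

The change-based version needs to remember the previous observation and acts exactly once per transition, which gives finer control but depends on seeing every change; the state-based version needs no history.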
Which can be harmful by keeping in mind that old alarms that are INVALID at present are still stored in Congress tables. So the user can trigger FALSE action based on that invalid alarm which can be very harmful to the environment.
Just to clarify for someone coming to the discussion: under normal operations, alarms which have become inactive are also accurately reflected in Congress. Of course, as with any distributed system, there are issues with delivery and latency and timing. So we want to make sure Congress offers the right facilities in its policy formalism to enable policy writers to write robust policies that avoid unintended behaviors. (More details in the discussion in the quoted emails.)
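One facility of this kind is the latest_alarm table proposed as option 1.3 in the quoted discussion. A rough Python sketch of how such a table could sit next to the existing alarms table follows. The `MonascaAlarms` class, the `receive` method, and the choice to clear latest_alarm when that alarm deactivates are all assumptions made for illustration; the proposal itself does not specify them.

```python
class MonascaAlarms:
    """Toy datasource exposing `alarms` and a hypothetical `latest_alarm` view."""

    def __init__(self):
        self.alarms = {}          # alarm_id -> state, all currently active alarms
        self.latest_alarm = None  # id of the most recently received active alarm

    def receive(self, alarm_id, state):
        """Apply one pushed or polled update (state strings are illustrative)."""
        if state == "ALARM":
            self.alarms[alarm_id] = state
            self.latest_alarm = alarm_id
        else:
            # Deactivation: drop the alarm from both views.
            self.alarms.pop(alarm_id, None)
            if self.latest_alarm == alarm_id:
                self.latest_alarm = None


ds = MonascaAlarms()
ds.receive("cpu_high", "ALARM")
ds.receive("disk_full", "ALARM")
print(sorted(ds.alarms))  # ['cpu_high', 'disk_full']
print(ds.latest_alarm)    # disk_full
ds.receive("disk_full", "OK")
print(sorted(ds.alarms))  # ['cpu_high']
print(ds.latest_alarm)    # None
```

A policy that must react only to fresh alarms would reference `latest_alarm`, while a policy that reasons about everything currently active would keep using `alarms`.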
To facilitate further discussion, I have begun an etherpad [1] to write out in more detail the cases to consider as well as the desired behaviors and potential solutions. Feel free to add/elaborate/correct the cases!

[1] https://etherpad.openstack.org/p/congress-exec-semantics-cases

On Mon, Feb 25, 2019 at 3:57 PM Eric K <ekcs.openstack@gmail.com> wrote: