[openstack-dev] [AODH] event-alarm timeout discussion

Zhai, Edwin edwin.zhai at intel.com
Wed Sep 21 05:43:17 UTC 2016


All,

I'd like to clarify the event-alarm timeout design, as many of you seem to
have misunderstood it. Please correct me if I've made any mistakes.

I realized that there are 2 different things here, which we sometimes mix up:
1. event-timeout-alarm
This is a new type of alarm that brackets *.start and *.end events and fires
when it receives *.start but no *.end within the timeout. This new alarm
handles one type of event/action, e.g. create one alarm for instance creation,
and then all instances created in the future are handled by this alarm. It is
not meant for real time, so it's acceptable for the user to learn of an
instance creation failure within 5 mins.

This new type of alarm can be implemented with one worker that checks the DB
periodically to do the statistics work. That is, the new evaluator works in
'polling' mode, similar to the threshold alarm evaluator.
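
A rough sketch of what one polling pass could look like, assuming the events
have already been read from the DB into memory. The Event class and the
find_timed_out() helper are hypothetical names for illustration only, not
Aodh code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_type: str   # e.g. "compute.instance.create.start"
    resource_id: str  # e.g. an instance UUID
    timestamp: float  # seconds, on the same clock as `now` below


def find_timed_out(events, prefix, timeout, now):
    """Return ids that got <prefix>.start but no <prefix>.end within timeout."""
    started = {}
    ended = set()
    for ev in events:
        if ev.event_type == prefix + ".start":
            started[ev.resource_id] = ev.timestamp
        elif ev.event_type == prefix + ".end":
            ended.add(ev.resource_id)
    # A resource is reported only if its start is older than the timeout
    # and no matching end event has been seen.
    return sorted(
        rid for rid, ts in started.items()
        if rid not in ended and now - ts > timeout
    )
```

A periodic worker would call something like this once per interval, which is
why the detection delay can be as long as the polling period.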

One BP is at:
https://review.openstack.org/#/c/199005/

2. event-alarm timeout
This is a new feature for the _existing_ event-alarm evaluator. An alarm
becomes 'UNALARM' when the desired event is not received within the timeout.
This feature handles just one specific event, e.g. create one alarm for
instance ABC's XYZ operation with a 5s timeout, and the user is notified as
soon as the 5s elapse if no XYZ.done event arrives. To check another instance,
we need to create another alarm.

This is used in the telco scenario, where the operator wants to know about an
operation failure in real time.

My patch (https://review.openstack.org/#/c/272028/) is for this purpose only,
but I feel many people (sometimes even me) confuse the two, as they look
similar. So my question is: do you think this telco usage model of event-alarm
timeout is valid? If not, we can skip discussing its implementation and ignore
the following.


=========== event-alarm timeout implementation =============
As it's for event-alarm, we need to keep it event-driven. Furthermore, for
quick response, we need to use an event for timeout handling. A periodic
worker can't meet the real-time requirement.

A separate queue for 'alarm.timeout.end' (which indicates that the timeout has
expired) leads to a tricky race condition, e.g. 'XYZ.done' arrives in queue1
and 'alarm.timeout.end' arrives in queue2, so they are handled in parallel:

1. In queue1, 'XYZ.done' is checked against the alarm (currently UNKNOWN),
which will be set to ALARM in the next step.
2. In queue2, 'alarm.timeout.end' is checked against the same alarm (currently
UNKNOWN), which will be set to OK(UNALARM) in the next step.
3. In queue1, the alarm transition happens: UNKNOWN => ALARM
4. In queue2, another alarm transition happens: ALARM => OK(UNALARM)

So this alarm has a bogus transition: UNKNOWN => ALARM => UNALARM, which tells
the user: the required event came, then the required event did not come.
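
The four steps above can be replayed deterministically with a small sketch.
It assumes each handler is a check followed by a separate set, and uses
generators to force the interleaving. Alarm, handler() and run_interleaved()
are made-up names for illustration, not the real evaluator:

```python
class Alarm:
    def __init__(self):
        self.state = "UNKNOWN"
        self.history = ["UNKNOWN"]

    def set_state(self, new_state):
        self.state = new_state
        self.history.append(new_state)


def handler(alarm, new_state):
    """First next() performs the check, second next() applies the transition."""
    should_fire = alarm.state == "UNKNOWN"  # check against the current state
    yield
    if should_fire:                         # acts on a possibly stale check
        alarm.set_state(new_state)
    yield


def run_interleaved(alarm):
    q1 = handler(alarm, "ALARM")  # 'XYZ.done' handled on queue1
    q2 = handler(alarm, "OK")     # 'alarm.timeout.end' handled on queue2
    next(q1)  # step 1: queue1 sees UNKNOWN
    next(q2)  # step 2: queue2 also sees UNKNOWN
    next(q1)  # step 3: UNKNOWN => ALARM
    next(q2)  # step 4: ALARM => OK(UNALARM)
    return alarm.history
```

Here run_interleaved(Alarm()) returns ['UNKNOWN', 'ALARM', 'OK'], i.e. both
handlers passed the check before either one changed the state.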

If we put all events in one queue, the evaluator handles them one by one (the
low-level oslo.messaging layer should be multi-threaded), so the second event
would see the alarm state as not UNKNOWN and give up its transition. As Gordc
said, it's slow. But only a very small part of the event alarms need timeout
handling, as it's only for the telco usage model.
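
A minimal sketch of the single-queue idea, assuming one consumer that finishes
each event (check and set) before taking the next. evaluate_serially() is a
hypothetical name, and the state names follow the ones used in this mail:

```python
import queue


def evaluate_serially(event_types):
    """Handle events from one queue, one at a time, for a single alarm."""
    state = "UNKNOWN"
    history = [state]
    q = queue.Queue()
    for event_type in event_types:
        q.put(event_type)
    while not q.empty():
        event_type = q.get()
        if state != "UNKNOWN":
            continue  # alarm already decided: give up this transition
        state = "OK" if event_type == "alarm.timeout.end" else "ALARM"
        history.append(state)
    return history
```

Whichever of 'XYZ.done' and 'alarm.timeout.end' is dequeued first decides the
alarm, and the later one is dropped, so there is only one transition.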

One possible improvement, as JD pointed out, is to avoid spawning so many
threads. We can create just one thread inside the evaluator and have it handle
all timeout requests from the evaluator. Is this acceptable as an event-alarm
timeout solution?
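
To make the one-thread idea concrete, here is a rough sketch using a heap of
deadlines and a condition variable. TimeoutThread and its methods are
hypothetical and not part of the patch. The on_expire callback is an
assumption too; in the evaluator it could be the place where
'alarm.timeout.end' is put on the event queue:

```python
import heapq
import threading
import time


class TimeoutThread:
    """One thread that serves every timeout request (hypothetical helper)."""

    def __init__(self, on_expire):
        self._on_expire = on_expire  # called with alarm_id when it expires
        self._heap = []              # entries are (deadline, alarm_id)
        self._cancelled = set()
        self._cond = threading.Condition()
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def request(self, alarm_id, timeout):
        with self._cond:
            heapq.heappush(self._heap, (time.monotonic() + timeout, alarm_id))
            self._cond.notify()  # wake the thread to re-check the deadline

    def cancel(self, alarm_id):
        # Call this when the desired event arrived before the timeout.
        with self._cond:
            self._cancelled.add(alarm_id)

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._stopped:
                    if not self._heap:
                        self._cond.wait()
                        continue
                    delay = self._heap[0][0] - time.monotonic()
                    if delay <= 0:
                        break
                    self._cond.wait(delay)
                if self._stopped:
                    return
                _, alarm_id = heapq.heappop(self._heap)
                if alarm_id in self._cancelled:
                    self._cancelled.discard(alarm_id)
                    continue
            self._on_expire(alarm_id)  # run the callback outside the lock
```

The thread sleeps until the earliest deadline, so the number of threads stays
at one no matter how many alarms have a pending timeout.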


Best Rgds,
Edwin


