[openstack-dev] [AODH] event-alarm timeout discussion
Zhai, Edwin
edwin.zhai at intel.com
Wed Sep 21 05:43:17 UTC 2016
All,
I'd like to clarify the event-alarm timeout design, as many of you seem to
have misunderstood it. Please correct me if I've made any mistakes.
I realized that there are 2 different things here, and we sometimes mix them up:
1. event-timeout-alarm
This is a new type of alarm that brackets *.start and *.end events and fires
when it receives *.start but no matching *.end within the timeout. One such
alarm handles a whole type of event/action, e.g. create one alarm for instance
creation, and all instances created in the future are handled by that alarm.
It is not meant for real time, so it's acceptable for the user to learn about an
instance creation failure up to 5 mins later.
This new type of alarm can be implemented by one worker that checks the DB
periodically and does the statistics work. That is, the new evaluator works in
'polling' mode, much like the threshold alarm evaluator.
One BP is at
https://review.openstack.org/#/c/199005/
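To make the polling idea concrete, here is a minimal sketch of the check such a
periodic worker could run on each pass. It is not taken from the BP: the Event
class, the resource_id field and find_timed_out() are hypothetical names, and
the events are assumed to have already been read from the DB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_type: str   # e.g. "compute.instance.create.start"
    resource_id: str  # hypothetical field identifying the instance
    timestamp: float  # seconds


def find_timed_out(events, prefix, timeout, now):
    """Return ids that got <prefix>.start but no <prefix>.end within timeout."""
    started = {}
    ended = set()
    for ev in events:
        if ev.event_type == prefix + ".start":
            started[ev.resource_id] = ev.timestamp
        elif ev.event_type == prefix + ".end":
            ended.add(ev.resource_id)
    # One alarm covers every resource that emits this type of event.
    return sorted(
        rid for rid, ts in started.items()
        if rid not in ended and now - ts >= timeout
    )
```

A single call covers all instances at once, which is why one alarm of this
type is enough, and why the result is only as fresh as the polling interval.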
2. event-alarm timeout
This is a new feature for the _existing_ event-alarm evaluator. An alarm becomes
'UNALARM' when it does not receive the desired event within the timeout. This
feature handles just one specific event, e.g. create one alarm for instance
ABC's XYZ operation with a 5s timeout, and the user is notified as soon as the
5s expire if no XYZ.done event has come. To check another instance, we need to
create another alarm.
This is used in the telco scenario, where the operator wants to know about an
operation failure in real time.
My patch (https://review.openstack.org/#/c/272028/) is for this purpose only,
but I feel many people confuse the two (sometimes even me), as they look
similar. So my question is: do you think this telco usage model of event-alarm
timeout is valid? If not, we can skip discussing its implementation and ignore
the following.
=========== event-alarm timeout implementation =============
As it's for event-alarm, we need to keep it event-driven. Furthermore, for quick
response, we need to use an event for timeout handling. A periodic worker can't
meet the real-time requirement.
A separate queue for 'alarm.timeout.end' (which indicates that the timeout has
expired) leads to a tricky race condition. E.g. 'XYZ.done' comes in on queue1
and 'alarm.timeout.end' comes in on queue2, so they are handled in parallel:
1. In queue1, 'XYZ.done' is checked against the alarm (currently UNKNOWN), which
will be set to ALARM in the next step.
2. In queue2, 'alarm.timeout.end' is checked against the same alarm (currently
UNKNOWN), which will be set to OK(UNALARM) in the next step.
3. In queue1, the alarm transition happens: UNKNOWN => ALARM
4. In queue2, another alarm transition happens: ALARM => OK(UNALARM)
So this alarm has a bogus transition, UNKNOWN => ALARM => UNALARM, which tells
the user that the required event came, and then that no required event came.
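The four steps above can be replayed deterministically with a small sketch.
This is not AODH code: Alarm, handler() and run_interleaved() are hypothetical,
and generators stand in for the two queue consumers so that the interleaving
can be driven by hand instead of depending on thread timing.

```python
class Alarm:
    def __init__(self):
        self.state = "UNKNOWN"
        self.history = ["UNKNOWN"]

    def set_state(self, new_state):
        self.state = new_state
        self.history.append(new_state)


def handler(alarm, new_state):
    """Check-then-act handler; the yield marks where the other queue runs."""
    should_transition = alarm.state == "UNKNOWN"  # check
    yield                                         # other handler runs here
    if should_transition:                         # act on a stale check
        alarm.set_state(new_state)


def run_interleaved():
    alarm = Alarm()
    queue1 = handler(alarm, "ALARM")  # handles 'XYZ.done'
    queue2 = handler(alarm, "OK")     # handles 'alarm.timeout.end'
    next(queue1)        # step 1: queue1 sees UNKNOWN
    next(queue2)        # step 2: queue2 also sees UNKNOWN
    next(queue1, None)  # step 3: UNKNOWN => ALARM
    next(queue2, None)  # step 4: ALARM => OK
    return alarm.history
```

Both handlers pass the UNKNOWN check before either one writes, so both
transitions are applied and the history ends up with three states.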
If we put all events in one queue, the evaluator handles them one by one (the
low-level oslo.messaging layer should be multi-threaded), so the second event
would see that the alarm state is no longer UNKNOWN and give up its transition.
As Gordc said, it's slow. But only a very small part of the event-alarms need
timeout handling, as it's only for the telco usage model.
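A minimal sketch of the single-queue behaviour, under the assumption that the
evaluator drains one queue in order and that only an UNKNOWN alarm may change
state. evaluate_serially() is a hypothetical helper, not the real evaluator.

```python
import queue


def evaluate_serially(event_types):
    """Process events one at a time; return the alarm's state history."""
    q = queue.Queue()
    for event_type in event_types:
        q.put(event_type)

    state = "UNKNOWN"
    history = [state]
    while not q.empty():
        event_type = q.get()
        if state != "UNKNOWN":
            continue  # alarm already decided: give up this transition
        # 'XYZ.done' means the desired event came; anything else here is
        # treated as the timeout event.
        state = "ALARM" if event_type == "XYZ.done" else "OK"
        history.append(state)
    return history
```

Whichever of the two events is dequeued first decides the alarm, and the later
one is dropped, so the bogus double transition cannot occur.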
One possible improvement, as JD pointed out, is to avoid spawning so many
threads. We can create just one thread inside the evaluator and ask it to handle
all timeout requests from the evaluator. Is that acceptable as the event-alarm
timeout solution?
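As a rough illustration of the one-thread idea (not the code in the patch), a
single worker can keep all pending deadlines in a heap and sleep until the
earliest one. TimeoutWorker, request(), cancel() and the on_timeout callback
are hypothetical names; in the evaluator the callback would presumably post
'alarm.timeout.end' to the same queue as the other events.

```python
import heapq
import itertools
import threading
import time


class TimeoutWorker:
    """One daemon thread that serves every timeout request."""

    def __init__(self, on_timeout):
        self._on_timeout = on_timeout  # called as on_timeout(alarm_id)
        self._heap = []                # entries: (deadline, token, alarm_id)
        self._pending = set()          # tokens not yet fired or cancelled
        self._tokens = itertools.count()
        self._cond = threading.Condition()
        threading.Thread(target=self._run, daemon=True).start()

    def request(self, alarm_id, timeout):
        """Register a timeout in seconds; return a token usable for cancel()."""
        with self._cond:
            token = next(self._tokens)
            self._pending.add(token)
            deadline = time.monotonic() + timeout
            heapq.heappush(self._heap, (deadline, token, alarm_id))
            self._cond.notify()  # wake the worker: earliest deadline may change
            return token

    def cancel(self, token):
        """Cancel a request, e.g. because the desired event arrived in time."""
        with self._cond:
            self._pending.discard(token)

    def _run(self):
        while True:
            with self._cond:
                while not self._heap:
                    self._cond.wait()
                deadline, token, alarm_id = self._heap[0]
                remaining = deadline - time.monotonic()
                if remaining > 0:
                    self._cond.wait(remaining)
                    continue  # re-check: a nearer deadline may have arrived
                heapq.heappop(self._heap)
                if token not in self._pending:
                    continue  # cancelled before it expired
                self._pending.discard(token)
            self._on_timeout(alarm_id)  # run the callback outside the lock
```

The number of threads stays at one regardless of how many alarms carry a
timeout; the cost is that a slow callback delays the timeouts queued behind it.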
Best Rgds,
Edwin