[openstack-dev] [AODH] event-alarm timeout discussion

gordon chung gord at live.ca
Wed Sep 21 20:01:12 UTC 2016



On 21/09/16 01:43 AM, Zhai, Edwin wrote:
> All,
>
> I'd like to clarify the event-alarm timeout design, as many of you
> seem to have misunderstood it. Please correct me if I've made any
> mistakes.
>
> I realized that there are 2 different things, but we sometimes mix them up:
> 1. event-timeout-alarm
> This is a new type of alarm that brackets *.start and *.end events and
> fires when it receives *.start but no *.end within the timeout. This new
> alarm handles one type of event/action, e.g. create one alarm for instance
> creation, and all instances created in the future will be handled by this
> alarm. This is not real time, so it's acceptable for the user to learn of
> an instance creation failure 5 mins later.
>
> This new type of alarm can be implemented by one worker that checks the DB
> periodically and does the statistics work. That is, the new evaluator works
> in 'polling' mode, something like the threshold alarm evaluator.
>
> One BP is @
> https://review.openstack.org/#/c/199005/
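
for reference, the 'polling' check described in point 1 could be sketched roughly as below. this is only an illustration under assumptions: the `Event` record, the `find_timed_out` helper and the in-memory event list are hypothetical stand-ins for whatever the worker would actually read from the DB, and it is not taken from the BP.

```python
from collections import namedtuple

# Hypothetical event record; real stored events carry more fields.
Event = namedtuple("Event", ["event_type", "resource_id", "timestamp"])


def find_timed_out(events, prefix, timeout, now):
    """Return resource ids that saw <prefix>.start but no <prefix>.end.

    Assumption: an .end that arrives late still counts as finished;
    only resources with no .end at all are reported.
    """
    started = {}
    ended = set()
    for ev in events:
        if ev.event_type == prefix + ".start":
            # Keep the earliest start seen for each resource.
            started[ev.resource_id] = min(
                ev.timestamp, started.get(ev.resource_id, ev.timestamp))
        elif ev.event_type == prefix + ".end":
            ended.add(ev.resource_id)
    return sorted(
        rid for rid, ts in started.items()
        if rid not in ended and now - ts > timeout
    )
```

a periodic worker would call something like this on each pass, so a failure is noticed no sooner than the next polling interval.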

we should probably disregard this bp since it was assumed you guys 
talked over it. i'm abandoning it as i think we just forgot about it.

>
> 2. event-alarm timeout
> This is a new feature for the _existing_ event-alarm evaluator. An alarm
> becomes 'UNALARM' when the desired event is not received within the
> timeout. This feature handles just one specific event, e.g. create one
> alarm for instance ABC's XYZ operation with a 5s timeout, and the user is
> notified as soon as 5s pass with no XYZ.done event. To check another
> instance, we need to create another alarm.
>
> This is used in telco scenarios, where the operator wants to know about an
> operation failure in real time.
>
> My patch (https://review.openstack.org/#/c/272028/) is for this purpose
> only, but I feel many people (sometimes even me) confuse the two, as they
> look similar. So my question is: do you think this telco usage model of
> event-alarm timeout is valid? If not, we can skip discussing its
> implementation and ignore the following.
>
>
> =========== event-alarm timeout implementation =============
> As it's for event-alarm, we need to keep it event-driven. Furthermore,
> for quick response, we need to use an event for timeout handling. A
> periodic worker can't meet the real-time requirement.
>
> A separate queue for 'alarm.timeout.end' (which indicates that the timeout
> expired) leads to a tricky race condition, e.g. 'XYZ.done' arrives in
> queue1 and 'alarm.timeout.end' arrives in queue2, so they are handled in
> parallel:
>
> 1. In queue1, 'XYZ.done' is checked against the alarm (currently UNKNOWN),
> which will be set to ALARM in the next step.
> 2. In queue2, 'alarm.timeout.end' is checked against the same alarm
> (currently UNKNOWN), which will be set to OK (UNALARM) in the next step.
> 3. In queue1, the alarm transition happens: UNKNOWN => ALARM
> 4. In queue2, another alarm transition happens: ALARM => OK (UNALARM)
>


can you clarify how this works? after the user creates an event timeout
alarm definition through the API (i assume the alarm definition specifies
we should see event x within y seconds):
- how does the evaluator get this alarm definition? is there an
alarm.timeout.start message?
- what is this UNALARM state? to be honest, that isn't a real word so i
don't know what it's supposed to represent here.

the biggest problem for me is that the only thing i know is there's an
alarm.timeout.end event that needs to be handled by the evaluator. i don't
know where it's coming from or what it's needed for.


> So this alarm has a bogus transition: UNKNOWN=>ALARM=>UNALARM, which tells
> the user: the required event came, then no required event came.
>
> If we put all events in one queue, the evaluator handles them one by one
> (low-level oslo.messaging should be multi-threaded), so the second event
> would see the alarm state as not UNKNOWN and give up its transition. As
> Gordc said, it's slow. But only a very small part of the event-alarms need
> timeout handling, as it's only for the telco usage model.

so the multithreaded part is what i was talking about. it's not handling
them one by one. it's handling 64 (or whatever the default is) at any
given time. whether it's one queue or two, you have a race to handle.
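
to make the race concrete, here is a minimal sketch of one way to handle it: a conditional transition that only succeeds while the alarm is still UNKNOWN, so whichever handler runs second gives up. this is an assumption-laden illustration, not the patch's code: the `Alarm` class, `transition_if` and the two handler functions are hypothetical names, and the in-process lock stands in for what would have to be an atomic conditional update in the shared alarm store when several evaluators are running.

```python
import threading

UNKNOWN, ALARM, OK = "unknown", "alarm", "ok"


class Alarm:
    """Hypothetical in-memory alarm with a guarded state transition."""

    def __init__(self):
        self.state = UNKNOWN
        self.history = [UNKNOWN]
        self._lock = threading.Lock()

    def transition_if(self, expected, new_state):
        """Move to new_state only if the alarm is still in `expected`."""
        with self._lock:
            if self.state != expected:
                return False  # someone else already moved the alarm
            self.state = new_state
            self.history.append(new_state)
            return True


def on_event_received(alarm):
    # Handler for the awaited event (e.g. 'XYZ.done').
    return alarm.transition_if(UNKNOWN, ALARM)


def on_timeout_expired(alarm):
    # Handler for 'alarm.timeout.end'.
    return alarm.transition_if(UNKNOWN, OK)
```

with this guard the check and the state change happen as one step, so the bogus UNKNOWN => ALARM => OK sequence cannot occur no matter how many handlers run at once or how many queues feed them.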

>
> One possible improvement, as JD pointed out, is to avoid spawning so many
> threads. We can create just one thread inside the evaluator and have this
> thread handle all timeout requests from the evaluator. Is that acceptable
> for the event-alarm timeout solution?
>
>
> Best Rgds,
> Edwin
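
the single-thread idea quoted above could look something like the sketch below: one worker thread that keeps pending timeouts in a heap and fires each callback when its deadline passes. everything here (`TimeoutWorker`, `schedule`, `cancel`, `stop`) is a hypothetical name for illustration only, and it assumes plain python threads rather than whatever executor the evaluator actually runs under.

```python
import heapq
import itertools
import threading
import time


class TimeoutWorker:
    """One thread that serves every timeout request (illustrative only)."""

    def __init__(self):
        self._heap = []            # entries: (deadline, token, callback)
        self._pending = set()      # tokens that have not fired or been cancelled
        self._seq = itertools.count()
        self._cond = threading.Condition()
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def schedule(self, delay, callback):
        """Ask for callback() to run after `delay` seconds; returns a token."""
        with self._cond:
            token = next(self._seq)
            deadline = time.monotonic() + delay
            heapq.heappush(self._heap, (deadline, token, callback))
            self._pending.add(token)
            self._cond.notify()    # wake the worker; the earliest deadline may have changed
        return token

    def cancel(self, token):
        """Cancel a timeout, e.g. because the awaited event arrived."""
        with self._cond:
            self._pending.discard(token)

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._stopped:
                    if not self._heap:
                        self._cond.wait()
                        continue
                    remaining = self._heap[0][0] - time.monotonic()
                    if remaining <= 0:
                        break
                    self._cond.wait(remaining)
                if self._stopped:
                    return
                _, token, callback = heapq.heappop(self._heap)
                if token not in self._pending:
                    continue       # cancelled before its deadline
                self._pending.discard(token)
            callback()             # run outside the lock
```

the evaluator would call `schedule()` when a timeout alarm starts waiting and `cancel()` when the expected event shows up. note that a cancel can still lose the race against a callback that is already firing, so the callback needs its own guard on the alarm state.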

-- 
gord


