[openstack-dev] [AODH] event-alarm timeout discussion
Zhai, Edwin
edwin.zhai at intel.com
Thu Sep 22 06:40:26 UTC 2016
Gordon,
Thanks for your comments.
Please see my answers and the flow chart below.
On Wed, 21 Sep 2016, gordon chung wrote:
>>
>> =========== event-alarm timeout implementation =============
>> As it's for event-alarm, we need to keep it event-driven. Furthermore,
>> for quick response, we need to use an event for timeout handling. A periodic
>> worker can't meet the real-time requirement.
>>
>> A separate queue for 'alarm.timeout.end' (which indicates that the timeout
>> expired) leads to a tricky race condition. E.g. 'XYZ.done' comes in on queue1
>> and 'alarm.timeout.end' comes in on queue2, so they are handled in
>> parallel:
>>
>> 1. In queue1, 'XYZ.done' is checked against the alarm (currently UNKNOWN), and
>> the alarm will be set to ALARM in the next step.
>> 2. In queue2, 'alarm.timeout.end' is checked against the same alarm (currently
>> UNKNOWN), and the alarm will be set to OK (UNALARM) in the next step.
>> 3. In queue1, the alarm transition happens: UNKNOWN => ALARM
>> 4. In queue2, another alarm transition happens: ALARM => OK (UNALARM)
>>
>
>
> can you clarify how this works? after user creates event timeout alarm
> definition through API (i assume the alarm definition specify we should
> see event x within y seconds).
> - how does the evaluator get this alarm definition? is there an
> alarm.timeout.start message?
Yes.
> - what is this UNALARM state? to be honest, that isn't a real word so i
> don't know what it's supposed to represent here.
It's OK, meaning we have enough data to say that this alarm should not be
triggered. Since OK has been mistaken for ALARM before, I marked it as UN-ALARM.
>
> biggest problem for me is the only thing i know is there's a
> alarm.timeout.end event that needs to be handled by evaluator. i don't
> know where it's coming from or what it's needed for.
I attached a flow chart at the bottom; please check it. 'timeout.end' and event
'X' arrive by different paths, so it is best if the evaluator does not touch the
next one until the previous one has been handled.
>
>
>> So this alarm has a bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells
>> the user: the required event came, then no required event came;
>>
>> If all events are put in one queue, the evaluator handles them one by one (the
>> low-level oslo messaging layer should be multi-threaded), so the second event
>> would see the alarm state as not UNKNOWN and give up its transition. As Gordc
>> said, it's slow. But only a very small part of the event-alarms need timeout
>> handling, as it's only for the telco usage model.
>
> so the multithreaded part is what i was talking about. it's not handling
> them one by one. it's handling 64 (or whatever the default is) at any
> given time. whether it's one queue or two, you have a race to handle.
See https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158
evaluate_events is the handler of the endpoint for 'alarm.all'. It iterates over
the event list and evaluates the events one by one against the project's alarms.
If both 'timeout.end' and 'X' are in the event list, I assume they are handled
in sequence, in different iterations of the for loop. Am I right?
If we had evaluate_timeout_events as the handler of another endpoint for
'alarm.timeout', then the 2 handlers could run concurrently and lead to a race
condition. I'm not familiar with the underlying oslo notifications, and think a
separate queue is a different story. Please correct me if I'm wrong.
    for e in events:
        try:
            event = Event(e)
        ......
        for id, alarm in six.iteritems(
                self._get_project_alarms(event.project)):
            try:
                self._evaluate_alarm(alarm, event)
            ...
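
To make the two cases concrete, here is a minimal sketch in plain Python. It is
not aodh code: the Alarm class, transition_if_unknown, the simplified
evaluate_events and racy_interleaving are all hypothetical names made up for
illustration. racy_interleaving replays steps 1-4 of the two-queue race by hand,
while evaluate_events handles the events strictly one by one, so the second
event sees a state that is no longer UNKNOWN and gives up its transition.

```python
import threading

UNKNOWN, ALARM, OK = "unknown", "alarm", "ok"


class Alarm:
    """Hypothetical stand-in for an event alarm with a timeout (not aodh's model)."""

    def __init__(self):
        self.state = UNKNOWN
        self.history = [UNKNOWN]
        self._lock = threading.Lock()

    def transition_if_unknown(self, new_state):
        """Check and set in one step: move only if the alarm is still UNKNOWN."""
        with self._lock:
            if self.state != UNKNOWN:
                return False  # an earlier event already decided the state
            self.state = new_state
            self.history.append(new_state)
            return True


def evaluate_events(alarm, events):
    """Handle events strictly one by one, as a single for loop would."""
    for event_type in events:
        if event_type == "alarm.timeout.end":
            alarm.transition_if_unknown(OK)
        elif event_type == "XYZ.done":
            alarm.transition_if_unknown(ALARM)
    return alarm.history


def racy_interleaving():
    """Replay steps 1-4 of the two-queue race by hand (no real threads)."""
    state = UNKNOWN
    history = [state]
    seen_by_queue1 = state  # step 1: 'XYZ.done' checks the alarm
    seen_by_queue2 = state  # step 2: 'alarm.timeout.end' checks the alarm
    if seen_by_queue1 == UNKNOWN:  # step 3: UNKNOWN => ALARM
        state = ALARM
        history.append(state)
    if seen_by_queue2 == UNKNOWN:  # step 4: stale check, ALARM => OK
        state = OK
        history.append(state)
    return history
```

The lock in transition_if_unknown only protects handlers running in the same
process; with several evaluator processes the check-and-set would have to be
made atomic somewhere shared, which this sketch does not attempt.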
================================================================================
+----------+ +------------+ +------------------+ +------------+ +-----------+ +------------+
| User | | API server | | Notification bus | | Evaluator | | Threads | | Alarm state|
+--+-------+ +-----+------+ +--------+---------+ +-----+------+ +--------+--+ +------+-----+
| | | | | |
+---------------> | | | | |
| +-------------+ | | | | |
| |Alarm create | | | | | +-----------+
| |event: X | | | | | | UNKNOWN |
| |timeout: 5s | | | | | +-----------+
| +-------------+ | | | | |
| +-----------------------------------------> | | |
| | +-----------------+ | | | |
| | |Event sent: | | | | |
|                 |         |timeout.start    |             |                        |                    |
| | +-----------------+ | | | |
| | | +--------------------> | |
| | | | +----------+ | |
| | | | | create | | |
| | | | +----------+ | |
| | | | +-----------+ |
| | | | |Sleep 10s | |
| | | | +-----------+ |
| | | | | |
| | | | <--------------------+ |
| | | | +-----------------+ | |
| | | | |1 - Event sent: | | |
| | | | |timeout.end | | |
| | | | +-----------------+ | |
| | | | | |
| | | +-----------------------------------------> |
| | | | +------------------+ | +--------+
| | | | |Transition: | | | OK |
| | | | |==>> OK | | +--------+
| | | | +------------------+ | |
| | +-------------------> | | |
| | | +---------------+ | | |
| | | |2 - Event come:| | | |
| | | |X | | | |
| | | +---------------+ | +------------------+ | |
| | | | |No transition: | | |
+ + + + |Already OK | + +
+------------------+
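
The flow in the chart can be sketched roughly as below. This is only an
illustration under assumptions: a queue.Queue stands in for the notification
bus, threading.Timer stands in for the sleeping thread, and start_timeout and
run_evaluator are hypothetical helpers, not aodh or oslo APIs. The event names
follow the ones used in this thread.

```python
import queue
import threading


def start_timeout(bus, alarm_id, timeout_seconds):
    """On 'timeout.start', arm a timer that later emits 'alarm.timeout.end'."""
    def emit_timeout_end():
        bus.put({"event_type": "alarm.timeout.end", "alarm_id": alarm_id})

    timer = threading.Timer(timeout_seconds, emit_timeout_end)
    timer.daemon = True
    timer.start()
    return timer


def run_evaluator(bus, states, expected_event, max_events):
    """Consume events from a single queue, one at a time."""
    for _ in range(max_events):
        event = bus.get(timeout=5)
        alarm_id = event["alarm_id"]
        if states[alarm_id] != "unknown":
            continue  # no transition: the state was already decided
        if event["event_type"] == "alarm.timeout.end":
            states[alarm_id] = "ok"
        elif event["event_type"] == expected_event:
            states[alarm_id] = "alarm"
    return states
```

For example, if the timer fires first and event 'X' arrives afterwards, the
evaluator moves the alarm to "ok" on 'alarm.timeout.end' and then ignores 'X',
which matches the "No transition: Already OK" box in the chart.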