Open Stack

Thu Oct 25 18:44:35 UTC 2012

Thanks for the response June Yi,

> I could be wrong, but I guess that AWS CW might evaluate in batch
> processing so that it could introduce unexpected leeway.

Well, one can do a bit of experimentation with some CW alarms
based on custom metrics, and see that state transitions to ALARM
are generally quite timely (e.g. they occur within at most a minute
of pushing the last datapoint required to make the threshold
condition true).

Similarly, transitions to OK can be seen to occur soon after the
first datapoint that would make the threshold condition false.

Transitions into INSUFFICIENT_DATA, by contrast, are only seen after
a significant lag, longer than [eval periods * period length] mins
without any datapoints whatsoever.

The opposite is true for transitions out of INSUFFICIENT_DATA,
which can be observed to occur eagerly, on the first datapoint
encountered after a gap.

These observations don't suggest the lag is induced by batching up
evaluations, as batching would delay every state transition alike
and the lag would be symmetric. Rather it seems more like deliberate
sensitivity tuning to avoid flapping and to minimize the length of
time any monitor spends in the unknown state.
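
To make the asymmetry concrete, here is a rough sketch in Python of an
evaluator that is eager for every transition except the one into
INSUFFICIENT_DATA. This is only my reading of the observed behavior, not
how CW or Synaps actually implement it. The function name, the lag_secs
parameter, the "value > threshold" comparison and the handling of a
partially filled window are all assumptions made for illustration.

```python
OK, ALARM, INSUFFICIENT_DATA = "OK", "ALARM", "INSUFFICIENT_DATA"


def next_state(state, datapoints, now, threshold, eval_periods,
               period_secs, lag_secs):
    """Return the alarm state after an evaluation at time `now`.

    datapoints: list of (timestamp_secs, value) tuples, oldest first,
    at most one (already aggregated) value per period.
    lag_secs: hypothetical extra leeway before giving up on the metric.
    """
    window = eval_periods * period_secs
    recent = [v for (t, v) in datapoints if now - t < window]

    if not recent:
        # Lazy transition: only declare INSUFFICIENT_DATA once the gap
        # exceeds the evaluation window plus the extra lag.
        if not datapoints or now - datapoints[-1][0] > window + lag_secs:
            return INSUFFICIENT_DATA
        return state  # still inside the leeway, hold the previous state

    # Eager transition to OK: one non-breaching datapoint is enough.
    # Assumption: "breaching" means value > threshold.
    if any(v <= threshold for v in recent):
        return OK

    # Eager transition to ALARM: every required period is breaching.
    if len(recent) >= eval_periods:
        return ALARM

    # Some breaching datapoints, but fewer than required. Assumption:
    # leave INSUFFICIENT_DATA eagerly (to OK), otherwise hold the state.
    return OK if state == INSUFFICIENT_DATA else state
```

With a 60 second period, 3 evaluation periods and a 120 second lag, an
alarm fed breaching datapoints at t=0, 60 and 120 goes to ALARM at t=120,
holds ALARM through a gap of up to 300 seconds, and drops straight back
to OK on the first non-breaching datapoint.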

> I prefer strict interpretation of INSUFFICIENT_DATA so that we can
> predict behavior of the system. For large production deployments,
> they should use long enough period of alarm.

I guess the thing to remember is that this is a user-facing API,
so in a production deployment we must expect and tolerate a mix
of different alarming strategies ... i.e. the cloud provider can't
dictate that users prefer monitors with long periods over short
(and/or many evaluation periods over few).

If Synaps is on a hair-trigger with regard to INSUFFICIENT_DATA
transitions, it will add to its own notification workload when
short term metric-stream delays occur (possibly making it harder
to recover from the original problem).

Also the behavior can still be predictable, as long as the leeway
is clearly quantified.
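
For instance, the leeway could be expressed as a fixed number of extra
periods, so that anyone can work out in advance when a silent metric
will tip a monitor into INSUFFICIENT_DATA. A minimal sketch, where the
function name and the default of 2 extra periods are purely hypothetical
and not taken from CW or Synaps:

```python
def insufficient_data_deadline(last_datapoint_secs, eval_periods,
                               period_secs, leeway_periods=2):
    """Earliest time at which a monitor with no new datapoints may be
    moved to INSUFFICIENT_DATA.

    leeway_periods is a hypothetical, documented allowance on top of
    the strict [eval periods * period length] window.
    """
    return last_datapoint_secs + (eval_periods + leeway_periods) * period_secs
```

So a monitor with 3 evaluation periods of 60 seconds whose last
datapoint arrived at t=1000 would not flip before t=1300, rather than
t=1180 under the strict interpretation.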

> And, I agree that current Synaps concept of evaluation period is
> useful but making confusion. So I'll start to make it align with AWS
> CW's concept. And switching to out-of-stream evaluation can make
> Synaps evaluate lesser and will be helpful to reduce flapping.

Would you be interested in making those changes collaboratively?

It would be a good learning experience for me, and also perhaps a
useful exercise in community engagement for you guys.

Just a thought, let me know.

Cheers,
Eoghan

[openstack-dev] [Synaps] a more forgiving interpretation of INSUFFICIENT_DATA?
