[openstack-dev] [Synaps] potential false positives in monitor threshold evaluation

Eoghan Glynn eglynn at redhat.com
Mon Oct 22 19:28:40 UTC 2012



Hi Synaps Folks,

I wanted to get the ball rolling on community discussion by bringing
up an aspect of the Synaps monitor threshold evaluation approach that
I suspect may be problematic.

TL;DR: the monitor evaluation scheme may be susceptible to false
positives due to outlier datapoints being considered eagerly.

Of course I'm open to correction if I've misunderstood the code, or
if you've avoided this potential issue via some side-effect (e.g. by
truncating timestamps).

So, the issue is around the point in time when threshold evaluations
are triggered. IIUC your approach is what I would call "in-stream
evaluation", i.e. the receipt of a metric datapoint triggers the
immediate evaluation of the monitors associated with that metric.

Contrast with the approach taken by CloudWatch, where metric periods
are clamped to one-minute wall-clock boundaries. So in that case,
metrics received during the current minute are not available for
consideration until at least the start of the next wall-clock minute.
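
To make the distinction concrete, here is a rough sketch of the
clamped, out-of-stream style (the names are purely illustrative and
not taken from the Synaps or CloudWatch code):

import time

PERIOD = 60  # wall-clock period, in seconds

def period_start(timestamp):
    # truncate a timestamp down to the start of its wall-clock minute
    return int(timestamp) - (int(timestamp) % PERIOD)

def completed_periods(datapoints, now=None):
    # bucket (timestamp, value) pairs by period, then drop the
    # still-accumulating current minute, so only closed periods are
    # ever handed to the threshold evaluation
    now = time.time() if now is None else now
    current = period_start(now)
    buckets = {}
    for timestamp, value in datapoints:
        buckets.setdefault(period_start(timestamp), []).append(value)
    return dict((start, values) for start, values in buckets.items()
                if start < current)

Under that kind of scheme the threshold evaluation for a period only
ever sees the period's full complement of datapoints (modulo slow
reporters).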

Now, this would seem to suggest that Synaps would produce more timely
notifications, i.e. alarms would fire closer to the receipt of the
last abnormal datapoint.

However, the problem is that the eager approach also makes false
positives far more likely, due to outlier datapoints being considered
before the aggregates can be leavened by the receipt of associated
(potentially less atypical) datapoints within the same period.

Let's consider an example: say we're watching CPU util over an
autoscaling group with a 60s period, triggering an up-scale if the
average goes above some threshold.

Now we'd expect some variation in CPU util across the group, due to
randomness in the LB strategy (e.g. one instance being hammered with
more heavy-weight requests for a short period).

With delayed aggregation, most or all of the autoscaling group
instances would have had a chance to report their CPU util before the
average is taken for the previous minute.

However, with the in-stream evaluation model, IIUC the average would
be evaluated as soon as the *first* datapoint is received. If this
happened to be a high outlier from the randomly hammered instance, the
average would skew upward, i.e. be momentarily higher than it would be
if all or most datapoints for that period were considered in the
aggregate.
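
A quick back-of-the-envelope illustration (the numbers are invented):
suppose the threshold is an average CPU util of 70% and the group
normally sits around 50%.

# first report of the minute happens to come from the hammered instance
early = [95.0]
print(sum(early) / len(early))   # 95.0 -> momentarily over a 70% threshold

# once the rest of the group has reported for the same period
full = [95.0, 48.0, 52.0, 45.0, 50.0]
print(sum(full) / len(full))     # 58.0 -> comfortably under the threshold

So an alarm evaluated against the first, single-datapoint view could
fire (and potentially trigger an unnecessary up-scale) even though the
fully-populated period is nowhere near the threshold.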

If this potential issue is confirmed, then there would be several
approaches to addressing it, including:

- switch to a periodic delayed aggregation model, whereby datapoints
  are not considered until the end of the wall-clock minute in which
  they were received

- apply some adaptive quorum logic, whereby datapoints do not trigger
  state transitions until a sufficient number of samples (relative to
  the observed trend) have been received for the period under
  consideration (a rough sketch of this follows below)
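
For the second option, the quorum guard might look something along
these lines (the expected-sample-count input and the 80% fraction are
both made up, just to show the shape of the check):

def quorum_met(samples_received, expected_samples, min_fraction=0.8):
    # hold off until a sufficient fraction of the datapoints normally
    # seen for a period (e.g. derived from the observed reporting
    # trend) have actually arrived
    if expected_samples <= 0:
        return False
    return samples_received >= min_fraction * expected_samples

def evaluate(values, threshold, expected_samples, current_state='OK'):
    if not quorum_met(len(values), expected_samples):
        return current_state  # too early to judge, no state transition
    average = sum(values) / len(values)
    return 'ALARM' if average > threshold else 'OK'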

Of course delayed aggregation is not infallible either - it just makes
false positives less likely, but doesn't necessarily eliminate them in
the presence of slow reporters (so may need additional quorum-style
filtering).

Another advantage of periodic out-of-stream evaluation is that the
INSUFFICIENT_DATA evaluation becomes more symmetric - i.e. it can be
driven from one place, instead of needing to check for
INSUFFICIENT_DATA *both* in-stream and periodically (in order to
catch the case when the metric stream comes to a hard stop).
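
i.e. a single periodic pass could look roughly like this (again just a
sketch, with the state names simply mirroring the CloudWatch ones):

def evaluate_period(values, threshold):
    # the hard-stop case falls out naturally: an empty period drops the
    # alarm to INSUFFICIENT_DATA without any separate in-stream check
    if not values:
        return 'INSUFFICIENT_DATA'
    average = sum(values) / len(values)
    return 'ALARM' if average > threshold else 'OK'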

Thoughts?

Cheers,
Eoghan


