[openstack-dev] [Synaps] potential false positives in monitor threshold evaluation

Eoghan Glynn eglynn at redhat.com
Tue Oct 23 10:50:31 UTC 2012


> Currently, if users of Synaps want to avoid false
> positives, their alarms should have periods longer
> than 60 or evaluation-periods longer than 1.

True, using long periods or multiple evaluation periods would make
false positives less likely.

However, false positives wouldn't be eliminated entirely: you could
still have [evaluation_periods - 1] periods above threshold, followed
by a period whose earliest datapoint is also above threshold, but whose
subsequent datapoints are far enough below to bring the true average
for that period back under threshold (so that the alarm would not have
fired with delayed aggregation).
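
To make that concrete, here's a toy illustration with evaluation_periods
of 2 (the threshold, period layout and all the numbers below are invented
for the example, not taken from Synaps):

threshold = 10

period_1 = [11, 13, 12, 12]   # true average 12 -> above threshold
period_2 = [15, 3, 4, 2]      # true average 6  -> below threshold

# In-stream evaluation, triggered by receipt of period_2's first datapoint:
# both periods look above threshold at that instant, so the alarm fires.
early_view = [sum(period_1) / len(period_1), period_2[0]]
fires_early = all(v > threshold for v in early_view)    # True -> false positive

# Delayed aggregation, evaluating only once period_2 is complete:
late_view = [sum(period_1) / len(period_1), sum(period_2) / len(period_2)]
fires_late = all(v > threshold for v in late_view)      # False -> no alarm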

 
> I think adapting periodic out-of-stream evaluation
> could be added to our backlog. But I still have no
> idea how to spread the load.

IIUC currently the workload is partitioned across the pool of
Storm workers by metric key. So a worker handles a group of metrics
and their associated monitors, right? (i.e. a field grouping based
on metric key)
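
If so, here's a rough sketch of my mental model of that partitioning
(purely illustrative; worker_for and the metric names are invented, not
Synaps code):

def worker_for(metric_key, num_workers):
    # every datapoint for a given metric lands on the same worker,
    # which can then keep that metric's recent datapoints and its
    # associated monitors in memory
    return hash(metric_key) % num_workers

datapoints = [
    ("instance-1/CPUUtilization", 42.0),
    ("instance-2/CPUUtilization", 17.5),
    ("instance-1/CPUUtilization", 55.0),   # same metric -> same worker
]

for key, value in datapoints:
    print(key, "-> worker", worker_for(key, num_workers=4))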

Now, would a simple way of switching to periodic out-of-stream
evaluation be to do threshold evaluation only on the periodic
heartbeat, as opposed to on receipt of metric data?

i.e. only trigger the put_metric_bolt.MetricMonitor.check_alarms() call
on receipt of CHECK_METRIC_ALARM_MSG but not PUT_METRIC_DATA_MSG.

Or is that over-simplifying?
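
For what it's worth, the shape of the change as I picture it (grossly
simplified, not the actual put_metric_bolt code; only the message names
come from Synaps, and their literal values below are assumed):

CHECK_METRIC_ALARM_MSG = "check_metric_alarm"   # assumed values
PUT_METRIC_DATA_MSG = "put_metric_data"

class MetricMonitor:
    def __init__(self):
        self.datapoints = []

    def put_metric_data(self, value):
        self.datapoints.append(value)

    def check_alarms(self):
        # threshold evaluation over the buffered datapoints
        pass

def on_message(monitor, message_type, value=None):
    if message_type == PUT_METRIC_DATA_MSG:
        monitor.put_metric_data(value)
        # note: no check_alarms() call here any more -- evaluation is
        # no longer driven by data receipt
    elif message_type == CHECK_METRIC_ALARM_MSG:
        # the periodic heartbeat alone drives threshold evaluation
        monitor.check_alarms()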

Just in terms of a fair distribution of work, it seems there could be
pathological cases leading to an unbalanced partitioning, say where
certain metrics are watched by many monitors while other metrics
aren't watched at all. So another option would be for the grouping
among threshold-evaluation bolts to be based on monitor ID as opposed
to metric ID (which would probably give a fairer distribution of work,
but at the cost of losing the in-memory metrics); a toy illustration of
the skew follows.
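
With one heavily-watched metric, grouping by metric key piles all of its
evaluation work onto a single bolt, whereas grouping by monitor ID
spreads it out (the monitor counts below are made up):

from collections import Counter

monitors = (
    [("hot-metric", "alarm-%d" % i) for i in range(100)]             # heavily watched
    + [("cold-metric-%d" % i, "alarm-x%d" % i) for i in range(10)]   # barely watched
)

def load(key_index, num_workers=4):
    counts = Counter()
    for monitor in monitors:
        counts[hash(monitor[key_index]) % num_workers] += 1
    return dict(counts)

print("grouped by metric key:", load(0))   # one worker gets >= 100 monitors
print("grouped by monitor ID:", load(1))   # roughly even split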

Cheers,
Eoghan



