[openstack-dev] [Ceilometer] Approximate alarming

Eoghan Glynn eglynn at redhat.com
Mon Apr 28 11:42:30 UTC 2014



----- Original Message -----
> Hey everyone!
> 
> I’d like to get your gut reaction on an idea for the future of alarming.
> Should I or should I not put it up for debate at the design summit?

Hi Nejc,

Yes this is certainly worthy of discussion at the design summit.

Because the algorithms being discussed are quite technical, it
would be good to prepare an etherpad in advance with a gentle
introduction to the ideas involved, e.g. some worked examples etc.

This would make for a more profitable discussion at summit, where
you don't have to burn too much time explaining the intricacies of
the streaming approaches.

Cheers,
Eoghan


> ---TL;DR
> Online algorithms for computing stream statistics over sliding windows would
> allow us to provide sample statistics within an error bound (e.g. "The
> average cpu utilization in the last hour was 85% +/- 1%”), while
> significantly reducing the load and memory requirements of the computation.
>> 
> Alarm evaluation currently recalculates the aggregate values each time the
> alarm is evaluated, which is problematic because of the load it puts on the
> system. There have been multiple ideas on how to solve this problem, from
> precalculating aggregate values
> (https://wiki.openstack.org/wiki/Ceilometer/Alerting#Precalculation_of_aggregate_values)
> to re-architecting the alarms into the sample pipeline
> (https://wiki.openstack.org/wiki/Ceilometer/AlarmImprovements). While
> Sandy's suggestions make sense from the performance viewpoint, the problem
> of scalability remains. Samples in the pipeline need to be kept in-memory
> for the whole evaluation window, which requires O(N) memory for a window of
> size N.
> 
> We could tackle this problem by using cutting edge research in streaming
> algorithms, namely the papers by Datar et al. [1], and Arasu et al. [2].
> They provide algorithms for computing stream statistics over sliding
> windows, such as *count, avg, min, max* and even *percentile*, **online**
> and with polylogarithmic space requirements. The tradeoff is of course
> precision, but the algorithms are bounded on the relative error - which
> could be specified by the user.
> 
> If we can tell the user "The average cpu utilization in the last hour was 85%
> +/- 1%", would that not be enough for most use cases, while severely
> reducing the load on the system? We could still support *error_rate=0*,
> which would simply use O(N) space and provide a precise answer for the cases
> where such an answer is needed.
> 
> These algorithms were developed with telcos and computer network monitoring
> in mind, "in which information about current network performance—latency,
> bandwidth, etc.—is generated online and is used to monitor and adjust
> network performance dynamically"[1]. IIUC the main user of alarms is Heat
> autoscaling, which is exactly the kind of problem suitable to 'soft'
> calculations, with a certain tolerance for error.
> 
> [1] Datar, Mayur, et al. "Maintaining stream statistics over sliding
> windows." *SIAM Journal on Computing* 31.6 (2002): 1794-1813. PDF @
> http://ilpubs.stanford.edu:8090/504/1/2001-34.pdf
> 
> [2] Arasu, Arvind, and Gurmeet Singh Manku. "Approximate counts and quantiles
> over sliding windows." *Proceedings of the twenty-third ACM
> SIGMOD-SIGACT-SIGART symposium on Principles of database systems.* ACM,
> 2004. PDF @ http://ilpubs.stanford.edu:8090/624/1/2003-72.pdf
> 
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 



More information about the OpenStack-dev mailing list