[openstack-dev] [Ceilometer] Approximate alarming

Nejc Saje nejc.saje at xlab.si
Thu Apr 17 15:52:47 UTC 2014


Hey everyone!

I’d like to get your gut reaction on an idea for the future of alarming. Should I or should I not put it up for debate at the design summit?

---TL;DR
Online algorithms for computing stream statistics over sliding windows would allow us to provide sample statistics within an error bound (e.g. "The average cpu utilization in the last hour was 85% +/- 1%”), while significantly reducing the load and memory requirements of the computation.
—

Alarm evaluation currently recalculates the aggregate values each time the alarm is evaluated, which is problematic because of the load it puts on the system. There have been multiple ideas on how to solve this problem, from precalculating aggregate values (https://wiki.openstack.org/wiki/Ceilometer/Alerting#Precalculation_of_aggregate_values) to re-architecting the alarms into the sample pipeline (https://wiki.openstack.org/wiki/Ceilometer/AlarmImprovements). While Sandy's suggestions make sense from the performance viewpoint, the problem of scalability remains. Samples in the pipeline need to be kept in-memory for the whole evaluation window, which requires O(N) memory for a window of size N.

We could tackle this problem by using cutting edge research in streaming algorithms, namely the papers by Datar et al. [1], and Arasu et al. [2]. They provide algorithms for computing stream statistics over sliding windows, such as *count, avg, min, max* and even *percentile*, **online** and with polylogarithmic space requirements. The tradeoff is of course precision, but the algorithms are bounded on the relative error - which could be specified by the user.

If we can tell the user "The average cpu utilization in the last hour was 85% +/- 1%", would that not be enough for most use cases, while severely reducing the load on the system? We could still support *error_rate=0*, which would simply use O(N) space and provide a precise answer for the cases where such an answer is needed.

These algorithms were developed with telcos and computer network monitoring in mind, "in which information about current network performance—latency, bandwidth, etc.—is generated online and is used to monitor and adjust network performance dynamically"[1]. IIUC the main user of alarms is Heat autoscaling, which is exactly the kind of problem suitable to 'soft' calculations, with a certain tolerance for error.

[1] Datar, Mayur, et al. "Maintaining stream statistics over sliding windows." *SIAM Journal on Computing* 31.6 (2002): 1794-1813. PDF @ http://ilpubs.stanford.edu:8090/504/1/2001-34.pdf

[2] Arasu, Arvind, and Gurmeet Singh Manku. "Approximate counts and quantiles over sliding windows." *Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.* ACM, 2004. PDF @ http://ilpubs.stanford.edu:8090/624/1/2003-72.pdf
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 496 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20140417/e3e21658/attachment.pgp>


More information about the OpenStack-dev mailing list