[openstack-dev] [ceilometer] The reset on the cumulative counter

Eoghan Glynn eglynn at redhat.com
Mon Nov 26 17:11:27 UTC 2012



> 	Eglynn and I discussed how to handle the cumulative counter on IRC,
> 	and we want to continue the discussion on the ML to get more
> 	feedback. The IRC log is at http://pastebin.com/byKvhNcJ. Eglynn,
> 	please correct me if I make any mistakes.
> 
> 	The issue is that cpu_info[cpu_time] is reset to 0 when a domain is
> 	suspended and resumed. There has already been some discussion of
> 	this, and several potential solutions have been raised.
> 	Jd first discussed it in
> 	https://bugs.launchpad.net/ceilometer/+bug/1061817 and it was then
> 	discussed on the ML at
> 	http://www.mail-archive.com/openstack@lists.launchpad.net/msg17781.html.
> 	
> 	There are several approaches, such as JD's MAX() - FIRST(), or
> 	asalkeld's idea of storing the offset. Both try to detect a reset
> 	by checking whether the counter has gone backwards. (Again,
> 	Eglynn, please correct me if I'm wrong.)
> 
> 	However, during the discussion we realized that if the
> 	suspend/resume cycle is very rapid, and the ceilometer polling
> 	frequency is such that the stats still appear to be a monotonic
> 	sequence, then ceilometer will fail to detect the reset. A
> 	malicious user could exploit this situation.
> 
> 	For example, suppose ceilometer polls every 4 minutes and a
> 	suspend/resume takes 1 minute (not exact durations, just an
> 	example). If the user triggers a reset every 4 minutes, right
> 	after the ceilometer poll, and then lets the instance run for 3
> 	minutes, ceilometer can't detect the reset. Yes, this timing is
> 	difficult to achieve, but once achieved it would save the user a
> 	lot of cost.
> 
> 	My proposal is to fix this issue on the nova side, as it's a nova
> 	bug. As I understand it, the cause is the different views that
> 	openstack and libvirt have of an instance. For openstack, the
> 	instance is a logical object, and thus still exists after
> 	suspend/resume, while for libvirt it's a real object, a qemu
> 	process, which does not exist after suspend/resume. Nova should
> 	translate libvirt's view to openstack's view. This issue also
> 	affects other functionality, like nova's get_diagnostic(). But I
> 	agree with Eglynn that there are several implementation
> 	difficulties in fixing it on the nova side.


Here's another datapoint to consider: the bandwidth usage accounting
code in nova (currently only implemented for the xenapi driver) uses
a fairly basic scheme to detect resets simply by remembering the last
value and checking if the sequence appears monotonic:

  https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2928

I don't know if this code is intended to produce rock-solid or just
"advisory" stats, but it's interesting that it has taken this simple
approach.
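
To make the "remember the last value" idea concrete, here is a minimal
sketch of that kind of check. It is not the nova code linked above; the
function name and the list-of-samples input are assumptions made purely
for illustration.

```python
def detect_resets(samples):
    """Return the indices at which a cumulative counter appears to reset.

    A reset is inferred only when a sample is lower than the one before
    it, i.e. when the sequence stops looking monotonic.
    """
    resets = []
    last = None
    for index, value in enumerate(samples):
        if last is not None and value < last:
            resets.append(index)
        last = value  # remember only the most recent value
    return resets
```

For the samples [10, 20, 5, 15] this flags index 2, while a sequence
that never decreases is never flagged, whatever happened between polls.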

As discussed on IRC earlier, I'd be leery of going down the road of
enforcing a "fix" on the nova side, such that resets were detected and
adjusted for prior to being reported. Apart from the complexity of
reliably storing the previous value, there are multiple different
counters that could potentially be reset in different ways, possibly
depending on the hypervisor driver etc.

However, doing the adjustment within the metering store would always
be possible (either by storing a running offset as well as the reported
values, or switching to a delta-based measurement), as would isolating
the local maxima for max-min style queries (though we'd need to bound
the computational cost of that approach).
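
As a rough illustration of those three store-side options, here is a
sketch operating on a plain list of reported values. None of this is
ceilometer code; the function names are hypothetical, and it assumes
the counter restarts from 0 and that a reset is visible as a decrease.

```python
def adjust_with_offset(samples):
    """Running offset: add the pre-reset value to all later samples."""
    adjusted, offset, last = [], 0, None
    for value in samples:
        if last is not None and value < last:
            offset += last  # fold the pre-reset value into the offset
        adjusted.append(value + offset)
        last = value
    return adjusted


def to_deltas(samples):
    """Delta-based: store the increase since the previous sample."""
    deltas, last = [], None
    for value in samples:
        if last is None:
            deltas.append(0)  # no baseline for the first sample
        elif value < last:
            deltas.append(value)  # assumed restart from 0
        else:
            deltas.append(value - last)
        last = value
    return deltas


def usage_from_local_maxima(samples):
    """Max-min style: sum each local maximum, minus the first sample."""
    if not samples:
        return 0
    total, last = 0, samples[0]
    for value in samples[1:]:
        if value < last:
            total += last  # last was the peak before a reset
        last = value
    return total + last - samples[0]
```

For the samples [10, 20, 5, 15], the offset version yields
[10, 20, 25, 35], the deltas are [0, 10, 5, 10], and the local-maxima
query gives 25, so all three agree on the usage over the window.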

Of course that doesn't help in terms of detecting the edge case
where the reset is occurring so rapidly that the post-reset value
overtakes the last known value *before* the next sample is
taken. AFAICS this could only be addressed in code that's aware
of all the possible cases where a reset can occur, and is also in
a position to trigger the appropriate actions to enable the
adjustment (e.g. nova-compute persisting the latest stats when it
receives a 'suspend_instance' RPC message).
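
A toy simulation of that edge case, using the 4-minute poll and
1-minute suspend/resume figures from the quoted example. Everything
here is an assumption for illustration: the counter is taken to accrue
one unit per minute of run time, and OffsetTracker.on_suspend is a
hypothetical hook standing in for "persist the latest stats on
suspend", not an existing nova or ceilometer API.

```python
class OffsetTracker:
    """Hypothetical helper that records the counter just before a reset."""

    def __init__(self):
        self.offset = 0

    def on_suspend(self, counter_value):
        # Called by code that knows a reset is about to happen.
        self.offset += counter_value

    def adjusted(self, counter_value):
        return self.offset + counter_value


def simulate(cycles, poll_interval=4, downtime=1):
    """Return (raw, adjusted) polled values for a reset-after-poll user."""
    tracker = OffsetTracker()
    raw, adjusted = [], []
    counter = 0
    for _ in range(cycles):
        tracker.on_suspend(counter)  # suspend right after the last poll
        counter = 0                  # resume restarts the counter
        counter += poll_interval - downtime  # run until the next poll
        raw.append(counter)
        adjusted.append(tracker.adjusted(counter))
    return raw, adjusted


def looks_reset(samples):
    """Sample-only check: true if any value is lower than its predecessor."""
    return any(b < a for a, b in zip(samples, samples[1:]))
```

With simulate(3) the raw polled values are [3, 3, 3], which never
decrease, so a check based only on samples sees no reset, whereas the
values adjusted via the suspend hook are [3, 6, 9].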

Cheers,
Eoghan


