[openstack-dev] [ceilometer] The reset on the cumulative counter

Jiang, Yunhong yunhong.jiang at intel.com
Mon Nov 26 13:27:17 UTC 2012


Hi,
	Eglynn and I discussed how to handle the cumulative counter on IRC, and we want to continue the discussion on the ML to get more feedback. The IRC log is at http://pastebin.com/byKvhNcJ. Eglynn, please correct me if I make any mistakes.

	The issue is that cpu_info[cpu_time] is reset to 0 when a domain is suspended and resumed. There has already been some discussion of this, and several potential solutions have been raised.
	Jd first discussed this in https://bugs.launchpad.net/ceilometer/+bug/1061817, and it was then discussed on the ML at http://www.mail-archive.com/openstack@lists.launchpad.net/msg17781.html.
	
	There are several approaches to this, such as JD's MAX() - FIRST() or asalkeld's idea of storing the offset. Both try to detect whether a reset has happened by checking whether the counter has gone backwards. (Again, Eglynn, please correct me if I'm wrong.)
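
	To make the two ideas concrete, here is a rough sketch of how I read them. The function names and the sample values are made up for illustration, and this is not the actual ceilometer code; it only shows that both rely on seeing the counter go backwards between two polls.

```python
def find_resets(samples):
    """Return the indexes at which the counter went backwards."""
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]


def usage_max_minus_first(samples):
    """MAX() - FIRST() over one query period (my reading of JD's idea)."""
    return max(samples) - samples[0]


def usage_with_offset(samples):
    """Carry an offset across every detected reset (my reading of asalkeld's idea)."""
    offset = 0
    previous = samples[0]
    for current in samples[1:]:
        if current < previous:
            # Counter reversal: assume a reset and keep what was counted so far.
            offset += previous
        previous = current
    return offset + previous - samples[0]


# Hypothetical polled cpu_time values with one visible reset (30 -> 5).
samples = [10, 20, 30, 5, 15]
resets = find_resets(samples)
by_max_first = usage_max_minus_first(samples)
by_offset = usage_with_offset(samples)
```

	In this made-up series the reset is visible because 5 is smaller than 30, so the offset variant can still account for the usage after the reset.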

	However, during the discussion we realized that if the suspend/resume cycle is very rapid, and the ceilometer polling frequency is such that the stats still appear to form a monotonic sequence, then ceilometer will fail to detect the reset. A malicious user could exploit this situation.

	For example, suppose ceilometer polls every 4 minutes and a suspend/resume takes 1 minute (not exact lengths, just an example). If the user triggers a reset every 4 minutes, right after the ceilometer poll, and then runs for 3 minutes before the next poll, ceilometer never sees the counter go backwards and so cannot detect the reset. Yes, this timing is difficult to achieve, but once achieved it would save the user a lot of cost.
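
	The timing above can be simulated in a few lines. This is only a sketch under the assumptions of the example (4-minute polling, 1-minute suspend/resume, counter measured in minutes of CPU time); the names are hypothetical and it does not talk to libvirt or ceilometer.

```python
POLL_INTERVAL = 4      # minutes between two ceilometer polls (example value)
SUSPEND_RESUME = 1     # minutes spent in suspend/resume (example value)
RUN_TIME = POLL_INTERVAL - SUSPEND_RESUME  # minutes of real work per cycle


def simulate(cycles):
    """Return (values seen at each poll, CPU minutes really consumed)."""
    polled = []
    actual = 0
    for _ in range(cycles):
        counter = 0            # suspend/resume right after the previous poll
        counter += RUN_TIME    # the instance runs until the next poll
        actual += RUN_TIME
        polled.append(counter)
    return polled, actual


def has_reversal(samples):
    """True if any sample is smaller than the one before it."""
    return any(b < a for a, b in zip(samples, samples[1:]))


def measured_usage(samples):
    """Usage as seen by a reversal-based detector that carries an offset."""
    offset = 0
    previous = samples[0]
    for current in samples[1:]:
        if current < previous:
            offset += previous
        previous = current
    return offset + previous - samples[0]


polled, actual = simulate(5)
detected = has_reversal(polled)
measured = measured_usage(polled)
```

	Every poll sees the same value, so the series never decreases, no reset is detected, and the usage measured between the first and last poll is zero although the instance consumed 3 minutes of CPU in every cycle.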

	My proposal is to fix this issue on the nova side, as it's a nova bug. As I understand it, the cause is the different views that openstack and libvirt have of an instance. For openstack, the instance is a logical object and thus still exists after suspend/resume, while for libvirt it is a real object, a qemu process, which does not survive suspend/resume. Nova should translate libvirt's view into openstack's view. This issue also affects other functionality, like nova's get_diagnostic(). But I agree with Eglynn that there are several implementation difficulties in fixing it on the nova side.
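
	As a very rough illustration of what "translating the view" could mean, the sketch below keeps a per-instance base value that is updated when the instance is suspended, so the reported counter keeps growing across qemu processes. The class and method names are hypothetical; nothing like this exists in nova as far as I know, and it ignores the implementation difficulties mentioned above (for example, where to persist the base value).

```python
class LogicalCpuTime:
    """Hypothetical per-instance accumulator, not an existing nova class."""

    def __init__(self):
        # CPU time already consumed by earlier qemu processes of this instance.
        self.base = 0

    def on_suspend(self, raw_cpu_time):
        """Fold the last raw reading into the base before the process goes away."""
        self.base += raw_cpu_time

    def report(self, raw_cpu_time):
        """Translate the raw per-process value into the per-instance value."""
        return self.base + raw_cpu_time


counter = LogicalCpuTime()
before_suspend = counter.report(3)   # first qemu process has used 3 units
counter.on_suspend(3)                # instance is suspended
after_resume = counter.report(0)     # new process starts counting from 0
later = counter.report(3)            # new process has used 3 more units
```

	Because the suspend is seen as an explicit event here rather than inferred from a counter reversal, the polling interval no longer matters for correctness.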

	I hope to get feedback from more people.

Thanks
--jyh


