[openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

Tim Bell Tim.Bell at cern.ch
Mon Mar 17 20:04:04 UTC 2014

At CERN, we've had similar issues when enabling telemetry. Our resource-list call times out after 10 minutes, when the HA proxies assume no answer is coming back. Per-cell Keystone instances have helped the situation a little, so we can collect the data, but there has been a significant increase in load on the API endpoints.

I feel that some form of production-scale validation would be beneficial as part of TC approval to leave incubation, so that issues such as this can be addressed beforehand.


> -----Original Message-----
> From: Jay Pipes [mailto:jaypipes at gmail.com]
> Sent: 17 March 2014 20:25
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
> Yep. At AT&T, we had to disable calls to GET /resources without any filters on it. The call would return hundreds of thousands of
> records, all being JSON-ified at the Ceilometer API endpoint, and the result would take minutes to return. There was no default limit
> on the query, which meant every single record in the database was returned, and on even a semi-busy system that meant
> horrendous performance.
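As a minimal sketch of the missing safeguard (hypothetical names, not the actual Ceilometer code): an API handler can apply a default limit whenever the caller supplies none, so an unfiltered GET /resources can never pull the whole table.

```python
# Sketch only: illustrates a default query limit. The function and
# constant names here are hypothetical, not Ceilometer's real API.

DEFAULT_LIMIT = 100


def build_query(filters=None, limit=None):
    """Return query parameters with a sane default limit applied."""
    filters = filters or {}
    if limit is None:
        # Without this cap, an unfiltered call would serialize every
        # row in the database into one giant JSON response.
        limit = DEFAULT_LIMIT
    return {"filters": filters, "limit": limit}


print(build_query())           # unfiltered call gets the default cap
print(build_query(limit=500))  # an explicit limit is still honored
```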
> Besides the problem that the SQLAlchemy driver doesn't yet support pagination [1], the main problem with the get_resources() call is
> that the underlying database schema for the Sample model is wacky and forces the use of a dependent subquery in the WHERE clause
> [2], which completely kills the performance of the query to get resources.
> [1]
> https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
> [2]
> https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503
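To make the subquery problem concrete, here is a toy reproduction using sqlite3 stand-ins. The table and column names are illustrative, not Ceilometer's actual schema; the point is only the shape of the two queries: a dependent (correlated) subquery in the WHERE clause is conceptually re-evaluated per candidate row, while an equivalent GROUP BY makes one pass over the table.

```python
# Toy comparison: correlated subquery vs. GROUP BY rewrite.
# Schema and data are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY,
                         resource_id TEXT,
                         timestamp TEXT);
    INSERT INTO sample (resource_id, timestamp) VALUES
        ('r1', '2014-03-01'), ('r1', '2014-03-02'),
        ('r2', '2014-03-01');
""")

# Dependent subquery in the WHERE clause: correlated on s.resource_id,
# so it is logically re-run for every row of the outer scan.
slow = conn.execute("""
    SELECT DISTINCT resource_id FROM sample s
    WHERE timestamp = (SELECT MAX(timestamp) FROM sample
                       WHERE resource_id = s.resource_id)
""").fetchall()

# Equivalent aggregate: one pass, no per-row subquery.
fast = conn.execute("""
    SELECT resource_id, MAX(timestamp) FROM sample
    GROUP BY resource_id
""").fetchall()

print(sorted(slow))
print(sorted(fast))
```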
> > The cli tests are supposed to be quick read-only sanity checks of the
> > cli functionality and really shouldn't ever be on the list of slowest
> > tests for a gate run.
> Oh, the test is read-only all right. ;) It's just that it's reading hundreds of thousands of records.
> > I think there was possibly a performance regression recently in
> > ceilometer, because from what I can tell this test used to normally take ~60 sec.
> > (which honestly is probably too slow for a cli test too) but it is
> > currently much slower than that.
> >
> > From logstash it seems there are still some cases where the resource
> > list takes only as long to execute as it used to, but the majority of runs take much longer:
> > http://goo.gl/smJPB9
> >
> > In the short term I've pushed out a patch that will remove this test
> > from gate
> > runs: https://review.openstack.org/#/c/81036 But, I thought it would
> > be good to bring this up on the ML to try and figure out what changed
> > or why this is so slow.
> I agree with removing the test from the gate in the short term. Medium to long term, the root causes of the problem (that GET
> /resources has no support for pagination on the query, there is no default for limiting results based on a since timestamp, and that
> the underlying database schema is non-optimal) should be addressed.
> Best,
> -jay
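One possible shape for the pagination Jay mentions is marker-based (keyset) paging: the client passes the last id it saw and a limit, and the server returns the next bounded slice. The sketch below uses sqlite3 and made-up names to show the idea; it is not Ceilometer's actual API.

```python
# Sketch of marker-based (keyset) pagination over a resource table.
# Names and schema are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO resource VALUES (?)",
                 [("r%03d" % i,) for i in range(10)])


def list_resources(marker=None, limit=3):
    """Return at most `limit` resource ids after `marker`, in id order."""
    if marker is None:
        rows = conn.execute(
            "SELECT id FROM resource ORDER BY id LIMIT ?", (limit,))
    else:
        # The PRIMARY KEY index satisfies both the range filter and
        # the ordering, so each page is a cheap bounded scan.
        rows = conn.execute(
            "SELECT id FROM resource WHERE id > ? ORDER BY id LIMIT ?",
            (marker, limit))
    return [r[0] for r in rows]


page1 = list_resources()                  # first bounded page
page2 = list_resources(marker=page1[-1])  # next page, picks up after it
print(page1, page2)
```

Unlike OFFSET-based paging, the marker approach stays cheap on deep pages because each request is an index range scan rather than a skip over all earlier rows.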
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
