[openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

Tim Bell Tim.Bell at cern.ch
Mon Mar 17 20:04:04 UTC 2014


At CERN, we've had similar issues when enabling telemetry. Our resource-list call times out after 10 minutes, at which point the HA proxies assume that no answer is coming back. Keystone instances per cell have helped the situation a little, so we can collect the data, but there was a significant increase in load on the API endpoints.

I feel that some reference for production-scale validation would be beneficial as part of TC approval to leave incubation, in case there are issues such as this one that need to be addressed.

Tim

> -----Original Message-----
> From: Jay Pipes [mailto:jaypipes at gmail.com]
> Sent: 17 March 2014 20:25
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
> 
...
> 
> Yep. At AT&T, we had to disable calls to GET /resources without any filters on it. The call would return hundreds of thousands of
> records, all being JSON-ified at the Ceilometer API endpoint, and the result would take minutes to return. There was no default limit
> on the query, which meant every single record in the database was returned, and on even a semi-busy system, that meant
> horrendous performance.
> 
> Besides the problem that the SQLAlchemy driver doesn't yet support pagination [1], the main problem with the get_resources() call is
> that the underlying database schema for the Sample model is wacky, and forces the use of a dependent subquery in the WHERE clause
> [2] which completely kills performance of the query to get resources.
> 
> [1]
> https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
> [2]
> https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503
> 
> > The cli tests are supposed to be quick read-only sanity checks of the
> > cli functionality and really shouldn't ever be on the list of slowest
> > tests for a gate run.
> 
> Oh, the test is read-only all right. ;) It's just that it's reading hundreds of thousands of records.
> 
> >  I think there was possibly a performance regression recently in
> > ceilometer because from what I can tell this test used to normally take ~60 sec.
> > (which honestly is probably too slow for a cli test too) but it is
> > currently much slower than that.
> >
> > From logstash it seems there are still some cases when the resource
> > list takes as long to execute as it used to, but the majority of runs take a long time:
> > http://goo.gl/smJPB9
> >
> > In the short term I've pushed out a patch that will remove this test
> > from gate
> > runs: https://review.openstack.org/#/c/81036 But, I thought it would
> > be good to bring this up on the ML to try and figure out what changed
> > or why this is so slow.
> 
> I agree with removing the test from the gate in the short term. Medium to long term, the root causes of the problem (that GET
> /resources has no support for pagination on the query, there is no default for limiting results based on a since timestamp, and that
> the underlying database schema is non-optimal) should be addressed.
> 
> Best,
> -jay
> 
> 
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
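
Two of the root causes named in the quoted message (no pagination on GET /resources and no default limit on the query) can be illustrated with a minimal sketch. This is not Ceilometer code: the `resource` table, the `list_resources()` helper, and the `DEFAULT_LIMIT` value are all hypothetical, and an in-memory SQLite database stands in for the real backend. It only shows the general idea of a default limit combined with marker-based paging, so that no single call returns every record.

```python
import sqlite3

DEFAULT_LIMIT = 100  # hypothetical default page size, not a Ceilometer setting


def list_resources(conn, limit=None, marker=None):
    """Return one page of resource ids and the marker for the next page."""
    if limit is None:
        limit = DEFAULT_LIMIT  # never fall back to "return everything"
    # Marker (keyset) pagination: resume strictly after the last id returned.
    rows = conn.execute(
        "SELECT resource_id FROM resource "
        "WHERE (? IS NULL OR resource_id > ?) "
        "ORDER BY resource_id LIMIT ?",
        (marker, marker, limit),
    ).fetchall()
    ids = [row[0] for row in rows]
    # A full page means there may be more rows; a short page means we are done.
    next_marker = ids[-1] if len(ids) == limit else None
    return ids, next_marker


def demo():
    """Build a toy table of 250 resources and walk it page by page."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resource (resource_id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO resource VALUES (?)",
        [("res-%04d" % i,) for i in range(250)],
    )
    pages = []
    marker = None
    while True:
        ids, marker = list_resources(conn, marker=marker)
        pages.append(ids)
        if marker is None:
            break
    return pages
```

Calling `demo()` walks the toy table in bounded pages instead of one unbounded result set. A real fix would also need the since-timestamp default and the schema changes mentioned in the quoted message, which this sketch does not attempt.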


