[openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

Jay Pipes jaypipes at gmail.com
Mon Mar 17 19:24:30 UTC 2014


On Mon, 2014-03-17 at 14:55 -0400, Matthew Treinish wrote:
> Hi everyone,
> 
> So a little while ago we noticed that in all the gate runs one of the ceilometer
> cli tests is consistently in the list of slowest tests (and often the slowest).
> This was a bit surprising: given the nature of the cli tests, we expect them to
> execute very quickly.
> 
> test_ceilometer_resource_list which just calls ceilometer resource_list from the
> CLI once is taking >=2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes > 3min)

Yep. At AT&T, we had to disable calls to GET /resources without any
filters on it. The call would return hundreds of thousands of records,
all being JSON-ified at the Ceilometer API endpoint, and the result
would take minutes to return. There was no default limit on the query,
which meant every single record in the database was returned, and on
even a semi-busy system, that meant horrendous performance.
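
As a rough illustration of what a default limit buys, here is a minimal
sketch using Python's stdlib sqlite3. The `resource` table, the
`list_resources()` helper and the `DEFAULT_LIMIT` value are all
hypothetical and are not Ceilometer's actual schema or API:

```python
import sqlite3

DEFAULT_LIMIT = 100  # hypothetical default, not a real Ceilometer setting


def list_resources(conn, limit=None):
    """Return at most `limit` resource ids, falling back to DEFAULT_LIMIT."""
    if limit is None:
        limit = DEFAULT_LIMIT
    cur = conn.execute(
        "SELECT id FROM resource ORDER BY id LIMIT ?", (limit,)
    )
    return [row[0] for row in cur]


def make_demo_db(n):
    """Build an in-memory table with n made-up resource rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resource (id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO resource (id) VALUES (?)",
        [("res-%06d" % i,) for i in range(n)],
    )
    return conn
```

With this shape, an unfiltered call is capped at DEFAULT_LIMIT rows no
matter how many rows the table holds, and a caller who wants more has to
ask for it explicitly.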

Besides the problem that the SQLAlchemy driver doesn't yet support
pagination [1], the main problem with the get_resources() call is that
the underlying database schema for the Sample model is wacky and forces
the use of a dependent subquery in the WHERE clause [2], which completely
kills performance of the query to get resources.

[1]
https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
[2]
https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503
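
To show what a dependent subquery looks like next to an equivalent join,
here is a small sketch against a made-up `sample` table in sqlite3. The
table, columns and queries are assumptions for illustration only and are
not the query Ceilometer actually issues:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; not Ceilometer's real Sample schema.
conn.execute("CREATE TABLE sample (resource_id TEXT, ts INTEGER, volume REAL)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [("a", 1, 10.0), ("a", 2, 20.0), ("a", 3, 30.0),
     ("b", 4, 40.0), ("b", 5, 50.0)],
)

# Dependent (correlated) subquery: the inner SELECT refers to the outer
# row, so a naive planner re-evaluates it for every row of `sample`.
DEPENDENT = """
SELECT s.resource_id, s.ts, s.volume
FROM sample s
WHERE s.ts = (SELECT MAX(s2.ts) FROM sample s2
              WHERE s2.resource_id = s.resource_id)
ORDER BY s.resource_id
"""

# Same result via a join against a grouped derived table, which can be
# computed once and then joined.
JOINED = """
SELECT s.resource_id, s.ts, s.volume
FROM sample s
JOIN (SELECT resource_id, MAX(ts) AS max_ts
      FROM sample GROUP BY resource_id) latest
  ON latest.resource_id = s.resource_id AND latest.max_ts = s.ts
ORDER BY s.resource_id
"""


def latest_samples(conn, sql):
    """Run one of the queries above and return all rows."""
    return conn.execute(sql).fetchall()
```

Both queries return the latest sample per resource. How much the second
form helps depends on the database and its query planner, so treat it as
something to measure rather than a guaranteed fix.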

> The cli tests are supposed to be quick read-only sanity checks of the cli
> functionality and really shouldn't ever be on the list of slowest tests for a
> gate run.

Oh, the test is read-only, all right. ;) It's just that it's reading
hundreds of thousands of records.

>  I think there was possibly a performance regression recently in
> ceilometer because from what I can tell this test used to normally take ~60 sec.
> (which honestly is probably too slow for a cli test too) but it is currently
> much slower than that.
> 
> From logstash it seems there are still some cases when the resource list takes
> as long to execute as it used to, but the majority of runs take a long time:
> http://goo.gl/smJPB9
> 
> In the short term I've pushed out a patch that will remove this test from gate
> runs: https://review.openstack.org/#/c/81036 But, I thought it would be good to
> bring this up on the ML to try and figure out what changed or why this is so
> slow.

I agree with removing the test from the gate in the short term. Medium
to long term, the root causes of the problem (that GET /resources has no
support for pagination on the query, there is no default for limiting
results based on a since timestamp, and that the underlying database
schema is non-optimal) should be addressed.
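
One possible shape for the first two items, sketched with sqlite3: a
marker-based page query with an optional since filter. The
`page_resources()` helper, the `resource` table and its `last_seen`
column are hypothetical names chosen for the example, not anything in
Ceilometer today:

```python
import sqlite3


def page_resources(conn, limit, marker=None, since=None):
    """Return one page of (id, last_seen) rows ordered by id.

    marker: last id of the previous page (None for the first page).
    since:  only include resources seen at or after this timestamp.
    """
    clauses, params = [], []
    if marker is not None:
        clauses.append("id > ?")
        params.append(marker)
    if since is not None:
        clauses.append("last_seen >= ?")
        params.append(since)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = "SELECT id, last_seen FROM resource" + where + " ORDER BY id LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


# Tiny in-memory data set to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id TEXT PRIMARY KEY, last_seen INTEGER)")
conn.executemany(
    "INSERT INTO resource VALUES (?, ?)",
    [("r%02d" % i, i) for i in range(10)],
)

first = page_resources(conn, limit=3)
second = page_resources(conn, limit=3, marker=first[-1][0])
recent = page_resources(conn, limit=3, since=8)
```

Passing the last id of one page as the marker for the next keeps each
request bounded by the limit, and the since filter narrows the scan
before the limit is applied.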

Best,
-jay



