[openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

Sampath Priyankara sampath.priyankara at lab.ntt.co.jp
Thu Mar 20 10:28:12 UTC 2014


Hi,

  Re-architecting the schema might fix most of the performance issues of
resource_list.
  We also need to do some work to improve the performance of meter-list.
  Is Gordon's blueprint going to cover both aspects?
  https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql
  
Sampath

-----Original Message-----
From: Neal, Phil [mailto:phil.neal at hp.com] 
Sent: Wednesday, March 19, 2014 12:17 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list
CLI command


> -----Original Message-----
> From: Tim Bell [mailto:Tim.Bell at cern.ch]
> Sent: Monday, March 17, 2014 2:04 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer 
> resource_list CLI command
> 
> 
> At CERN, we've had similar issues when enabling telemetry. Our 
> resource-list times out after 10 minutes when the proxies for HA 
> assume there is no answer coming back. Keystone instances per cell 
> have helped the situation a little so we can collect the data but 
> there was a significant increase in load on the API endpoints.
> 
> I feel that some reference for production scale validation would be 
> beneficial as part of TC approval to leave incubation in case there 
> are issues such as this to be addressed.
> 
> Tim
> 
> > -----Original Message-----
> > From: Jay Pipes [mailto:jaypipes at gmail.com]
> > Sent: 17 March 2014 20:25
> > To: openstack-dev at lists.openstack.org
> > Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer
> > resource_list CLI command
> >
> ...
> >
> > Yep. At AT&T, we had to disable calls to GET /resources without any
> > filters on it. The call would return hundreds of thousands of records,
> > all being JSON-ified at the Ceilometer API endpoint, and the result
> > would take minutes to return. There was no default limit on the query,
> > which meant every single record in the database was returned, and on
> > even a semi-busy system, that meant horrendous performance.
> >
> > Besides the problem that the SQLAlchemy driver doesn't yet support
> > pagination [1], the main problem with the get_resources() call is that
> > the underlying database schema for the Sample model is wacky, and
> > forces the use of a dependent subquery in the WHERE clause [2], which
> > completely kills performance of the query to get resources.
> >
> > [1] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
> > [2] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503
> >
> > > The cli tests are supposed to be quick read-only sanity checks of 
> > > the cli functionality and really shouldn't ever be on the list of 
> > > slowest tests for a gate run.
> >
> > Oh, the test is read-only all right. ;) It's just that it's reading
> > hundreds of thousands of records.
> >
> > > I think there was possibly a performance regression recently in
> > > ceilometer because from what I can tell this test used to normally
> > > take ~60 sec. (which honestly is probably too slow for a cli test
> > > too) but it is currently much slower than that.
> > >
> > > From logstash it seems there are still some cases when the
> > > resource list takes as long to execute as it used to, but the
> > > majority of runs take a long time:
> > > http://goo.gl/smJPB9
> > >
> > > In the short term I've pushed out a patch that will remove this
> > > test from gate runs: https://review.openstack.org/#/c/81036
> > > But I thought it would be good to bring this up on the ML to try and
> > > figure out what changed or why this is so slow.
> >
> > I agree with removing the test from the gate in the short term.
> > Medium to long term, the root causes of the problem (that GET
> > /resources has no support for pagination on the query, there is no
> > default for limiting results based on a since timestamp, and that the
> > underlying database schema is non-optimal) should be addressed.
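
For illustration only, the two API-level fixes named above (a default
limit and pagination) can be sketched with Python's standard sqlite3
module. This is not Ceilometer code: the `resource` table, DEFAULT_LIMIT,
list_resources() and make_demo_db() are all hypothetical names chosen for
this sketch, and the marker scheme is one common way to paginate, not
necessarily what the project will adopt.

```python
import sqlite3

DEFAULT_LIMIT = 100  # hypothetical default page size, not a Ceilometer setting


def list_resources(conn, limit=None, marker=None):
    """Return (ids, next_marker) for one page of resources ordered by id."""
    # Always cap the page size so an unfiltered call cannot return every row.
    page_size = DEFAULT_LIMIT if limit is None else min(limit, DEFAULT_LIMIT)
    if marker is None:
        rows = conn.execute(
            "SELECT id FROM resource ORDER BY id LIMIT ?", (page_size,)
        ).fetchall()
    else:
        # Resume strictly after the last id the caller has already seen.
        rows = conn.execute(
            "SELECT id FROM resource WHERE id > ? ORDER BY id LIMIT ?",
            (marker, page_size),
        ).fetchall()
    ids = [row[0] for row in rows]
    # A full page may have more rows behind it; a short page means we are done.
    next_marker = ids[-1] if ids and len(ids) == page_size else None
    return ids, next_marker


def make_demo_db(n):
    """Build an in-memory table with n fake resource ids (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resource (id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO resource (id) VALUES (?)",
        [("res-%05d" % i,) for i in range(n)],
    )
    return conn
```

A caller would pass the returned next_marker back in as marker until it
comes back as None, so no single request has to serialize the whole table.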

Gordon has introduced a blueprint,
https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql , with some
fixes for individual queries, but +1 to the point of looking at
re-architecting the schema as an approach to fixing performance. We've also
seen some gains here at HP using batch writes, but have temporarily
tabled that work in favor of getting a better-performing schema in place.
- Phil
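
For readers unfamiliar with the "dependent subquery" Jay mentions in the
quoted text, here is a rough sketch of the general pattern, again with
sqlite3. The `sample` table and its columns are made up for this example
and are not the real Ceilometer schema, and neither query is the actual
get_resources() query. The first query re-evaluates an inner SELECT that
refers to the outer row; the second computes the per-resource maximum once
and joins against it. Both return the same rows; whether the second is
faster depends on the database engine and its indexes.

```python
import sqlite3

# Dependent (correlated) subquery: the inner SELECT references the outer row.
CORRELATED = """
SELECT s.resource_id, s.timestamp, s.volume
FROM sample AS s
WHERE s.timestamp = (
    SELECT MAX(s2.timestamp) FROM sample AS s2
    WHERE s2.resource_id = s.resource_id
)
ORDER BY s.resource_id
"""

# Equivalent form: aggregate once in a derived table, then join against it.
JOINED = """
SELECT s.resource_id, s.timestamp, s.volume
FROM sample AS s
JOIN (
    SELECT resource_id, MAX(timestamp) AS latest
    FROM sample GROUP BY resource_id
) AS m ON m.resource_id = s.resource_id AND m.latest = s.timestamp
ORDER BY s.resource_id
"""


def make_sample_db():
    """Build a tiny in-memory table of fake samples (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample (resource_id TEXT, timestamp INTEGER, volume REAL)"
    )
    rows = [
        ("vm-a", 1, 0.5), ("vm-a", 2, 0.7),
        ("vm-b", 1, 1.0), ("vm-b", 3, 1.5),
        ("vm-c", 5, 2.0),
    ]
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", rows)
    return conn


def latest_samples(conn, query):
    """Run one of the two queries and return the latest sample per resource."""
    return conn.execute(query).fetchall()
```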

> >
> > Best,
> > -jay
> >
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 
