[openstack-dev] [Openstack][Nova] Slow Nova API Calls from periodic-tempest-devstack-vm-check-hourly

Vishvananda Ishaya vishvananda at gmail.com
Thu Jan 10 16:34:30 UTC 2013


On Jan 9, 2013, at 6:56 PM, Joe Gordon <jogo at cloudscaling.com> wrote:

> I have started to go through everyone's suggestions, but I thought I would share some more preliminary findings.
> 
> Now that millisecond logging is enabled, the logs show a clearer picture of what is going on.
> 
> Here is stack trace from devstack for the command DisassociateAddress, (the slowest REST call triggered by the openstack exercises): 
> 
> http://paste.openstack.org/show/29103/ (from local build)
> http://paste.openstack.org/show/29104/ (from logs.openstack.org)
> 
> Note: neither pastebin has the full RPC message contents in it.
> 
> 
> Analysis:
> 
> http://paste.openstack.org/show/29103/ 
> 
> Total time: 4.8428030 seconds
> 
> * the 3 RPC calls take: 65ms, 50ms and 107ms
> 
> * RPC msg 79bbafe8079841e694640a7435f5d52f takes 1.174 seconds (7.080-5.906) to perform network_api.get_instance_id_by_floating_address, which consists of just 3 DB calls.

This can be optimized heavily (although we should probably make a new method get_instance_uuid… for compatibility):
 a) the note is incorrect, we can return the uuid and skip the last call 
 b) the two calls can be converted into a single db call with a join (we only need the instance uuid)
 c) we can move this up to network.api and skip rpc completely (see my comments below about associate)
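As a rough sketch of (b) -- the model, column, and helper names here are from memory of the grizzly-era sqlalchemy layer and may not match the code exactly -- the two lookups collapse into one query:

    # hypothetical db api method: resolve a floating address straight to
    # an instance uuid in a single query instead of three round trips
    from nova.db.sqlalchemy import models
    from nova.db.sqlalchemy.session import get_session

    def floating_ip_get_instance_uuid(context, address):
        session = get_session()
        row = session.query(models.Instance.uuid).\
            join(models.FixedIp,
                 models.FixedIp.instance_uuid == models.Instance.uuid).\
            join(models.FloatingIp,
                 models.FloatingIp.fixed_ip_id == models.FixedIp.id).\
            filter(models.FloatingIp.address == address).\
            first()
        return row[0] if row else None

Only the uuid column is selected, so nothing else has to be hydrated.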

> 
> * RPC msg 4c9055aa7a034d6f85e99df7dc0a8fcc takes 3.109 seconds (10.191-7.082) to perform network_api.disassociate_floating_ip
> 
> * RPC msg 45841e95f05d4003a00bc059fade10b0 takes 0.3359 seconds (10.527-10.191) to perform network_api.get_instance_nw_info
> 
> * in RPC msg 4c9055aa7a034d6f85e99df7dc0a8fcc, n-net waits 0.604 seconds (8.410-7.806) attempting to grab the file lock "iptables", but the n-net log shows nothing else running (bad lock, or lack of logging?)  [Note: I added logging for when a file lock is released, and nothing releases the file lock between "attempting" and "got".]

The iptables lock is external, so it is possible that n-cpu holds the lock during this time.
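For reference, this is roughly the pattern (a sketch, not the exact code): both n-net and n-cpu wrap their iptables changes in the same named external lock, so they serialize against each other through a lock file on disk:

    from nova import utils

    class IptablesManager(object):
        def _apply(self):
            # ... build and run the iptables commands ...
            pass

        @utils.synchronized('iptables', external=True)
        def apply(self):
            # any other service on this box holding the same named
            # external lock (e.g. n-cpu's firewall driver) blocks us here
            self._apply()

So a 0.6 second wait with nothing in the n-net log is consistent with n-cpu holding the file lock.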

> 
> * iptables commands take 2.52 seconds (10.93-8.41)

This seems like the best place to start optimizing. It looks like we could tighten up the iptables code a bit; for example, we could save and restore all tables at once instead of doing each of them individually. Most of the time appears to be spent in the actual shell commands rather than in the python code.
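To illustrate the idea (not a patch -- the splice helper below is made up): one iptables-save/iptables-restore round trip covering every table, instead of one pair per table:

    import subprocess

    def apply_all(new_rules_by_table):
        # dump every table once, splice our chains/rules into the dump,
        # and load everything back with a single iptables-restore
        current = subprocess.check_output(['iptables-save'])
        updated = splice_rules(current, new_rules_by_table)  # hypothetical helper
        proc = subprocess.Popen(['iptables-restore'], stdin=subprocess.PIPE)
        proc.communicate(updated)

That drops the shell invocations down to one save and one restore per ip version, which matters if the time really is in the forks rather than in python.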

> 
> 
> Further Questions:
> 
> * Why is network_api.get_instance_id_by_floating_address so slow? Why do we even need it, if it's just doing DB calls?

This is probably left over from when quantum was on the receiving end of the api call. Now that quantum replaces network.api, we could move these calls up to the api layer and save an rpc round trip. There is actually quite a bit of room for optimization here. Another big slowdown is that in multi-host mode the message is sent to 'any' network host and then rebroadcast to the right host.
We can instead send the message to the right host from the api side and avoid the rebroadcast. This should save a lot of unnecessary traffic and time. It can be done trivially for associate_floating and disassociate_floating. Allocate and deallocate network can be done as well, although we have to move the calling side up to compute.api where it has db access.
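A sketch of what the api-side targeting could look like (import paths, queue helper, flag, and field names are from memory of the code at the time and may differ):

    from nova import flags
    from nova.openstack.common import rpc

    FLAGS = flags.FLAGS

    def _call_network_host(self, context, floating_address, msg):
        # find which network host owns this floating ip and address the
        # message directly, instead of letting 'any' network host rebroadcast
        floating_ip = self.db.floating_ip_get_by_address(context,
                                                         floating_address)
        topic = rpc.queue_get_for(context, FLAGS.network_topic,
                                  floating_ip['host'])
        return rpc.call(context, topic, msg)

The extra db lookup on the api side is cheap compared to the hop through a random network host.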

> * Why are all three calls synchronous? Can we do the iptables commands asynchronously?

We may be able to fire off the disassociate at the same time; the nw_info call has to happen afterwards, though.
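Something along these lines, if we go that way (the greenthread split and argument order are just an illustration, not measured):

    import eventlet

    def _disassociate_and_refresh(self, context, instance, address):
        # kick off the slow disassociate in a greenthread so this thread
        # is not blocked while iptables churns on the network host
        disassociating = eventlet.spawn(
            self.network_api.disassociate_floating_ip,
            context, instance, address)
        # ... anything that does not depend on the disassociate runs here ...
        disassociating.wait()  # nw_info refresh still has to come afterwards
        return self.network_api.get_instance_nw_info(context, instance)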
> 
> * How much time is spent in nova.db.api vs what sql itself reports (aka sqlalchemy overhead).

When I last looked it was mostly sqlalchemy, especially where it is hydrating multiple models.
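For anyone poking at this, the difference is easy to see with a column-only query, which skips model hydration entirely and returns plain tuples (a sketch, model and column names assumed):

    from nova.db.sqlalchemy import models
    from nova.db.sqlalchemy.session import get_session

    session = get_session()
    project_id = 'some-project'  # placeholder

    # fully hydrated model objects: every column is loaded and each row
    # is mapped into an Instance object
    instances = session.query(models.Instance).\
        filter_by(project_id=project_id).all()

    # column-only query: lightweight tuples, far less sqlalchemy work per row
    rows = session.query(models.Instance.uuid, models.Instance.host).\
        filter_by(project_id=project_id).all()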

> 
> Patches for better logging/debugging:
> 
> * https://review.openstack.org/#/c/19354/ - properly format milliseconds 
> * https://review.openstack.org/#/c/19076/ - enable mysql slow log in devstack
> * https://review.openstack.org/#/c/19358/ - log when release file lock
> 
> best,
> Joe
> 
> 
> On Mon, Jan 7, 2013 at 12:18 AM, Monty Taylor <mordred at inaugust.com> wrote:
> 
> 
> On 01/07/2013 07:55 AM, Monty Taylor wrote:
> >
> >
> > On 01/07/2013 06:46 AM, Mark McLoughlin wrote:
> >> On Fri, 2013-01-04 at 11:58 -0800, Joe Gordon wrote:
> >>> Conclusion
> >>> =========
> >>>
> >>> Nova REST calls take way too long, even in small-scale devstack/tempest
> >>> testing.  We should spend resources on fixing this for Grizzly.  I would
> >>> like to start a dialog on how we can fix this and see if anyone wants to
> >>> volunteer help.  Below are a few possible ideas of things that will help us
> >>> make the REST calls faster.
> >>>
> >>> * Continuously track REST API times of devstack/tempest jobs, to identify
> >>> slow API calls and watch for trends.  (logstash + statsd?)
> >>> * Identify the underlying cause of the slow performance.  It looks like
> >>> nova.db.api may be partially responsible, but not the only issue?  What
> >>> else is a bottleneck?
> >>> * Enable millisecond level time logging
> >>> https://review.openstack.org/#/c/18933/
> >>> * Add an option in devstack to enable python profiling?  What's the best way to
> >>> do this?
> >>> * Make more REST API related code asynchronous (via RPC)
> >>
> >> Summary is we're randomly seeing a multi-second delay in nova-api
> >> between the time an HTTP request is received and when we send an AMQP message?
> >>
> >> Suspecting the DB calls is a good first guess. The ability to log slow
> >> DB calls would be nice. Perhaps MySQL has something we can use here,
> >> assuming we log some sort of client ID on the nova-api side that we can
> >> use to correlate with the MySQL logs.
> >
> > There is, you'll be happy to know, a MySQL feature called the slow query
> > log. You can enable it by adding:
> >
> > log_slow_queries       = /var/log/mysql/mysql-slow.log
> >
> > To the my.cnf. Additionally, you can put in the following:
> >
> > long_query_time = 1
> > log-queries-not-using-indexes
> >
> > The first is a threshold, in seconds, of how long a query needs to last
> > before it's considered slow. The second will cause all queries that did
> > not use indexes as part of their run to be logged.
> 
> https://review.openstack.org/19075
> https://review.openstack.org/19076
> 
> > Getting client ids passed to the db log for correlation will be
> > trickier. A lot of the big web properties will do tricks in their db
> > layer where they will add embedded comments into the sql passed in that
> > include information like client ids. I'm not sure how hard that sort of
> > tracing would be to do on our end, but turning on the slow query log
> > while doing these sorts of things is probably a great starting place.
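To make the comment-embedding idea concrete (purely hypothetical, not something nova does today), the request id could be appended to each statement as a SQL comment so slow-log entries correlate back to the api request:

    # hypothetical: tag each statement with the request id so entries in
    # the mysql slow query log can be tied to a specific nova-api request
    tag = "/* %s nova-api */" % context.request_id
    rows = session.execute(
        "SELECT uuid FROM instances WHERE project_id = :pid " + tag,
        {'pid': context.project_id})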
> >
> >> External processes we run are another place to look. Again, logging
> >> those processes that take longer than a certain threshold to run might
> >> be good.
> >>
> >> I'd also investigate whether it's some other greenthread blocking the
> >> API request thread from running. What would be really nice would be some
> >> way of logging when a particular greenthread ran for a long time while
> >> other greenthreads were pending.
> >>
> >> It would also be good to see some system level stats and see if it's
> >> just that the system is generally CPU or I/O bound while these slow
> >> calls are happening. Maybe logging CPU, memory, disk and network usage
> >> along with the top few processes every tenth of a second.
> >
> > Running dstat in a window while doing the other thing is fairly easy.
> > Perhaps we should add a dstat screen to devstack?
> >
