One more thing: we can also optimize get_instance_nw_info to skip the rpc, but once again we have to be careful about cases where it is called from a place where we don't have db access. Perhaps we could initially add an extra parameter, local (defaulting to False), that makes it do the db request locally. As we move the calls to get_instance_nw_info up into compute.api, we can add local=True.

Vish

On Jan 10, 2013, at 8:34 AM, Vishvananda Ishaya <vishvananda@gmail.com> wrote:

On Jan 9, 2013, at 6:56 PM, Joe Gordon <jogo@cloudscaling.com> wrote:

> I have started to go through everyone's suggestions, but I thought I would share some more preliminary findings.
>
> Now that millisecond logging is enabled, the logs show a clearer picture of what is going on.
> Here is a stack trace from devstack for the command DisassociateAddress (the slowest REST call triggered by the OpenStack exercises):
>
> http://paste.openstack.org/show/29103/ (from local build)
> http://paste.openstack.org/show/29104/ (from logs.openstack.org)
>
> Note: neither pastebin has the entire RPC message contents in it.
>
> Analysis:
>
> http://paste.openstack.org/show/29103/
>
> Total time: 4.8428030 seconds
>
> * the 3 RPC calls take: 65ms, 50ms and 107ms
> * RPC msg 79bbafe8079841e694640a7435f5d52f takes 1.174 seconds (7.080-5.906) to perform network_api.get_instance_id_by_floating_address, which consists of just 3 DB calls.

This can be optimized heavily (although we should probably make a new method get_instance_uuid… for compatibility):
 a) the note is incorrect, we can return the uuid and skip the last call
 b) the two calls can be converted into a single db call with a join (we only need the instance uuid; rough sketch below)
 c) we can move this up to network.api and skip rpc completely (see my comments below about associate)
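For (b), a single query along these lines would do it. This is only a sketch: the model and column names are approximations and may not match nova.db.sqlalchemy.models exactly.

    # Hypothetical one-query replacement for the three separate lookups.
    from nova.db.sqlalchemy import models

    def instance_uuid_by_floating_address(session, address):
        """Return the instance uuid for a floating address in one query."""
        row = (session.query(models.Instance.uuid).
               join(models.FixedIp,
                    models.FixedIp.instance_uuid == models.Instance.uuid).
               join(models.FloatingIp,
                    models.FloatingIp.fixed_ip_id == models.FixedIp.id).
               filter(models.FloatingIp.address == address).
               first())
        return row[0] if row else None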
<div><font face="arial, helvetica, sans-serif"><span style="font-size:12px"><br></span></font></div><div><font face="arial, helvetica, sans-serif"><span style="font-size:12px">* RPC msg </span><span style="font-size:13px">4c9055aa7a034d6f85e99df7dc0a8fcc takes </span>3.109 seconds (10.191-7.082)<span style="font-size:13px"> to perform network_api.</span><span style="font-size:12px">disassociate_floating_ip</span></font></div>
<div><font face="arial, helvetica, sans-serif"><span style="font-size:12px"><br></span></font></div><div><font face="arial, helvetica, sans-serif"><span style="font-size:12px">* RPC msg </span><span style="font-size:13px">45841e95f05d4003a00bc059fade10b0</span><span style="font-size:12px"> takes </span>0.3359 (10.527-10.191) to perform network_api.get_instance_nw_info</font></div>
<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">* in RPC msg </font><span style="font-family:arial,helvetica,sans-serif;font-size:13px">4c9055aa7a034d6f85e99df7dc</span><span style="font-family:arial,helvetica,sans-serif;font-size:13px">0a8fcc</span><span style="font-family:arial,helvetica,sans-serif">, n-net waits 0.604 seconds (8.410-7.806) </span><span style="font-family:arial,helvetica,sans-serif;font-size:13px">Attempting to grab file lock "iptables", but n-net log shows nothing else is running (bad lock, or lack of logging?) [Note: I added in logging when release a file lock, and nothing is releases the file lock between attempting and got.]</span></div></div></blockquote><div><br></div>The iptables lock is external so it is possible that n-cpu has the lock during this time.</div><div><br><blockquote type="cite"><div dir="ltr">
<div><font face="arial, helvetica, sans-serif"><span style="font-size:13px"><br></span></font></div><div><font face="arial, helvetica, sans-serif"><span style="font-size:13px">* iptables commands take </span>2.52 seconds (10.93-8.41)</font></div></div></blockquote><div><br></div>This seems like the best place to start optimizing. Looks like we might be able to optimize the iptables code a bit. For example we could save and restore all tables at once instead of doing each of them individually. It appears that most of the time is spent doing the actual shell command as opposed to in the python code.</div><div><br></div><div><blockquote type="cite"><div dir="ltr">
</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Further Questions:</div><div class="gmail_extra"><br></div><div class="gmail_extra">* Why is <span style="font-family:arial,helvetica,sans-serif">network_api.</span><span style="font-family:arial,helvetica,sans-serif;font-size:12px">get_instance_id_by_floating_address so slow? Why do we even need it, if its just doing DB calls?</span></div></blockquote><div><br></div>This is probably leftover from when quantum was on the recieving end of the api call. Now that quantum replaces network.api, we could move these calls up to the api layer and save an rpc round trip. There is actually quite a bit of optimization here. Another big slowdown is in multihost mode the message is sent to 'any' network host and then it is rebroadcast to the right host. </div><div>We can actually send the message to the right host on the api side and avoid the rebroadcast. This should save us a lot unnecessary traffic and time. It can be done for associate_floating and disassociate_floating trivially. Allocate and deallocate network can be done as well, although we have to move the calling side up to compute.api where it has db access.</div><div><br></div><div><blockquote type="cite">
<div class="gmail_extra"><font face="arial, helvetica, sans-serif"><span style="font-size:12px">* Why are all three calls synchronous, can we do the iptables commands asynchronously?</span></font></div></blockquote><div><br></div>We may be able to fire off the disassociate at the same time. The nw info call has to happen afterwards<br><blockquote type="cite"><div class="gmail_extra">
<font face="arial, helvetica, sans-serif"><span style="font-size:12px"><br></span></font></div><div class="gmail_extra"><font face="arial, helvetica, sans-serif"><span style="font-size:12px">* How much time is spent in nova.db.api vs what sql itself reports (aka sqlalchemy overhead).</span></font></div></blockquote><div><br></div>When I last looked it was mostly sqlalchemy, especially where it is hydrating multiple models.</div><div><br><blockquote type="cite">
<div class="gmail_extra"><br></div><div class="gmail_extra">Patches for better logging/debugging:</div><div class="gmail_extra"><br></div><div class="gmail_extra">* <a href="https://review.openstack.org/#/c/19354/">https://review.openstack.org/#/c/19354/</a> - properly format milliseconds </div>
<div class="gmail_extra">* <a href="https://review.openstack.org/#/c/19076/">https://review.openstack.org/#/c/19076/</a> - enable mysql slow log in devstack</div><div class="gmail_extra">* <a href="https://review.openstack.org/#/c/19358/">https://review.openstack.org/#/c/19358/</a> - log when release file lock</div>
<div class="gmail_extra"><br></div><div class="gmail_extra">best,</div><div class="gmail_extra">Joe</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Jan 7, 2013 at 12:18 AM, Monty Taylor <span dir="ltr"><<a href="mailto:mordred@inaugust.com" target="_blank">mordred@inaugust.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br>
<br>
On 01/07/2013 07:55 AM, Monty Taylor wrote:
>
>
> On 01/07/2013 06:46 AM, Mark McLoughlin wrote:
>> On Fri, 2013-01-04 at 11:58 -0800, Joe Gordon wrote:
>>> Conclusion
>>> ==========
>>>
>>> Nova REST calls take way too long, even in small scale devstack/tempest
>>> testing. We should spend resources on fixing this for Grizzly. I would
>>> like to start a dialog on how we can fix this and see if anyone wants to
>>> volunteer help. Below are a few possible ideas of things that will help us
>>> make the REST calls faster.
>>>
>>> * Continuously track REST API times of devstack/tempest jobs, to identify
>>> slow API calls and watch for trends. (logstash + statsd?)
>>> * Identify the underlying cause of the slow performance. It looks like
>>> nova.db.api may be partially responsible, but not the only issue? What
>>> else is a bottleneck?
>>> * Enable millisecond level time logging
>>> https://review.openstack.org/#/c/18933/
>>> * Add an option in devstack to enable python profiling? What's the best way to
>>> do this?
>>> * Make more REST API related code asynchronous (via RPC)
>>
>> Summary is we're randomly seeing a multi-second delay in nova-api
>> between the time an HTTP request is received and we send an AMQP message?
>>
>> Suspecting the DB calls is a good first guess. The ability to log slow
>> DB calls would be nice. Perhaps MySQL has something we can use here,
>> assuming we log some sort of client ID on the nova-api side that we can
>> use to correlate with the MySQL logs.
>
> There is, you'll be happy to know, a MySQL feature called the slow query
> log. You can enable it by adding:
>
> log_slow_queries = /var/log/mysql/mysql-slow.log
>
> To the my.cnf. Additionally, you can put in the following:
>
> long_query_time = 1
> log-queries-not-using-indexes
>
> The first is a threshold, in seconds, of how long a query needs to last
> before it's considered slow. The second will cause all queries that did
> not use indexes as part of their run to be logged.
https://review.openstack.org/19075
https://review.openstack.org/19076
> Getting client ids passed to the db log for correlation will be
> trickier. A lot of the big web properties will do tricks in their db
> layer where they will add embedded comments into the sql passed in that
> include information like client ids. I'm not sure how hard that sort of
> tracing would be to do on our end - but turning on the slow query log
> while doing these sorts of things is probably a great starting place.
>
>> External processes we run are also another place to look. Again, logging
>> those processes that take longer than a certain threshold to run might
>> be good.
>>
>> I'd also investigate whether it's some other greenthread blocking the
>> API request thread from running. What would be really nice would be some
>> way of logging when a particular greenthread ran for a long time while
>> other greenthreads are pending.
>>
>> It would also be good to see some system level stats and see if it's
>> just that the system is generally CPU or I/O bound while these slow
>> calls are happening. Maybe logging CPU, memory, disk and network usage
>> along with the top few processes every tenth of a second.
>
> Running dstat in a window while doing the other thing is fairly easy.
> Perhaps we should add a dstat screen to devstack?
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev