The performance of the metadata query with cloud-init has been causing some people problems (it's so slow cloud-init times out!), and has led to the suggestion that we need lots of caching. (My hypothesis is that we don't...)<div>
<br></div><div>By turning on SQL debugging in SQL Alchemy (for which I've proposed a patch for Essex: <a href="https://review.openstack.org/#change,5783">https://review.openstack.org/#change,5783</a>), I was able to capture the SQL statements.</div>
<div><br></div><div>I'm focusing on the SQL statements for the metadata call.</div><div><br></div><div>The code does this:</div><div><br></div><div>1) Checks the cache to see if it has the data</div><div>2) Makes a message-bus call to the network service to get the fixed_ip info from the address</div>
<div>3) Looks up all sort of metadata in the database</div><div>4) Formats the reply</div><div><br></div><div>#1 means that the first call is slower than the others, so we need to focus on the first call.</div><div>#2 could be problematic, if the message queue is overloaded or if the network service is slow to response</div>
<div>#3 could be problematic if the DB isn't working properly</div><div>#4 is hopefully not the problem.</div><div><br></div><div>The relevant SQL log from the API server: <a href="http://paste.openstack.org/show/12109/">http://paste.openstack.org/show/12109/</a></div>
<div><br>And from the network server: <a href="http://paste.openstack.org/show/12116/">http://paste.openstack.org/show/12116/</a></div><div><br></div><div>I've analyzed each of the SQL statements:</div><div><br></div>
<div>API</div><div><br></div><div><div><a href="http://paste.openstack.org/show/12110/">http://paste.openstack.org/show/12110/</a> (Need to check that there isn't a table scan when instance_info_caches is large)</div>
</div><div><a href="http://paste.openstack.org/show/12111/">http://paste.openstack.org/show/12111/</a> Table scan on services table, but this is presumably smallish</div><div><a href="http://paste.openstack.org/show/12112/">http://paste.openstack.org/show/12112/</a> No index. Table scan on s3_images table. Also this table is MyISAM. Also seems to insert rows on the first call (not shown). Evil.</div>
<div><a href="http://paste.openstack.org/show/12113/">http://paste.openstack.org/show/12113/</a> </div><div><a href="http://paste.openstack.org/show/12114/">http://paste.openstack.org/show/12114/</a> block_device_mapping is MyISAM.</div>
<div><br></div><div><br></div><div>Network</div><div><br></div><div><a href="http://paste.openstack.org/show/12117/">http://paste.openstack.org/show/12117/</a></div><div><a href="http://paste.openstack.org/show/12118/">http://paste.openstack.org/show/12118/</a> (Fetch virtual_interface by instance_id)</div>
<div><a href="http://paste.openstack.org/show/12119/">http://paste.openstack.org/show/12119/</a> (Fetch network by id)</div><div><a href="http://paste.openstack.org/show/12120/">http://paste.openstack.org/show/12120/</a> Missing index => table scan on networks. Unnecessary row re-fetch.</div>
<div><a href="http://paste.openstack.org/show/12121/">http://paste.openstack.org/show/12121/</a> Missing index => table scan on virtual_interfaces. Unnecessary row-refetch.</div><div><a href="http://paste.openstack.org/show/12122/">http://paste.openstack.org/show/12122/</a> (Fetch fixed_ips on virtual interface)</div>
<div><a href="http://paste.openstack.org/show/12123/">http://paste.openstack.org/show/12123/</a> Missing index => table scan on networks. Unnecessary row re-fetch. (Double re-fetch. What does it mean?)</div><div><a href="http://paste.openstack.org/show/12124/">http://paste.openstack.org/show/12124/</a> Missing index => table scan on virtual_interfaces. Another re-re-fetch.</div>
<div><a href="http://paste.openstack.org/show/12125/">http://paste.openstack.org/show/12125/</a> Missing index => table scan on fixed_ips (Uh-oh - I hope you didn't allocate a /8!!). We do have this row from the virtual interface lookup; perhaps we could remove this query?</div>
<div><a href="http://paste.openstack.org/show/12126/">http://paste.openstack.org/show/12126/</a></div><div><a href="http://paste.openstack.org/show/12127/">http://paste.openstack.org/show/12127/</a></div><div><br></div><div>
<br></div><div>We still have a bunch of MyISAM tables (at least with a devstack install):</div><div><a href="http://paste.openstack.org/show/12115/">http://paste.openstack.org/show/12115/</a> </div><div><br></div><div><br>
</div><div><br></div><div>As I see it, these are the issues (in sort of priority order):</div><div><br></div><div><b>Critical</b></div><div><br></div><div>Table scan of fixed_ips on the network service (row per IP address?)</div>
<div>Use of MyISAM tables, particularly for s3_images and block_device_mapping</div><div>Table scan of virtual_interfaces (row per instance?)</div><div>Verify that MySQL isn't doing a table scan on <a href="http://paste.openstack.org/show/12110/">http://paste.openstack.org/show/12110/</a> when # of instances is large</div>
<div><div><br></div></div><div><b>Naughty</b></div><div><br></div><div><i>(Mostly because the tables are small)</i></div><div><br></div><div><div>Table scan of s3_images</div><div>Table scan of services</div></div><div>Table scan of networks</div>
<div><br></div><div><b>Low importance</b></div><div><br></div><div><i>(Re-fetches aren't a big deal if the queries are fast)</i></div><div><br></div><div>Row re-fetches & re-re-fetches</div><div><br></div><div><br>
</div><div><div><br></div><div>My install in nowhere near big enough for any of these to actually cause a real problem, so I'd love to get timings / a log from someone that is having a problem. Even the table scan of fixed_ips should be OK if you have enough RAM.</div>
<div><br></div><div>Justin<br><br><br>
</div></div>