[nova][ironic] Lock-related performance issue with update_resources periodic job

Jason Anderson jasonanderson at uchicago.edu
Tue Jun 11 15:43:02 UTC 2019

Hi Surya,

On 5/13/19 3:15 PM, Surya Seetharaman wrote:
We faced the same problem at CERN when we upgraded to Rocky (we have ~2300 nodes on a single compute service), like Eric said. We set [compute]resource_provider_association_refresh to a large value, which definitely helps by stopping the syncing of traits/aggregates and provider-tree cache info in terms of chattiness with placement, but in spite of that it doesn't scale that well for us. We still find the periodic task taking too much time, which causes the lock to hold up the claim for instances in the BUILD state (the exact same problem you described). One way to tackle this, like you said, is to set "update_resources_interval" to a higher value; we were not sure how much out of sync things would get with placement, so it will be interesting to see how this pans out for you. Another way out would be to use multiple compute services and spread the nodes around (though this is also a pain to maintain, IMHO), which is what we are looking into presently.

I wanted to let you know that we've been running this way in production for a few weeks now and it has made a noticeable improvement: instances no longer get stuck in the BUILD stage, pre-networking, for ages. We tracked the improvement by comparing the Nova conductor logs ("Took {seconds} to build the instance" vs. "Took {seconds} to spawn the instance on the hypervisor"); the delta should be as small as possible, and in our case it went from ~30 minutes to ~1 minute. There have been a few cases where a resource provider claim got "stuck", but in practice it has been so infrequent that it may well have other causes. As such, I can recommend increasing the interval significantly. Currently we have it set to 6 hours.
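For anyone who wants to measure the same thing, here is a minimal sketch of how one might compute the build-vs-spawn delta from conductor log lines. The exact log wording varies by Nova version, so the regexes below are assumptions, not the canonical format:

```python
import re

# Hypothetical log line shapes (adjust to your Nova version's actual wording):
#   "Took 1800.52 seconds to build instance."
#   "Took 65.30 seconds to spawn the instance on the hypervisor."
BUILD_RE = re.compile(r"Took (\d+(?:\.\d+)?) seconds to build")
SPAWN_RE = re.compile(r"Took (\d+(?:\.\d+)?) seconds to spawn")

def build_spawn_delta(log_lines):
    """Return (build time - spawn time) in seconds, or None if either
    log line is missing. A small delta means little time was lost to
    lock contention / scheduling overhead before the hypervisor spawn."""
    build = spawn = None
    for line in log_lines:
        m = BUILD_RE.search(line)
        if m:
            build = float(m.group(1))
        m = SPAWN_RE.search(line)
        if m:
            spawn = float(m.group(1))
    if build is None or spawn is None:
        return None
    return build - spawn

lines = [
    "Took 1800.52 seconds to build instance.",
    "Took 65.30 seconds to spawn the instance on the hypervisor.",
]
print(build_spawn_delta(lines))  # ~1735 s of pre-spawn overhead
```

In our case the goal was to watch that delta shrink from roughly 30 minutes toward 1 minute after raising the interval.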

I have not yet looked into bringing in the other Nova patches used at CERN (which are available in Stein). I did take a look at updating the locking mechanism, but don't have anything to show for it yet.
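For anyone following along, the two knobs discussed in this thread live in nova.conf on the compute host. A sketch with illustrative values only (option groups as I understand them for Rocky-era Nova; tune to your own deployment):

```ini
[compute]
# How often (seconds) to refresh traits/aggregates/provider-tree cache
# info from placement; raising it reduces chattiness with placement.
resource_provider_association_refresh = 86400

[DEFAULT]
# Interval (seconds) for the update_available_resource periodic task.
# 21600 = the 6 hours we currently run with.
update_resources_interval = 21600
```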


