Hi,

We have the same issue Jason described: ironic instances take a long time to start being created. This is because the resource tracker takes >30 minutes to cycle (~2500 ironic nodes in one nova-compute). Meanwhile, operations are queued until it finishes.

To speed up the resource tracker we use: https://review.opendev.org/#/c/637225/

We are working on sharding the nova-compute for ironic. I think that is the right way to go.

Considering the experience described by Jason, we have now increased "update_resources_interval" to 24h. Yes, the queueing issue disappeared.

We will report back if we find any unexpected consequences.

Belmiro
CERN

On Tue, Jun 11, 2019 at 5:56 PM Jason Anderson <jasonanderson@uchicago.edu> wrote:
Hi Surya,
On 5/13/19 3:15 PM, Surya Seetharaman wrote:
We faced the same problem at CERN when we upgraded to rocky (we have ~2300 nodes on a single compute), like Eric said. We set [compute]resource_provider_association_refresh to a large value, which definitely helps by stopping the syncing of traits/aggregates and provider tree cache info, and so reduces chattiness with placement. In spite of that, it doesn't scale that well for us: we still find the periodic task taking too much time, which causes the lock to hold up the claim for instances in BUILD state (the exact same problem you described). One way to tackle this, like you said, is to set "update_resources_interval" to a higher value. We were not sure how much out of sync things would get with placement, so it will be interesting to see how this pans out for you. Another way out would be to use multiple computes and spread the nodes around (though this is also a pain to maintain IMHO), which is what we are looking into presently.
I wanted to let you know that we've been running this way in production for a few weeks now and it has made a noticeable improvement: instances are no longer sticking in the "Build" stage, pre-networking, for ages. We were able to track the improvement by comparing two messages in the Nova conductor logs: "Took {seconds} to build the instance" vs "Took {seconds} to spawn the instance on the hypervisor". The delta should be as small as possible, and in our case it went from ~30 minutes to ~1 minute. There have been a few cases where a resource provider claim got "stuck", but in practice it has been so infrequent that it potentially has other causes. As such, I can recommend increasing the interval time significantly. Currently we have it set to 6 hours.
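For anyone who wants to make the same comparison, here is a minimal Python sketch of the build-vs-spawn delta. It is not the script we used. It assumes each relevant log line carries an "[instance: <uuid>]" tag followed by "Took <seconds> seconds to build ..." or "Took <seconds> seconds to spawn ..."; the exact wording differs between Nova releases, so the pattern and the function name build_spawn_deltas are assumptions to adjust for your logs.

```python
import re
from collections import defaultdict

# Assumed line shape (adjust to your release's actual log format):
#   ... [instance: <uuid>] Took <seconds> seconds to build|spawn ...
PATTERN = re.compile(
    r"\[instance: (?P<uuid>[0-9a-f-]{36})\].*?"
    r"Took (?P<secs>\d+(?:\.\d+)?) seconds to (?P<phase>build|spawn)\b"
)


def build_spawn_deltas(lines):
    """Return {instance_uuid: build_seconds - spawn_seconds}.

    Only instances with both a "build" and a "spawn" timing are reported.
    """
    timings = defaultdict(dict)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            phase = match.group("phase")
            timings[match.group("uuid")][phase] = float(match.group("secs"))
    return {
        uuid: t["build"] - t["spawn"]
        for uuid, t in timings.items()
        if "build" in t and "spawn" in t
    }
```

Feeding it the lines of a log file (for example, an open file object) gives one delta per instance; a large delta means the instance spent most of its build time waiting rather than spawning.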
I have not yet looked into bringing in the other Nova patches used at CERN (and available in Stein). I did take a look at updating the locking mechanism, but do not have work to show for this yet.
Cheers,
/Jason