[nova][ironic] Lock-related performance issue with update_resources periodic job
surya.seetharaman9 at gmail.com
Mon May 13 20:15:27 UTC 2019
On Mon, May 13, 2019 at 9:40 PM Jason Anderson <jasonanderson at uchicago.edu> wrote:
> After investigation, the root cause appeared to be contention between the
> update_resources periodic task and the instance claim step. There is one
> semaphore "compute_resources" that is used to control every access within
> the resource_tracker. In our case, what was happening was the
> update_resources job, which runs every minute by default, was constantly
> queuing up accesses to this semaphore, because each hypervisor is updated
> independently, in series. This meant that, for us, each Ironic node was
> being processed and was holding the semaphore during its update (which took
> about 2-5 seconds in practice.) Multiply this by 150 and our update task
> was running constantly. Because an instance claim also needs to access this
> semaphore, this led to instances getting stuck in the "Build" state, after
> scheduling, for tens of minutes on average. There seemed to be some
> probabilistic effect here, which I hypothesize is related to the locking
> mechanism not using a "fair" lock (first-come, first-served) by default.
> Our fix was to drastically increase the interval this task runs at--from
> every 1 minute to every 12 hours. We only provision bare metal, so my
> rationale was that the periodic full resource sync was less important and
> mostly helpful for fixing weird things where Placement's state somehow got
> out of sync with Nova's.
> I'm wondering, after all this, if it makes sense to rethink this
> one-semaphore thing, and instead create a per-hypervisor semaphore when
> doing the resource syncing. I can't think of a reason why the entire set of
> hypervisors needs to be considered as a whole when doing this task, but I
> could very well be missing something.
> *TL;DR*: if you have one nova-compute process managing lots of Ironic
> hypervisors, consider tweaking the update_resources_interval to a higher
> value, especially if you're seeing instances stuck in the Build state for a
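
For readers following along, here is a rough sketch of the contention
described in the quoted message and of the per-hypervisor alternative it
proposes. It is a toy model, not nova code: SingleLockTracker,
PerNodeLockTracker and claim_blocked_during_update are hypothetical names,
and plain threading locks stand in for the "compute_resources" semaphore.
Whether per-node locking is actually safe inside the resource tracker is
exactly the open question above; the sketch only shows why a claim on one
node waits behind an update of another.

```python
import threading


class SingleLockTracker:
    """Toy model: one lock guards every node (like "compute_resources")."""

    def __init__(self):
        self._lock = threading.Lock()

    def lock_for(self, node):
        # Every node shares the same lock.
        return self._lock


class PerNodeLockTracker:
    """Toy model of the proposal: one lock per hypervisor/node."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the dict of locks
        self._locks = {}

    def lock_for(self, node):
        with self._guard:
            return self._locks.setdefault(node, threading.Lock())


def claim_blocked_during_update(tracker, update_node, claim_node):
    """Return True if a claim on claim_node cannot take its lock
    while the periodic task is busy updating update_node."""
    updating = threading.Event()
    release = threading.Event()

    def periodic_update():
        with tracker.lock_for(update_node):
            updating.set()
            release.wait()  # stands in for the slow per-node update

    worker = threading.Thread(target=periodic_update)
    worker.start()
    updating.wait()  # the update now holds its lock

    claim_lock = tracker.lock_for(claim_node)
    acquired = claim_lock.acquire(blocking=False)
    if acquired:
        claim_lock.release()

    release.set()
    worker.join()
    return not acquired


if __name__ == "__main__":
    # Single lock: a claim on node-1 waits behind the update of node-0.
    print(claim_blocked_during_update(SingleLockTracker(), "node-0", "node-1"))
    # Per-node locks: the same claim proceeds immediately.
    print(claim_blocked_during_update(PerNodeLockTracker(), "node-0", "node-1"))
```

With per-node locks a claim still waits if it targets the node currently
being updated, but it no longer queues behind the other nodes in the pass.
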
We faced the same problem at CERN when we upgraded to Rocky (we have ~2300
nodes on a single compute), as Eric said. We set
[compute]resource_provider_association_refresh to a large value, which
definitely helps by stopping the syncing of traits, aggregates and provider
tree cache info, so there is less chatter with placement. In spite of that,
it doesn't scale that well for us. We still find the periodic task taking
too much time, and the lock it holds delays the claim for instances in
BUILD state (the exact same problem you described). One way to tackle this,
like you said, is to set "update_resources_interval" to a higher value. We
were not sure how far out of sync things would get with placement, so it
will be interesting to see how this pans out for you. Another way out would
be to use multiple computes and spread the nodes around (though this is
also a pain to maintain, IMHO), which is what we are looking into
presently.
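
To make the trade-off a bit more concrete, here is a back-of-the-envelope
sketch of how long one update pass holds the lock and how many computes it
would take to keep that under control. The function names and the
busy_fraction parameter are made up for illustration. The model assumes
nodes are spread evenly and that the cost per node stays constant, which
may not hold in practice. The 150 nodes, 2-5 seconds per node and 1 minute
interval are the figures from Jason's mail; the thread gives no per-node
timing for the ~2300-node case, so that one is left out.

```python
import math


def pass_seconds(nodes, seconds_per_node):
    """Time one full update pass spends holding the lock, nodes in series."""
    return nodes * seconds_per_node


def computes_needed(nodes, seconds_per_node, interval_seconds,
                    busy_fraction=1.0):
    """Smallest number of computes such that each compute's pass takes at
    most busy_fraction of the interval (hypothetical sizing rule)."""
    budget = interval_seconds * busy_fraction
    return math.ceil(pass_seconds(nodes, seconds_per_node) / budget)


if __name__ == "__main__":
    # One compute, 150 nodes: a pass takes 300-750 s against a 60 s interval,
    # so the task is effectively always running.
    print(pass_seconds(150, 2), pass_seconds(150, 5))  # 300 750

    # Computes needed just to finish a pass within one interval.
    print(computes_needed(150, 2, 60), computes_needed(150, 5, 60))  # 5 13

    # Leaving half of each interval free for claims (assumed target).
    print(computes_needed(150, 2, 60, busy_fraction=0.5))  # 10
    print(computes_needed(150, 5, 60, busy_fraction=0.5))  # 25
```

Raising the interval instead shrinks the fraction of time the lock is held
on a single compute, at the cost of syncing with placement less often.
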