[nova][ironic] Lock-related performance issue with update_resources periodic job
Hey OpenStackers,

I work on a cloud that allows users to reserve and provision bare metal instances with Ironic. We recently performed a long-overdue upgrade of our core components, all the way from Ocata up through Rocky. During this, we noticed that instance build requests were taking 4-5x (!!) as long as before. We have two deployments, one with ~150 bare metal nodes and another with ~300. Each is managed by a single nova-compute process running the Ironic driver.

After investigation, the root cause appeared to be contention between the update_resources periodic task and the instance claim step. There is one semaphore, "compute_resources", that is used to control every access within the resource_tracker. In our case, the update_resources job, which runs every minute by default, was constantly queuing up accesses to this semaphore, because each hypervisor is updated independently, in series. For us that meant each Ironic node held the semaphore while its update ran (about 2-5 seconds in practice). Multiply this by 150 and our update task was running constantly. Because an instance claim also needs to access this semaphore, instances were getting stuck in the "Build" state, after scheduling, for tens of minutes on average. There seemed to be some probabilistic effect here, which I hypothesize is related to the locking mechanism not using a "fair" (first-come, first-served) lock by default.

Our fix was to drastically increase the interval this task runs at, from every minute to every 12 hours. We only provision bare metal, so my rationale was that the periodic full resource sync was less important and mostly helpful for fixing odd cases where Placement's state somehow got out of sync with Nova's.

I'm wondering, after all this, if it makes sense to rethink this one-semaphore approach and instead create a per-hypervisor semaphore when doing the resource syncing. I can't think of a reason why the entire set of hypervisors needs to be considered as a whole for this task, but I could very well be missing something.

TL;DR: if you have one nova-compute process managing lots of Ironic hypervisors, consider raising update_resources_interval, especially if you're seeing instances stuck in the Build state for a while.

Cheers,

Jason Anderson
Cloud Computing Software Developer
Consortium for Advanced Science and Engineering, The University of Chicago
Mathematics & Computer Science Division, Argonne National Laboratory
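To make the contention pattern concrete, here is a stripped-down Python sketch. This is not actual Nova code: the function names, node names, and timings are stand-ins for illustration only. One lock, standing in for the resource tracker's single "compute_resources" semaphore, is shared by the periodic sweep and the claim path.

import threading
import time

# One lock for the whole compute service, standing in for the
# resource tracker's single "compute_resources" semaphore.
COMPUTE_RESOURCES_LOCK = threading.Lock()

def update_available_resource(nodes):
    # Periodic task: refreshes every node, one at a time, under the lock.
    for node in nodes:
        with COMPUTE_RESOURCES_LOCK:
            # The real code inspects the node and talks to Placement;
            # here we just sleep for the ~2-5 s each node took in practice.
            time.sleep(2)

def instance_claim(node):
    # Build path: must take the same lock before resources can be claimed.
    with COMPUTE_RESOURCES_LOCK:
        print("claimed resources on %s" % node)

# With ~150 nodes at 2-5 s each, one sweep takes roughly 5-12 minutes, so a
# 60-second periodic interval keeps the lock busy essentially all the time
# and instance_claim() queues behind it.
nodes = ["ironic-node-%d" % i for i in range(150)]
threading.Thread(target=update_available_resource, args=(nodes,), daemon=True).start()
time.sleep(1)
instance_claim("ironic-node-42")  # blocks until the sweep releases the lock

And because the default lock is not fair, a claim that arrives while the sweep is mid-pass is not guaranteed to get the lock at the next release, which would explain the probabilistic behaviour described above.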
Jason-

You may find this article interesting [1]. It isn't clear whether your issue is the same as CERN's. But it would be interesting to know whether setting [compute]resource_provider_association_refresh [2] to a very large number (while leaving your periodic interval at its default) also mitigates the issue.

Thanks,
efried

[1] https://techblog.web.cern.ch/techblog/post/placement-requests/
[2] https://docs.openstack.org/nova/latest/configuration/config.html#compute.res...
Hi Eric, thanks, that's very useful reading. I suspect the root issue is the same, as this isn't specific to Ironic per se, but rather is linked to a high number of hypervisors managed by one compute service. In our case, Placement was able to keep up just fine (though raising this job interval also lowered the number of requests to Placement significantly). My suspicion was that it was less about load on Placement and more about this lock contention. I will have to try pulling in those patches to test that.

Cheers,
/Jason
Hi Jason,
We faced the same problem at CERN when we upgraded to Rocky (we have ~2300 nodes on a single compute). Like Eric said, we set [compute]resource_provider_association_refresh to a large value, which definitely helps by stopping the syncing of traits, aggregates, and provider tree cache info and so cuts the chattiness with Placement, but in spite of that it doesn't scale that well for us. We still find the periodic task taking too much time, which causes the locking to hold up the claim for instances in the BUILD state (the exact same problem you described). One way to tackle this, like you said, is to set update_resources_interval to a higher value; we were not sure how much out of sync things would get with Placement, so it will be interesting to see how this pans out for you. Another way out would be to use multiple computes and spread the nodes around (though this is also a pain to maintain, IMHO), which is what we are looking into presently.

--
Regards,
Surya.
Hi Surya,

I wanted to let you know that we've been running this way in production for a few weeks now and it's been a noticeable improvement: instances are no longer sticking in the "Build" stage, pre-networking, for ages. We were able to track the improvement by comparing the Nova conductor logs ("Took {seconds} to build the instance" vs. "Took {seconds} to spawn the instance on the hypervisor"); the delta should be as small as possible, and in our case it went from ~30 minutes to ~1 minute. There have been a few cases where a resource provider claim got "stuck", but in practice it has been so infrequent that it may well have other causes. As such, I can recommend increasing the interval time significantly. Currently we have it set to 6 hours.

I have not yet looked into bringing in the other Nova patches used at CERN (and available in Stein). I did take a look at updating the locking mechanism, but do not have work to show for this yet.

Cheers,
/Jason
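As a rough illustration of that measurement, one might pull the build-vs-spawn delta out of the logs with something like the sketch below. It assumes the usual "[instance: <uuid>] Took N.NN seconds to ..." formatting; the regexes are guesses and would need adjusting to whatever a given deployment actually logs.

import re
import sys
from collections import defaultdict

SPAWN_RE = re.compile(r"\[instance: (?P<uuid>[0-9a-f-]+)\].*Took (?P<secs>[\d.]+) seconds to spawn")
BUILD_RE = re.compile(r"\[instance: (?P<uuid>[0-9a-f-]+)\].*Took (?P<secs>[\d.]+) seconds to build")

def build_spawn_deltas(log_path):
    """Return build-minus-spawn seconds per instance found in the log."""
    times = defaultdict(dict)
    with open(log_path) as f:
        for line in f:
            for key, pattern in (("spawn", SPAWN_RE), ("build", BUILD_RE)):
                m = pattern.search(line)
                if m:
                    times[m.group("uuid")][key] = float(m.group("secs"))
    # The delta is time spent building but not spawning, i.e. mostly waiting
    # (in the case described here, waiting on the resource tracker lock).
    return {uuid: t["build"] - t["spawn"]
            for uuid, t in times.items()
            if "build" in t and "spawn" in t}

if __name__ == "__main__":
    deltas = build_spawn_deltas(sys.argv[1])
    for uuid, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
        print("%s  %8.1f s spent outside spawn" % (uuid, delta))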
Hi,

We do have the issue of Ironic instances taking a lot of time to start being created (the same one Jason described). This is because the resource tracker takes >30 minutes to cycle (~2500 Ironic nodes in one nova-compute). Meanwhile, operations are queued until it finishes.

To speed up the resource tracker we use: https://review.opendev.org/#/c/637225/

We are working on sharding the nova-compute service for Ironic. I think that is the right way to go.

Considering the experience described by Jason, we have now increased update_resources_interval to 24h. Yes, the queueing issue disappeared. We will report back if we find some weird unexpected consequence.

Belmiro
CERN
Ah heck, I had totally forgotten about that patch. If it's working for you, let me get it polished up and merged. We could probably justify backporting it too. Matt?

efried
Sure - get a bug opened for it; extra points if CERN can provide some before/after numbers with the patch applied to help justify it.

From skimming the commit message, if the only side effect would be for sharing providers, which we don't really support yet, then backports seem OK.

--
Thanks,
Matt
Bug already associated with the patch. I'll work on this next week.

efried
On Mon, May 13, 2019 at 9:40 PM Jason Anderson <jasonanderson@uchicago.edu> wrote:
I'm wondering, after all this, if it makes sense to rethink this one-semaphore thing, and instead create a per-hypervisor semaphore when doing the resource syncing. I can't think of a reason why the entire set of hypervisors needs to be considered as a whole when doing this task, but I could very well be missing something.
While theoretically this would be ideal, I am not sure how the COMPUTE_RESOURCE_SEMAPHORE can be tweaked into a per-hypervisor (for Ironic) semaphore, since it's ultimately on a single compute service's resource tracker, unless I am missing something obvious. Maybe the Nova experts who know more about this could shed some light.

--
Regards,
Surya.
I would think it would just be a matter of locking on the nodename. That would have the same effect for a non-Ironic compute service, where the driver should only be reporting a single nodename. But for a compute service managing Ironic nodes, it would be more like a per-instance lock, since the nodes are 1:1 with the instances managed on that host. Having said all that, the devil is in the details (and in trying to refactor that very old and crusty RT code).

--
Thanks,
Matt
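For illustration only, here is a minimal Python sketch of the "lock on the nodename" idea. It is not a real Nova patch; the class and function names are made up, and the real sync and claim work is reduced to placeholders. The point is simply that a sweep over node A would no longer block a claim on node B.

import threading
from collections import defaultdict

class PerNodeLocks(object):
    """Hands out one lock per nodename instead of one global semaphore."""

    def __init__(self):
        self._guard = threading.Lock()            # protects the registry itself
        self._locks = defaultdict(threading.Lock)

    def for_node(self, nodename):
        with self._guard:
            return self._locks[nodename]

NODE_LOCKS = PerNodeLocks()

def refresh_inventory(nodename):
    pass  # placeholder for the real periodic sync work on one node

def allocate(nodename, instance):
    pass  # placeholder for the real resource claim

def update_node(nodename):
    # The periodic sweep only serializes work on this one node...
    with NODE_LOCKS.for_node(nodename):
        refresh_inventory(nodename)

def claim(nodename, instance):
    # ...so a claim on a different node no longer waits behind the whole sweep.
    with NODE_LOCKS.for_node(nodename):
        allocate(nodename, instance)

The catch, per the discussion above, is that the resource tracker also maintains per-host state shared across all of its nodes, and any access to that shared state would still need a coarser lock, which is presumably where the "devil in the details" comes in.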
participants (5)
- Belmiro Moreira
- Eric Fried
- Jason Anderson
- Matt Riedemann
- Surya Seetharaman