[nova][ironic][ptg] Resource tracker scaling issues
* COMPUTE_RESOURCE_SEMAPHORE blocks instance creation on all nodes (on the same host) while _update_available_resource runs on all nodes. On 3500 baremetal nodes _update_available_resource takes 1.5 hours.
* Do we still need the _update_available_resource periodic task to run for ironic nodes?
* Reduce the scope of the COMPUTE_RESOURCE_SEMAPHORE lock
  * https://review.opendev.org/#/c/682242/
  * https://review.opendev.org/#/c/677790/
* Changing a locking scheme is frightening => we need more testing
Agreement:
* Do a tempest test with a lot of fake ironic node records, to have a way to test whether changing the locking scheme breaks anything
* Log a bug and propose a patch for having a per-node lock instead of the same object for all the ResourceTrackers (a rough sketch of the idea follows below)
* See also whether concurrency helps
* Propose a spec if you really want to pursue the idea of being somehow inconsistent with data by not having a lock
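To make the per-node lock item concrete, here is a minimal Python sketch (purely illustrative, not nova's actual code; all names are made up) of replacing the single COMPUTE_RESOURCE_SEMAPHORE with one lock per node:

    import threading
    from collections import defaultdict

    # Illustrative only: one lock per compute node instead of the single
    # module-level COMPUTE_RESOURCE_SEMAPHORE, so a slow periodic update
    # of one ironic node no longer blocks claims on the other nodes.
    _node_locks = defaultdict(threading.Lock)  # nodename -> lock
    _guard = threading.Lock()                  # protects the dict itself

    def _lock_for(nodename):
        with _guard:
            return _node_locks[nodename]

    def update_available_resource(nodename):
        with _lock_for(nodename):
            pass  # collect inventory and report to placement, this node only

    def instance_claim(nodename, instance):
        with _lock_for(nodename):
            pass  # claim resources for the instance, this node only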
Cheers, gibi
On 11/10/2019 10:44 AM, Balázs Gibizer wrote:
On 3500 baremetal nodes _update_available_resource takes 1.5 hours.
Why have a single nova-compute service manage this many nodes? Or even 1000?
Why not try to partition things a bit more reasonably like a normal cell where you might have ~200 nodes per compute service host (I think CERN keeps their cells to around 200 physical compute hosts for scaling)?
That way you can also leverage the compute service hashring / failover feature for HA?
I realize the locking stuff is not great, but at what point is it unreasonable to expect a single compute service to manage that many nodes/instances?
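(For anyone who hasn't looked at the hashring feature: below is a toy consistent-hash ring, just to illustrate how several nova-compute services can deterministically split ironic nodes among themselves; nova's real implementation, borrowed from ironic, is more involved.)

    import bisect
    import hashlib

    # Toy ring: each ironic node gets one stable "owner" among the running
    # nova-compute services, and ownership rebalances if a service dies.
    class HashRing:
        def __init__(self, hosts, replicas=64):
            self._ring = sorted(
                (self._hash('%s-%d' % (host, i)), host)
                for host in hosts for i in range(replicas))
            self._keys = [key for key, _ in self._ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def host_for(self, nodename):
            idx = bisect.bisect(self._keys, self._hash(nodename)) % len(self._keys)
            return self._ring[idx][1]

    ring = HashRing(['compute1', 'compute2', 'compute3'])
    print(ring.host_for('ironic-node-0042'))  # stable until membership changes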
Hi Matt,
On 10.11.19 22:07, Matt Riedemann wrote:
On 11/10/2019 10:44 AM, Balázs Gibizer wrote:
On 3500 baremetal nodes _update_available_resource takes 1.5 hours.
Why have a single nova-compute service manage this many nodes? Or even 1000?
Why not try to partition things a bit more reasonably like a normal cell where you might have ~200 nodes per compute service host (I think CERN keeps their cells to around 200 physical compute hosts for scaling)?
That way you can also leverage the compute service hashring / failover feature for HA?
I realize the locking stuff is not great, but at what point is it unreasonable to expect a single compute service to manage that many nodes/instances?
I agree that using sharding and/or multiple cells to manage that many nodes is sensible. One reason we haven't done it yet is that we got away with this very simple setup so far ;)
Sharding with and/or within cells will help to some degree (and we are actively looking into this as you probably know), but I think that should not stop us from checking if there are algorithmic improvements (e.g. when collecting the data), or if moving to a different locking granularity or even parallelising the update are feasible additional improvements.
Cheers, Arne
-- Arne Wiebalck CERN IT
Sharding with and/or within cells will help to some degree (and we are actively looking into this as you probably know), but I think that should not stop us from checking if there are algorithmic improvements (e.g. when collecting the data), or if moving to a different locking granularity or even parallelising the update are feasible additional improvements.
All of that code was designed around one node per compute host. In the ironic case it was expanded (hacked) to support N where N is not huge. Giving it a huge number, and using a driver where nodes go into maintenance/cleaning for long periods of time is asking for trouble.
Given there is only one case where N can legitimately be greater than one, I'm really hesitant to back a proposal to redesign it for large values of N.
Perhaps we as a team just need to document what sane, tested, and expected-to-work values for N are?
--Dan
On Mon, Nov 11, 2019 at 4:05 PM Dan Smith dms@danplanet.com wrote:
All of that code was designed around one node per compute host. In the ironic case it was expanded (hacked) to support N where N is not huge. Giving it a huge number, and using a driver where nodes go into maintenance/cleaning for long periods of time is asking for trouble.
Given there is only one case where N can legitimately be greater than one, I'm really hesitant to back a proposal to redesign it for large values of N.
Perhaps we as a team just need to document what sane, tested, and expected-to-work values for N are?
What we discussed at the PTG was the fact that we only have one global semaphore for this module, but we have N ResourceTracker python objects (where N is the number of Ironic nodes per compute service). As reported by CERN, it looks like this semaphore blocks when updating periodically, so we basically said this could be done as a bugfix, since we could create N semaphores instead. That said, as it could have some problems, we want to make sure the change is tested not only in the gate but also directly by CERN.
Another discussion was about having more than one thread for the compute service (i.e. N threads), but my opinion was that we should first look at the above before discussing any other approach.
-S
On Sun, 10 Nov 2019, Matt Riedemann wrote:
On 11/10/2019 10:44 AM, Balázs Gibizer wrote:
On 3500 baremetal nodes _update_available_resource takes 1.5 hours.
Why have a single nova-compute service manage this many nodes? Or even 1000?
Why not try to partition things a bit more reasonably like a normal cell where you might have ~200 nodes per compute service host (I think CERN keeps their cells to around 200 physical compute hosts for scaling)?
Without commenting on the efficacy of doing things this way, I can report that 1000 (or even 3500) instances (not nodes) is a thing that can happen in some openstack + vsphere setups and tends to exercise some of the same architectural problems that a lots-of-ironic (nodes) setup encounters.
As far as I can tell the root architecture problem is:
a) there are lots of loops
b) there is an expectation that those loops will have a small number of iterations

(b) is generally true for a run-of-the-mill KVM setup, but not otherwise.
(b) not being true in other contexts creates an impedance mismatch that is hard to overcome without doing at least one of the two things suggested elsewhere in this thread:
1. manage fewer pieces per nova-compute (Matt)
2. "algorithmic improvement" (Arne)
On 2, I wonder if there's been any exploration of using something like a circular queue and time-bounding the periodic jobs? Or using separate processes? For the ironic and vsphere contexts, increased CPU usage by the nova-compute process does not impact the workload resources, so parallelization is likely a good option.
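To make the circular-queue idea concrete, a rough sketch (hypothetical, not an actual patch):

    import time
    from collections import deque

    # Hypothetical: on each periodic tick, only update as many nodes as fit
    # into a fixed time budget, and resume where we left off on the next
    # tick, instead of walking all 3500 nodes in one go under the big lock.
    class BoundedUpdater:
        def __init__(self, nodenames, budget_seconds=60.0):
            self._queue = deque(nodenames)
            self._budget = budget_seconds

        def run_tick(self, update_one_node):
            deadline = time.monotonic() + self._budget
            while self._queue and time.monotonic() < deadline:
                node = self._queue.popleft()
                update_one_node(node)     # e.g. the per-node resource update
                self._queue.append(node)  # rotate to the back of the queue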
On 11/11/2019 7:03 AM, Chris Dent wrote:
Or using separate processes? For the ironic and vsphere contexts, increased CPU usage by the nova-compute process does not impact the workload resources, so parallelization is likely a good option.
I don't know how much it would help - someone would have to actually test it out and get metrics - but one easy win might just be using a thread or process executor pool here [1] so that N compute nodes could be processed through the update_available_resource periodic task concurrently, maybe $ncpu or some factor thereof. By default make it serialized for backward compatibility and non-ironic deployments. Making that too highly concurrent could have negative impacts on other things running on that host, like the neutron agent, or potentially storming conductor/rabbit with a ton of DB requests from that compute.
That doesn't help with the scenario that the big COMPUTE_RESOURCE_SEMAPHORE lock is held by the periodic task while spawning, moving, or deleting an instance that also needs access to the big lock to update the resource tracker, but baby steps if any steps in this area of the code would be my recommendation.
[1] https://github.com/openstack/nova/blob/20.0.0/nova/compute/manager.py#L8629
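Roughly what I mean, as a plain-Python sketch (ignoring nova's eventlet model; the function and call names are illustrative, not the real manager code):

    from concurrent.futures import ThreadPoolExecutor

    # Illustrative only: fan the per-node updates out over a small pool.
    # With max_workers=1 this degrades to today's serialized behaviour.
    def update_available_resource(context, nodenames, rt, max_workers=1):
        def _update(nodename):
            rt.update_available_resource(context, nodename)  # hypothetical

        if max_workers <= 1:
            for nodename in nodenames:  # backward-compatible serial path
                _update(nodename)
        else:
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                # consume the results so worker exceptions are re-raised
                list(pool.map(_update, nodenames))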
Hi, using several cells for the Ironic deployment would be great, however it doesn't work with the current architecture. The nova ironic driver gets all the nodes available in Ironic. This means that if we have several cells, all of them will report the same nodes! The other possibility is to have a dedicated Ironic instance per cell, but in this case it will be very hard to manage a large deployment.
What we are trying is to shard the ironic nodes between several nova-computes. A nova/ironic deployment supports several nova-computes, and it would be great if the RT node cycle were sharded between them.
But anyway, this will also require speeding up the big lock. It would be great if a compute node could handle more than 500 nodes. Considering our use case: 15k/500 = 30 compute nodes.
Belmiro CERN
Hi, using several cells for the Ironic deployment would be great, however it doesn't work with the current architecture. The nova ironic driver gets all the nodes available in Ironic. This means that if we have several cells, all of them will report the same nodes! The other possibility is to have a dedicated Ironic instance per cell, but in this case it will be very hard to manage a large deployment.
That's a problem for more reasons than just your scale. However, doesn't this solve that problem?
https://specs.openstack.org/openstack/nova-specs/specs/stein/implemented/iro...
--Dan
Dan Smith just pointed me to the conductor groups that were added in Stein. https://specs.openstack.org/openstack/nova-specs/specs/stein/implemented/iro... This is an interesting way to partition the deployment, much better than the multiple nova-computes setup.
Thanks, Belmiro CERN
On Tue, Nov 12, 2019 at 11:38 AM Belmiro Moreira < moreira.belmiro.email.lists@gmail.com> wrote:
Dan Smith just pointed me to the conductor groups that were added in Stein.
https://specs.openstack.org/openstack/nova-specs/specs/stein/implemented/iro... This is an interesting way to partition the deployment, much better than the multiple nova-computes setup.
Just a note, they aren't mutually exclusive. You can run multiple nova-computes to manage a single conductor group, whether for HA or because you're using groups for some other construct (cells, racks, halls, network zones, etc) which you want to shard further.
// jim
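For reference, wiring up one such shard might look roughly like this in nova.conf (option names come from the Stein conductor-groups spec; the group name and peer hostnames are made up):

    [ironic]
    # only manage ironic nodes whose conductor_group matches this key
    partition_key = rack-a
    # hostnames of the nova-compute services sharing this partition
    peer_list = compute-ra-1,compute-ra-2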
Great! Thanks Jim.
I will later report our experience with conductor groups.
Belmiro CERN
participants (8)
- Arne Wiebalck
- Balázs Gibizer
- Belmiro Moreira
- Chris Dent
- Dan Smith
- Jim Rollenhagen
- Matt Riedemann
- Sylvain Bauza