<div dir="ltr">Hi,<div>using several cells for the Ironic deployment would be great however it doesn't work with the current architecture.</div><div>The nova ironic driver gets all the nodes available in Ironic. This means that if we have several cells all of them will report the same nodes!</div><div>The other possibility is to have a dedicated Ironic instance per cell, but in this case it will be very hard to manage a large deployment.</div><div><br></div><div>What we are trying is to shard the ironic nodes between several nova-computes.</div><div>nova/ironic deployment supports several nova-computes and it will be great if the RT nodes cycle is sharded between them.</div><div><br></div><div>But anyway, this will also require speeding up the big lock.</div><div>It would be great if a compute node can handle more than 500 nodes.</div><div>Considering our use case: 15k/500 = 30 compute nodes.</div><div><br></div><div>Belmiro</div><div>CERN</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 11, 2019 at 9:13 PM Matt Riedemann <<a href="mailto:mriedemos@gmail.com">mriedemos@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">On 11/11/2019 7:03 AM, Chris Dent wrote:<br>
> > Or using
> > separate processes? For the ironic and vsphere contexts, increased
> > CPU usage by the nova-compute process does not impact the
> > workload resources, so parallelization is likely a good option.
>
> I don't know how much it would help - someone would have to actually
> test it out and get metrics - but one easy win might just be using a
> thread or process executor pool here [1] so that N compute nodes could
> be processed through the update_available_resource periodic task
> concurrently, maybe $ncpu or some factor thereof. By default, make it
> serialized for backward compatibility and non-ironic deployments. Making
> that too highly concurrent could have negative impacts on other things
> running on that host, like the neutron agent, or potentially storming
> conductor/rabbit with a ton of DB requests from that compute.
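
For what it's worth, here is a rough, purely illustrative sketch of how I read that suggestion: run the per-node part of the periodic task through a bounded pool instead of a plain for loop, keeping today's serialized behaviour as the default. The function and parameter names are mine, not nova's, and nova would presumably use its own executor machinery rather than concurrent.futures directly.

# Illustrative only -- not a patch against nova. The idea: process the
# nodes handled by this compute service concurrently, but keep the
# current serialized behaviour as the default.
import concurrent.futures

def update_available_resources(context, nodenames, update_one_node,
                               max_workers=1):
    # update_one_node(context, nodename) stands in for whatever the
    # compute manager does today for a single node (driver inventory,
    # resource tracker update, placement sync).
    if max_workers <= 1:
        # Default: serialized, for backward compatibility and for
        # non-ironic deployments with a single node per compute.
        for nodename in nodenames:
            update_one_node(context, nodename)
        return

    # Bound the pool (e.g. to $ncpu or a small factor of it) so a host
    # with hundreds of Ironic nodes doesn't storm conductor/rabbit with
    # hundreds of simultaneous DB requests.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(update_one_node, context, n)
                   for n in nodenames]
        for fut in concurrent.futures.as_completed(futures):
            fut.result()  # re-raise per-node failures instead of hiding them
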
>
> That doesn't help with the scenario where the big
> COMPUTE_RESOURCE_SEMAPHORE lock is held by the periodic task while
> spawning, moving, or deleting an instance that also needs access to the
> big lock to update the resource tracker, but baby steps, if any steps in
> this area of the code, would be my recommendation.
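
To make that contention concrete, here is a stripped-down illustration (not the actual resource tracker code, and the names are simplified): both the periodic task and the instance operations serialize on the same lock, so a slow periodic pass over many Ironic nodes blocks spawns/moves/deletes on that host, and vice versa.

from oslo_concurrency import lockutils

COMPUTE_RESOURCE_SEMAPHORE = "compute_resources"

@lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def _update_available_resource(nodename):
    # Periodic task path: with hundreds of Ironic nodes this is entered
    # once per node, so the lock stays busy for a long time overall.
    pass

@lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def instance_claim(instance_uuid, nodename):
    # Spawn/move/delete path: waits for the same lock even when it is
    # touching a completely different node than the periodic task.
    pass
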
>
> [1] https://github.com/openstack/nova/blob/20.0.0/nova/compute/manager.py#L8629
>
> --
>
> Thanks,
>
> Matt
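
P.S. As promised above, a rough, purely illustrative sketch of what I mean by sharding the Ironic nodes between several nova-computes. The host names and the naive modulo mapping below are made up for the example; a real implementation would want a proper consistent hash ring (so nodes don't all get reshuffled when a compute service joins or leaves) plus handling for computes going down.

import hashlib

def pick_compute(node_uuid, compute_hosts):
    # Map an Ironic node UUID to one of the nova-compute hosts so that
    # each compute only runs the RT node cycle for its own subset.
    digest = int(hashlib.sha256(node_uuid.encode()).hexdigest(), 16)
    return compute_hosts[digest % len(compute_hosts)]

# Our use case: ~15k nodes / 500 nodes per compute = 30 nova-computes.
hosts = ["compute-%02d" % i for i in range(30)]
print(pick_compute("7a5c8f2e-1b7e-4a1e-9c8e-0f2d3c4b5a69", hosts))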