[nova][ironic][ptg] Resource tracker scaling issues
arne.wiebalck at cern.ch
Mon Nov 11 10:19:59 UTC 2019
On 10.11.19 22:07, Matt Riedemann wrote:
> On 11/10/2019 10:44 AM, Balázs Gibizer wrote:
>> On 3500 baremetal nodes _update_available_resource takes 1.5 hours.
> Why have a single nova-compute service manage this many nodes? Or even
> Why not try to partition things a bit more reasonably like a normal cell
> where you might have ~200 nodes per compute service host (I think CERN
> keeps their cells to around 200 physical compute hosts for scaling)?
> That way you can also leverage the compute service hashring / failover
> feature for HA?
> I realize the locking stuff is not great, but at what point is it
> unreasonable to expect a single compute service to manage that many
> nodes?
I agree that using sharding and/or multiple cells to manage that many
nodes is sensible. One reason we haven't done it yet is that we have
gotten away with this very simple setup so far ;)
Sharding with and/or within cells will help to some degree (and we are
actively looking into this as you probably know), but I think that
should not stop us from checking if there are algorithmic improvements
(e.g. when collecting the data), or if moving to a different locking
granularity or even parallelising the update are feasible additional
improvements.
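
To make the last two ideas a bit more concrete (a finer locking
granularity and a parallelised update), here is a minimal Python sketch.
It is not nova's resource tracker code: NodeTracker, update_node,
update_all and the fetch callback are hypothetical names made up for
illustration, and the worker count is an arbitrary assumption. The point
is only that one lock per node, rather than one lock for the whole
update, lets the slow per-node data collection run concurrently.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class NodeTracker:
    """Toy stand-in for a resource tracker (hypothetical, not nova code)."""

    def __init__(self):
        # Protects the lock table itself, held only very briefly.
        self._locks_guard = threading.Lock()
        # One lock per node instead of a single lock for all nodes.
        self._node_locks = defaultdict(threading.Lock)
        self.inventory = {}

    def _lock_for(self, node_id):
        # Ensure two threads never create different locks for one node.
        with self._locks_guard:
            return self._node_locks[node_id]

    def update_node(self, node_id, fetch):
        # Collect the data outside any lock; this is the slow part.
        data = fetch(node_id)
        # Only the per-node state change is serialised.
        with self._lock_for(node_id):
            self.inventory[node_id] = data

    def update_all(self, node_ids, fetch, workers=8):
        # Fan the per-node updates out over a small thread pool.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() waits for completion and re-raises worker errors.
            list(pool.map(lambda n: self.update_node(n, fetch), node_ids))


tracker = NodeTracker()
tracker.update_all(range(100), lambda n: {"vcpus": n})
```

Whether something like this pays off in practice would depend on where
the time is actually spent (data collection vs. holding the lock), which
is exactly what would need to be checked first.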