Hi Matt,
On 10.11.19 22:07, Matt Riedemann wrote:
On 11/10/2019 10:44 AM, Balázs Gibizer wrote:
On 3500 baremetal nodes, _update_available_resource takes 1.5 hours.
Why have a single nova-compute service manage this many nodes? Or even 1000?
Why not try to partition things a bit more reasonably like a normal cell where you might have ~200 nodes per compute service host (I think CERN keeps their cells to around 200 physical compute hosts for scaling)?
That way you can also leverage the compute service hashring / failover feature for HA?
I realize the locking stuff is not great, but at what point is it unreasonable to expect a single compute service to manage that many nodes/instances?
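For readers not familiar with the hashring feature Matt mentions: nodes are mapped onto the set of compute services via consistent hashing, so when one service fails only its nodes are remapped. A minimal illustration (toy code using hashlib, not nova's actual implementation; the service and node names are made up):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a stable point on the ring via md5."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each service is placed on the ring at
    several points; a node belongs to the next service point clockwise
    from its own hash. Removing a service only moves that service's
    nodes, which is what makes failover cheap."""

    def __init__(self, services, replicas=16):
        # Multiple points per service smooth out the load balance.
        self._ring = sorted(
            (_hash(f"{svc}-{i}"), svc)
            for svc in services
            for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def service_for(self, node_id: str) -> str:
        """Walk clockwise from the node's hash to the next service point."""
        idx = bisect.bisect(self._points, _hash(node_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["compute-1", "compute-2", "compute-3"])
owner = ring.service_for("ironic-node-0042")
```

The mapping is deterministic, so every compute service can compute it locally without coordination.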
I agree that using sharding and/or multiple cells to manage that many nodes is sensible. One reason we haven't done it yet is that we got away with this very simple setup so far ;)
Sharding with and/or within cells will help to some degree (and we are actively looking into this, as you probably know), but I think that should not stop us from checking whether there are algorithmic improvements (e.g. when collecting the data), or whether moving to a finer locking granularity, or even parallelising the update, would be feasible additional improvements.
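To illustrate that last point: with one lock per node instead of a single big lock, the per-node updates could run concurrently on a bounded thread pool. A sketch only, not nova's code; `update_node` is a hypothetical stand-in for the per-node part of _update_available_resource:

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# One lock per node rather than a single global lock: updates for
# different nodes no longer serialise behind each other. (defaultdict
# lock creation is fine under the GIL for a sketch like this.)
_node_locks = defaultdict(threading.Lock)

def update_node(node_id: str) -> str:
    """Hypothetical per-node update: gather this node's inventory and
    report it, holding only this node's lock."""
    with _node_locks[node_id]:
        # ... collect resources for node_id, push to placement ...
        return f"{node_id}: updated"

def update_available_resource(node_ids, max_workers=8):
    """Fan the per-node updates out over a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(update_node, node_ids))

results = update_available_resource([f"node-{i}" for i in range(20)])
```

Even with a modest worker count, the wall-clock time for the whole pass would drop roughly by that factor, as long as the per-node work really is independent.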
Cheers, Arne
-- Arne Wiebalck CERN IT