[nova][ironic][ptg] Resource tracker scaling issues

Arne Wiebalck arne.wiebalck at cern.ch
Mon Nov 11 10:19:59 UTC 2019


Hi Matt,

On 10.11.19 22:07, Matt Riedemann wrote:
> On 11/10/2019 10:44 AM, Balázs Gibizer wrote:
>> On 3500 baremetal nodes _update_available_resource takes 1.5 hours.
> 
> Why have a single nova-compute service manage this many nodes? Or even 
> 1000?
> 
> Why not try to partition things a bit more reasonably like a normal cell 
> where you might have ~200 nodes per compute service host (I think CERN 
> keeps their cells to around 200 physical compute hosts for scaling)?
> 
> That way you can also leverage the compute service hashring / failover 
> feature for HA?
> 
> I realize the locking stuff is not great, but at what point is it 
> unreasonable to expect a single compute service to manage that many 
> nodes/instances?
> 

I agree that using sharding and/or multiple cells to manage that many
nodes is sensible. One reason we haven't done it yet is that we got
away with this very simple setup so far ;)

Sharding across and/or within cells will help to some degree (and we are
actively looking into this, as you probably know), but I think that
should not stop us from checking whether there are algorithmic
improvements (e.g. in how the data is collected), or whether moving to a
finer locking granularity or even parallelising the update are feasible
additional improvements.
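To make the locking-granularity idea concrete, here is a minimal sketch (not actual nova code; the class and method names are hypothetical) of the shape such a change could take: instead of refreshing every node serially under one global semaphore, each node gets its own lock, the slow inventory collection happens outside any lock, and the per-node updates run in parallel.

```python
# Hypothetical sketch, NOT nova's implementation: per-node locks plus a
# thread pool, instead of one big lock held across the whole update.
import threading
from concurrent.futures import ThreadPoolExecutor


class ShardedResourceTracker:
    def __init__(self, nodes):
        self.inventory = {n: None for n in nodes}
        # One lock per node instead of a single global semaphore.
        self._locks = {n: threading.Lock() for n in nodes}

    def _collect(self, node):
        # Placeholder for the (slow) per-node inventory query.
        return {"node": node, "vcpus": 0}

    def _update_one(self, node):
        data = self._collect(node)   # slow part, done outside any lock
        with self._locks[node]:      # brief critical section per node
            self.inventory[node] = data

    def update_available_resource(self, max_workers=8):
        # Parallelise the per-node updates; nodes no longer block each other.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(self._update_one, list(self.inventory)))


rt = ShardedResourceTracker([f"node-{i}" for i in range(100)])
rt.update_available_resource()
```

With 3500 nodes, the win would come from the collection step no longer being serialised behind a single lock; whether that is safe for the real resource tracker's shared state is exactly the question to investigate.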

Cheers,
  Arne

--
Arne Wiebalck
CERN IT



