Yes, totally agree with that. On our side we are used to monitoring the number of neutron ports (and especially the number of ports in BUILD state). As an instance usually has one port in our cloud, the number of instances is close to the number of ports. Regarding cells v2, we are mostly struggling on the Neutron side, so cells are not helping us.
--
Arnaud Morin

On 03.02.21 - 09:05, David Ivey wrote:
I am not sure simply going off the number of compute nodes is a good representation of scaling issues. I think it has a lot more to do with density/networks/ports and the rate of churn in the environment, but I could be wrong. For example, I only have 80 high-density computes (64 or 128 CPUs with ~400 instances per compute) and I run into the same scaling issues described in the Large Scale SIG, and I have to do a lot of tuning to keep the environment stable. My environment is also somewhat unique in how it gets used: quite often I have 2k to 4k instances torn down and rebuilt within an hour or two, so my APIs are constantly bombarded.
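For concreteness, a rough sketch of how that churn could be measured with openstacksdk (the cloud name is made up and admin credentials are assumed):

    from datetime import datetime, timedelta, timezone

    import openstack

    conn = openstack.connect(cloud="mycloud")  # hypothetical clouds.yaml entry
    since = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
    # changes-since also matches deleted servers, so teardown churn is counted too
    changed = sum(1 for _ in conn.compute.servers(all_projects=True,
                                                  changes_since=since))
    print(f"servers created/updated/deleted in the last hour: {changed}")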
On Tue, Feb 2, 2021 at 3:15 PM Erik Olof Gunnar Andersson <eandersson@blizzard.com> wrote:
> The old value of 500 nodes max has not been true for a very long time. RabbitMQ and the DB, however, still tend to be the bottleneck to scaling beyond 1500 nodes, aside from the operational overhead.
We manage our scale with regions as well. With 1k nodes our RabbitMQ isn't breaking a sweat, and there are no signs that the database would be hitting any limits. Our issues have been limited to scaling Neutron, and to VM scheduling on Nova, mostly due to NUMA pinning.
------------------------------
From: Sean Mooney <smooney@redhat.com>
Sent: Tuesday, February 2, 2021 9:50 AM
To: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [ops][largescale-sig] How many compute nodes in a single cluster ?
On Tue, 2021-02-02 at 17:37 +0000, Arnaud Morin wrote:
Hey all,
I will start the answers :)
At OVH, our hard limit is around 1500 hypervisors in a region. It also depends a lot on the number of instances (and neutron ports). The effects if we try to go above this number:
- load on the control plane (db/rabbit) increases a lot
- "burst" load is hard to manage (e.g. a restart of all neutron agents or nova-computes puts high pressure on the control plane; see the sketch below)
- and of course, the failure domain is bigger
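As an illustration, a small openstacksdk sketch to spot agents that stop reporting during such a burst (the cloud name is hypothetical):

    import openstack

    conn = openstack.connect(cloud="mycloud")  # hypothetical clouds.yaml entry
    agents = list(conn.network.agents())
    dead = [a for a in agents if not a.is_alive]
    # a spike here right after a mass restart means agents cannot check in fast enough
    print(f"{len(dead)} of {len(agents)} neutron agents are down")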
Note that we don't use cells. We are deploying multiple regions, but this is painful for our clients to manage and understand. We are looking for a solution to unify the regions, but we have not found anything that fits our needs yet.
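In the meantime, one can at least fan the same checks out per region with openstacksdk (region names and the cloud entry are hypothetical):

    import openstack

    for region in ("region-1", "region-2"):  # hypothetical region names
        conn = openstack.connect(cloud="mycloud", region_name=region)
        building = sum(1 for _ in conn.network.ports(status="BUILD"))
        print(f"{region}: {building} ports in BUILD")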
I assume you do not see cells v2 as a replacement for multiple regions because they do not provide independent fault domains, and also because they are only a Nova feature, so they do not solve scaling issues in other services, like Neutron, which are stretched across all cells.
Cells are a scaling mechanism, but the larger the cloud, the harder it is to upgrade, and cells do not help with that; in fact, by adding more controllers they hinder upgrades.
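To make the "more controllers" point concrete: each cell carries its own message queue and database, e.g. in a cell conductor's nova.conf (hosts and credentials below are made up):

    [DEFAULT]
    transport_url = rabbit://nova:secret@rabbit-cell1:5672/

    [database]
    connection = mysql+pymysql://nova:secret@db-cell1/nova_cell1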
Separate regions can all be upgraded independently, and they can be fault tolerant if you don't share services between regions and use federation to avoid sharing Keystone.
Glad to hear you can manage 1500 compute nodes, by the way.
The old value of 500 nodes max has not been true for a very long time. RabbitMQ and the DB, however, still tend to be the bottleneck to scaling beyond 1500 nodes, aside from the operational overhead.
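For example, one rough way to watch the rabbit side of that bottleneck is to poll queue depth with pika (host, credentials, and the queue name are all assumptions):

    import pika

    params = pika.URLParameters("amqp://nova:secret@rabbit-host:5672/")
    channel = pika.BlockingConnection(params).channel()
    # passive declare only inspects the queue; "conductor" is a guess,
    # real queue names depend on the deployment
    frame = channel.queue_declare(queue="conductor", passive=True)
    print(f"messages waiting: {frame.method.message_count}")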
Cheers,