[ops][largescale-sig] How many compute nodes in a single cluster ?

David Ivey david.j.ivey at gmail.com
Wed Feb 3 14:05:30 UTC 2021


I am not sure simply going off the number of compute nodes is a good
representation of scaling issues. I think it has a lot more to do with
density/networks/ports and the rate of churn in the environment, but I
could be wrong. For example, I only have 80 high-density computes (64 or
128 CPUs with ~400 instances per compute) and I run into the same scaling
issues that are described in the Large Scale SIG, and I have to do a lot
of tuning to keep the environment stable. My environment is also somewhat
unique in the way it gets used: I quite often have 2k to 4k instances torn
down and rebuilt within an hour or two, so my APIs are constantly
bombarded.
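
To give an idea of what I mean by tuning, the bulk of it is bumping API/RPC
worker counts and database pool sizes on the controllers. A rough sketch
(these are real nova/neutron options, but the values are placeholders, not
recommendations, and it assumes crudini is installed):

    # raise worker counts and RPC/DB limits (values are illustrative)
    crudini --set /etc/nova/nova.conf DEFAULT osapi_compute_workers 16
    crudini --set /etc/nova/nova.conf DEFAULT rpc_response_timeout 180
    crudini --set /etc/nova/nova.conf database max_pool_size 50
    crudini --set /etc/nova/nova.conf database max_overflow 100
    crudini --set /etc/neutron/neutron.conf DEFAULT api_workers 16
    crudini --set /etc/neutron/neutron.conf DEFAULT rpc_workers 8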

On Tue, Feb 2, 2021 at 3:15 PM Erik Olof Gunnar Andersson <
eandersson at blizzard.com> wrote:

> > the old value of 500 nodes max has not been true for a very long time;
> > rabbitmq and the db still tend to be the bottleneck to scaling beyond
> > 1500 nodes, however, outside of the operational overhead.
>
> We manage our scale with regions as well. With 1k nodes our RabbitMQ
> isn't breaking a sweat, and there are no signs that the database is
> hitting any limits. Our issues have been limited to scaling Neutron and to
> VM scheduling in Nova, mostly due to NUMA pinning.
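>
> For reference, by NUMA pinning I mean flavors carrying dedicated-CPU and
> NUMA topology extra specs along these lines (the flavor name and sizes are
> purely illustrative):
>
>     openstack flavor create numa.pinned --vcpus 16 --ram 65536 --disk 40
>     openstack flavor set numa.pinned \
>         --property hw:cpu_policy=dedicated \
>         --property hw:numa_nodes=2
>
> Every boot of such a flavor makes the scheduler fit a NUMA topology onto
> each candidate host, which is where most of our scheduling cost shows up.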
> ------------------------------
> *From:* Sean Mooney <smooney at redhat.com>
> *Sent:* Tuesday, February 2, 2021 9:50 AM
> *To:* openstack-discuss at lists.openstack.org <
> openstack-discuss at lists.openstack.org>
> *Subject:* Re: [ops][largescale-sig] How many compute nodes in a single
> cluster ?
>
> On Tue, 2021-02-02 at 17:37 +0000, Arnaud Morin wrote:
> > Hey all,
> >
> > I will start the answers :)
> >
> > At OVH, our hard limit is around 1500 hypervisors in a region.
> > It also depends a lot on the number of instances (and neutron ports).
> > The effects if we try to go above this number are:
> > - load on the control plane (db/rabbit) increases a lot
> > - "burst" load is hard to manage (e.g. restarting all neutron agents or
> >   nova-compute services puts high pressure on the control plane; see the
> >   sketch below)
> > - and of course, the failure domain is bigger
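> >
> > (A rough sketch of what dealing with such a burst looks like: stagger the
> > restarts instead of doing them all at once. The host discovery and timing
> > below are illustrative, and the systemd unit name varies by distro.)
> >
> >     # restart nova-compute one host at a time rather than everywhere at once
> >     for host in $(openstack compute service list --service nova-compute \
> >                   -f value -c Host); do
> >         ssh "$host" sudo systemctl restart nova-compute
> >         sleep 30   # give rabbit/db time to absorb the reconnects
> >     done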
> >
> > Note that we don't use cells.
> > We are deploying multiple regions, but this is painful for our clients
> > to manage and understand.
> > We are looking for a solution to unify the regions, but so far we have
> > not found anything that fits our needs.
>
> I assume you do not see cells v2 as a replacement for multiple regions
> because they do not provide independent fault domains, and also because
> they are only a nova feature, so they do not solve scaling issues in other
> services like neutron which are stretched across all cells.
>
> Cells are a scaling mechanism, but the larger the cloud the harder it is
> to upgrade, and cells do not help with that; in fact, by adding more
> controllers they hinder upgrades.
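>
> (For reference, each additional cell really is another set of controllers:
> its own message queue and database registered in the API database.
> Roughly, with placeholder URLs:)
>
>     nova-manage cell_v2 create_cell --name cell2 \
>         --transport-url rabbit://nova:secret@rabbit-cell2:5672/ \
>         --database_connection mysql+pymysql://nova:secret@db-cell2/nova_cell2
>     nova-manage cell_v2 discover_hosts --by-service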
>
> Separate regions can all be upgraded independently and can be fault
> tolerant if you don't share services between regions and use federation
> to avoid sharing keystone.
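>
> (e.g. keystone-to-keystone identity federation, so each region keeps its
> own keystone; the names below are placeholders, and these are roughly the
> service-provider-side commands involved:)
>
>     openstack identity provider create --remote-id <idp-entity-id> region1-idp
>     openstack mapping create --rules mapping-rules.json region1-mapping
>     openstack federation protocol create saml2 \
>         --identity-provider region1-idp --mapping region1-mapping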
>
>
> Glad to hear you can manage 1500 compute nodes, by the way.
>
> The old value of 500 nodes max has not been true for a very long time;
> rabbitmq and the db still tend to be the bottleneck to scaling beyond
> 1500 nodes, however, outside of the operational overhead.
>
> >
> > Cheers,
> >
>
>
>
>