<div dir="ltr">I am not sure that the number of compute nodes alone is a good representation of scaling issues. I think it has a lot more to do with density, networks, ports, and the rate of churn in the environment, but I could be wrong. For example, I only have 80 high-density computes (64 or 128 CPUs with ~400 instances per compute), yet I run into the same scaling issues that are described in the Large Scale SIG and have to do a lot of tuning to keep the environment stable. My environment is also somewhat unique in how it gets used: quite often I have 2k to 4k instances torn down and rebuilt within an hour or two, so my APIs are constantly bombarded. </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Feb 2, 2021 at 3:15 PM Erik Olof Gunnar Andersson <<a href="mailto:eandersson@blizzard.com" target="_blank">eandersson@blizzard.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


<div dir="ltr">

<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

> <span style="font-size:14.6667px;background-color:rgb(255,255,255);display:inline">the old value of 500

 nodes max has not been true for a very long time</span><br>

<span style="font-size:14.6667px;background-color:rgb(255,255,255);display:inline">rabbitmq and the db still

 tend to be the bottleneck to scale, however, beyond 1500 nodes</span><br style="font-size:14.6667px;background-color:rgb(255,255,255)">

<span style="font-size:14.6667px;background-color:rgb(255,255,255);display:inline">outside of the operational

 overhead.</span></div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

<span style="font-size:14.6667px;background-color:rgb(255,255,255);display:inline"><br>

</span></div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

We manage our scale with regions as well. <span style="color:rgb(0,0,0);font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt">With 1k nodes our RabbitMQ isn't breaking a sweat, and there are no signs that the database is hitting any limits. </span><span style="color:rgb(0,0,0);font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt">Our

 issues have been limited to scaling Neutron and VM scheduling on Nova, mostly due to NUMA pinning.</span></div>

<div id="gmail-m_-3996866023918791136gmail-m_3020518691070211360appendonsend"></div>

<hr style="display:inline-block;width:98%">

<div id="gmail-m_-3996866023918791136gmail-m_3020518691070211360divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank">smooney@redhat.com</a>><br>

<b>Sent:</b> Tuesday, February 2, 2021 9:50 AM<br>

<b>To:</b> <a href="mailto:openstack-discuss@lists.openstack.org" target="_blank">openstack-discuss@lists.openstack.org</a> <<a href="mailto:openstack-discuss@lists.openstack.org" target="_blank">openstack-discuss@lists.openstack.org</a>><br>

<b>Subject:</b> Re: [ops][largescale-sig] How many compute nodes in a single cluster ?</font>

<div> </div>

</div>

<div><font size="2"><span style="font-size:11pt">

<div>On Tue, 2021-02-02 at 17:37 +0000, Arnaud Morin wrote:<br>

> Hey all,<br>

> <br>

> I will start the answers :)<br>

> <br>

> At OVH, our hard limit is around 1500 hypervisors on a region.<br>

> It also depends a lot on the number of instances (and neutron ports).<br>

> The effects if we try to go above this number:<br>

> - load on control plane (db/rabbit) is increasing a lot<br>

> - "burst" load is hard to manage (e.g. restart of all neutron agents or<br>

>   nova computes is putting a high pressure on control plane)<br>

> - and of course, failure domain is bigger<br>

> <br>

> Note that we don't use cells.<br>

> We are deploying multiple regions, but this is painful to manage /<br>

> understand for our clients.<br>

> We are looking for a solution to unify the regions, but we did not find<br>

> anything which could fit our needs for now.<br>

<br>

i assume you do not see cells v2 as a replacement for multiple regions because they

<br>

do not provide independent fault domains, and also because they are only a nova feature,<br>

so they do not solve scaling issues in other services like neutron, which are stretched across<br>

all cells.<br>

<br>

cells are a scaling mechanism, but the larger the cloud the harder it is to upgrade, and cells do not<br>

help with that. in fact, by adding more controllers they hinder upgrades.<br>

<br>

separate regions can all be upgraded independently and can be fault tolerant if you don't share services<br>

between regions and use federation to avoid sharing keystone.<br>

<br>

<br>

glad to hear you can manage 1500 compute nodes by the way.<br>

<br>

the old value of 500 nodes max has not been true for a very long time<br>

rabbitmq and the db still tend to be the bottleneck to scale, however, beyond 1500 nodes<br>

outside of the operational overhead.<br>

<br>

> <br>

> Cheers,<br>

> <br>

<br>

<br>

<br>

</div>

</span></font></div>

</div>


</blockquote></div>