Thanks for your reply, a lot of useful info!

We already identified that using a separate rabbit cluster for neutron could improve scalability.

About NATS, I have never tried this piece of software, but it definitely sounds like a good fit for large clouds.

On the rabbitmq side, they worked on a new kind of queue called "quorum" that is HA by design. The documentation now recommends using quorum queues instead of classic queues with HA. Does anyone know if there is a chance that oslo_messaging will manage this kind of queue?

Besides rabbit, we also monitor our database cluster (we are using mariadb with galera) very carefully. About it, we also think that splitting the cluster into multiple deployments could help, but while it's easy to say, it's time consuming to move an already running cloud to a new architecture :)

Regards,

--
Arnaud Morin

On 03.02.21 - 14:55, Sean Mooney wrote:
On Wed, 2021-02-03 at 14:24 +0000, Arnaud Morin wrote:
Yes, totally agree with that, on our side we are used to monitoring the number of neutron ports (and especially the number of ports in BUILD state).
As an instance usually has one port in our cloud, the number of instances is close to the number of ports.
About cells v2, we are mostly struggling on the neutron side, so cells are not helping us.
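The port monitoring described above could be sketched roughly as follows. This is only an illustration: it assumes each port is a dict carrying a "status" key (as ports listed from the neutron API do), and the function names, the sample data and the max_build threshold are all hypothetical.

```python
from collections import Counter


def count_ports_by_status(ports):
    """Count ports per status from a list of port dicts.

    Each dict is assumed to carry a "status" key (for example
    ACTIVE, DOWN or BUILD); ports without one are counted as UNKNOWN.
    """
    return Counter(port.get("status", "UNKNOWN") for port in ports)


def too_many_building(ports, max_build=10):
    """Return True when more than max_build ports are in BUILD state.

    max_build is a hypothetical threshold, to be tuned per cloud.
    """
    return count_ports_by_status(ports)["BUILD"] > max_build


# Hypothetical sample standing in for a real port listing.
sample = [
    {"id": "p1", "status": "ACTIVE"},
    {"id": "p2", "status": "BUILD"},
    {"id": "p3", "status": "BUILD"},
    {"id": "p4", "status": "DOWN"},
]

print(count_ports_by_status(sample))
print(too_many_building(sample, max_build=1))
```

In a real deployment the list would come from the network API and the counts would be pushed to whatever monitoring system is already in place.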
Ack, that makes sense. There are some things you can do to help scale neutron. One semi-simple step, if you are using ml2/ovs, ml2/linux-bridge or ml2/sriov-nic-agent, is to move neutron to its own rabbitmq instance. Neutron with the default ml2 drivers tends to be quite chatty, so placing it on its own rabbit instance can help. While it is in conflict with HA requirements, ensuring that clustering is not used, and instead load balancing with something like pacemaker to a single rabbitmq server, can also help. Rabbitmq's clustering ability, while improving HA by removing a single point of failure, decreases the performance of rabbit. So if you have good monitoring and can simply restart or redeploy rabbit quickly, using k8s or something else, an active/backup deployment mediated by pacemaker can work much better than actually clustering.
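As a rough sketch of what moving neutron to its own rabbit could look like on the configuration side: the snippet below builds an oslo.messaging style transport_url and renders it as a [DEFAULT] section for neutron.conf. The host name, user, password and vhost are hypothetical placeholders, and the helper functions are made up for this illustration, they are not part of any OpenStack library.

```python
import configparser
import io
from urllib.parse import quote


def build_transport_url(user, password, hosts, vhost, port=5672):
    """Build a rabbit:// transport URL for one or more hosts.

    User and password are percent-encoded so that special characters
    do not break the URL. With an active/backup setup behind a single
    address, hosts would hold just that one address.
    """
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    netloc = ",".join(f"{creds}@{host}:{port}" for host in hosts)
    return f"rabbit://{netloc}/{vhost}"


def render_neutron_conf(transport_url):
    """Render a minimal [DEFAULT] section holding only transport_url."""
    # interpolation=None keeps any '%' in the URL untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser["DEFAULT"] = {"transport_url": transport_url}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


# Hypothetical values: a dedicated rabbit reachable at one address.
url = build_transport_url(
    "neutron", "secret", ["rabbit-neutron.example.org"], vhost="neutron"
)
print(render_neutron_conf(url))
```

The other services would keep pointing their own transport_url at the existing rabbit, so only the neutron server and its agents would talk to the dedicated instance.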
If you use ml2/ovn, that allows you to remove the need for the dhcp agent and l3 agent as well as the l2 agent per compute host. That significantly reduces neutron's rpc impact; however, ovn does have some parity gaps and scaling issues of its own. If it works for you, and you can use a new enough version that allows the ovn southd process on the compute nodes to subscribe to a subset of north/southdb updates relevant to just that node, it can help with scaling neutron.
I'm not sure how the use of features like dvr or routed provider networks impacts this, as I mostly work on nova now, but at least from a data plane point of view it can reduce contention on the networking nodes (where l3 agents run) to do routing and nat on behalf of all compute nodes.
At some point it might make sense for neutron to take a similar cells approach to its own architecture, but given its ability to delegate some or all of the networking to an external network controller like ovn/odl, it has never been clear that an in-tree sharding mechanism like cells was actually required.
One thing that I hope someone will have time to investigate at some point is whether we can replace rabbitmq in general with nats. This general topic comes up with different technologies from time to time. Nats, however, looks like it would actually be a good match in terms of features and intended use, while being much lighter weight than rabbitmq and actually improving in performance the more nats server instances you cluster, since that was a design constraint from the start.
I don't actually think neutron's architecture, or nova's for that matter, is inherently flawed, but a more modern messaging bus might help all distributed services scale with fewer issues than they have today.