[nova][neutron][oslo][ops][kolla] rabbit bindings issue

Sean Mooney smooney at redhat.com
Sun Aug 16 13:37:18 UTC 2020


On Sat, 2020-08-15 at 20:13 -0400, Satish Patel wrote:
> Hi Sean,
> 
> Sounds good, but running rabbitmq for each service is going to add a little
> overhead. Also, how do you scale the cluster? (Yes, we can use cells v2, but it's
> not something everyone likes to do because of the complexity.)

my understanding is that with rabbitmq, adding multiple rabbitmq servers to a cluster lowers
throughput vs just 1 rabbitmq instance for any given exchange. that is because the contents of
the queues need to be synchronised across the cluster. so if cinder, nova and neutron share
a 3 node cluster, and you compare that to the same services deployed with cinder, nova and neutron
each having their own rabbitmq server, then the independent deployment will tend to outperform the
clustered solution. i'm not really sure if that has changed. i know that how clustering is done has evolved
over the years, but in the past clustering was the adversary of scaling.
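
as a rough sketch of the two layouts compared above (not taken from any real deployment), the snippet
below builds oslo.messaging style transport_url strings for a shared 3 node cluster and for one
standalone rabbitmq per service. the host names, user names and password are made up for illustration,
and the transport_url() helper is hypothetical.

```python
from urllib.parse import quote


def transport_url(user, password, hosts, vhost="", port=5672):
    """Build a rabbit:// URL in the oslo.messaging transport_url format.

    Hypothetical helper: one user:password@host:port entry per broker,
    comma separated, followed by the virtual host.
    """
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    netloc = ",".join(f"{creds}@{host}:{port}" for host in hosts)
    return f"rabbit://{netloc}/{vhost}"


SERVICES = ["cinder", "nova", "neutron"]

# Layout 1: every service points at the same 3 node cluster (assumed host names).
shared = {
    svc: transport_url("openstack", "secret", ["rabbit1", "rabbit2", "rabbit3"])
    for svc in SERVICES
}

# Layout 2: every service gets its own standalone broker (assumed host names).
independent = {
    svc: transport_url(svc, "secret", [f"rabbit-{svc}"])
    for svc in SERVICES
}

for svc in SERVICES:
    print(f"[{svc}] shared:      {shared[svc]}")
    print(f"[{svc}] independent: {independent[svc]}")
```

each service would then carry its own value as transport_url in the [DEFAULT] section of its config file.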

>  If we think
> rabbitMQ is a growing pain, then why is the community not looking for
> alternative options (kafka etc.)?
we have looked at alternatives several times.
rabbitmq works well enough and scales well enough for most deployments.
there are other amqp implementations that scale better than rabbit.
activemq and qpid are both reported to scale better, but they perform worse
out of the box and need to be carefully tuned.

in the past zeromq was supported, but people did not maintain it.

i don't think kafka is a good alternative, but nats https://nats.io/ might be.

for what it's worth, all nova deployments have been cells_v2 deployments with 1 cell since around pike/rocky,
and it's really not that complex. cells_v1 was much more complex, but part of the redesign
for cells_v2 was making sure there is only 1 code path. adding a second cell just needs another
cell db and conductor to be deployed, assuming you started with a super conductor in the first
place. the issue is that cells is only a nova feature; no other service has cells, so it does not help
you with cinder or neutron. as such, cinder and neutron will likely be the services that hit scaling limits first.
adopting cells in other services is not necessarily the right approach either, but when we talk about scale
we do need to keep in mind that cells is just for nova today.


> 
> On Fri, Aug 14, 2020 at 3:09 PM Sean Mooney <smooney at redhat.com> wrote:
> > 
> > On Fri, 2020-08-14 at 18:45 +0200, Fabian Zimmermann wrote:
> > > Hi,
> > > 
> > > I read somewhere that vexxhost's kubernetes openstack-operator is running
> > > one rabbitmq container per service. Only the kubernetes self-healing is
> > > used as "ha" for rabbitmq.
> > > 
> > > That seems to match with my finding: run rabbitmq standalone and use an
> > > external system to restart rabbitmq if required.
> > 
> > that's the design that was originally planned for kolla-kubernetes.
> > 
> > each service was to be deployed with its own rabbitmq server if it required one,
> > and if it crashed it would just be recreated by k8s. it performs better than a cluster,
> > and if you trust k8s or the external service enough to ensure it is recreated, it
> > should be as effective a solution. you don't even need k8s to do that, but it seems to be
> > a good fit if you're prepared to occasionally lose in-flight rpcs.
> > if you're not, then you can configure rabbit to persist all messages to disk and mount that on a shared
> > file system like nfs or cephfs so that when the rabbit instance is recreated the queue contents are
> > preserved. assuming you can take the performance hit of writing all messages to disk, that is.
> > > 
> > >  Fabian
> > > 
> > > On Fri, Aug 14, 2020 at 16:59, Satish Patel <satish.txt at gmail.com> wrote:
> > > 
> > > > Fabian,
> > > > 
> > > > what do you mean?
> > > > 
> > > > > > I think vexxhost is running (1) with their openstack-operator - for reasons.
> > > > 
> > > > On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz at gmail.com> wrote:
> > > > > 
> > > > > Hello again,
> > > > > 
> > > > > just a short update about the results of my tests.
> > > > > 
> > > > > I currently see 2 ways of running openstack+rabbitmq
> > > > > 
> > > > > 1. without durable-queues and without replication - just one rabbitmq-process which gets (somehow) restarted if it fails.
> > > > > 2. durable-queues and replication
> > > > > 
> > > > > Any other combination of these settings leads to more or less issues with
> > > > > 
> > > > > * broken / non working bindings
> > > > > * broken queues
> > > > > 
> > > > > I think vexxhost is running (1) with their openstack-operator - for reasons.
> > > > > 
> > > > > I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.
> > > > > 
> > > > > Could someone point me to the best way to document these findings in some official doc?
> > > > > I think a lot of installations out there will run into issues if - under load - a node fails.
> > > > > 
> > > > >  Fabian
> > > > > 
> > > > > 
> > > > > On Thu, Aug 13, 2020 at 15:13, Fabian Zimmermann <dev.faz at gmail.com> wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > just did some short tests today in our test-environment (without durable queues and without replication):
> > > > > > 
> > > > > > * started a rally task to generate some load
> > > > > > * kill-9-ed rabbitmq on one node
> > > > > > * rally task immediately stopped and the cloud (mostly) stopped working
> > > > > > 
> > > > > > after some debugging i found (again) exchanges which had bindings to queues, but these bindings didn't forward any msgs.
> > > > > > Wrote a small script to detect these broken bindings and will now check if this is "reproducible"
> > > > > > 
> > > > > > then I will try "durable queues" and "durable queues with replication" to see if this helps. Although I would expect
> > > > > > rabbitmq to be able to handle this without these "hidden broken bindings"
> > > > > > 
> > > > > > This just FYI.
> > > > > > 
> > > > > >  Fabian
> 
> 