[nova][neutron][oslo][ops][kolla] rabbit bindings issue

Fabian Zimmermann dev.faz at gmail.com
Mon Aug 17 14:21:34 UTC 2020


Hi,

oh, that's great!

So someone at openstack-ansible already detected this and just forgot
to update docs.openstack.org ;)

I tested my regex and it seems to fix my issue (atm).

I will run an OpenStack Rally load test with the regex above to check
what happens if I terminate a RabbitMQ node while load is hitting the
system.
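
For reference, an untested sketch of how such a regex can be checked
against the live queue list (it assumes the management plugin on
localhost:15672 and the default guest credentials):

    import re
    import requests

    pattern = re.compile(r'^(?!amq\.)(?!reply_)(?!.*fanout).*')

    # list live queues via the management API (endpoint/credentials
    # here are assumptions, adjust to your deployment)
    for q in requests.get('http://localhost:15672/api/queues',
                          auth=('guest', 'guest')).json():
        if pattern.match(q['name']):
            print('policy would apply to:', q['name'])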

 Fabian

On Mon, 17 Aug 2020 at 16:17, Arnaud Morin
<arnaud.morin at gmail.com> wrote:
>
> Hey Fabian,
>
> I was thinking the same, and I found the "default" values from
> openstack-ansible:
> https://github.com/openstack/openstack-ansible-rabbitmq_server/blob/fc27e735a68b64cb3c67dd8abeaf324803a9845b/defaults/main.yml#L172
>
> pattern: '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'
>
> Which sets HA for everything except:
> amq.*
> *_fanout_*
> reply_*
>
> So that would make sense?
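>
> Purely as an illustration (an untested sketch, and the queue names
> below are made up), a few lines of Python show what that pattern
> catches:
>
>     import re
>
>     # openstack-ansible's default HA policy pattern
>     pattern = re.compile(r'^(?!(amq\.)|(.*_fanout_)|(reply_)).*')
>
>     for name in ['amq.gen-abc123',        # excluded: amq.*
>                  'compute_fanout_cafe',   # excluded: *_fanout_*
>                  'reply_deadbeef',        # excluded: reply_*
>                  'compute.host1',         # mirrored
>                  'notifications.info']:   # mirrored
>         print(name, '->', 'HA' if pattern.match(name) else 'skipped')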
>
> --
> Arnaud Morin
>
> On 17.08.20 - 16:03, Fabian Zimmermann wrote:
> > Just to keep the list updated.
> >
> > If you run with durable queues and replication, there is still a
> > possibility that a short-lived queue will *not* yet be replicated,
> > and a node failure will then mark such queues as "unreachable". This
> > wouldn't be a problem if OpenStack created a new queue, but I fear
> > it just tries to reuse the existing one after reconnecting.
> >
> > So, after all, it seems the less buggy way would be:
> >
> > * use durable queues and replication for long-running queues/exchanges
> > * use non-durable queues without replication for short-lived (fanout, reply_) queues
> >
> > This should allow the short-lived ones to destroy themselves on node
> > failure, while the long-lived ones stay as available as possible.
> >
> > Absolutely untested - so use with caution, but here is a possible
> > policy-regex: ^(?!amq\.)(?!reply_)(?!.*fanout).*
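> >
> > As a rough, equally untested sketch, that regex could be applied as
> > an HA policy through the management API (endpoint, vhost and
> > credentials below are assumptions):
> >
> >     import requests
> >
> >     policy = {
> >         # mirror only long-lived queues; fanout/reply stay unmirrored
> >         'pattern': r'^(?!amq\.)(?!reply_)(?!.*fanout).*',
> >         'definition': {'ha-mode': 'all'},
> >         'apply-to': 'queues',
> >     }
> >     requests.put('http://localhost:15672/api/policies/%2F/ha-longlived',
> >                  json=policy, auth=('guest', 'guest'))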
> >
> >  Fabian
> >
> >
> > > On Sun, 16 Aug 2020 at 15:37, Sean Mooney <smooney at redhat.com> wrote:
> > >
> > > On Sat, 2020-08-15 at 20:13 -0400, Satish Patel wrote:
> > > > Hi Sean,
> > > >
> > > > Sounds good, but running RabbitMQ for each service is going to add a
> > > > little overhead too. And how do you scale the cluster? (Yes, we can
> > > > use cells v2, but it's not something everyone likes to do because of
> > > > the complexity.)
> > >
> > > My understanding is that adding multiple RabbitMQ servers in a cluster lowers
> > > throughput versus just one RabbitMQ instance for any given exchange. That is because
> > > the contents of the queues need to be synchronised across the cluster. So if Cinder, Nova
> > > and Neutron share a 3-node cluster, and you compare that to the same services deployed
> > > with Cinder, Nova and Neutron each having their own RabbitMQ service, then the
> > > independent deployment will tend to outperform the clustered solution. I'm not sure
> > > whether that has changed - I know how clustering is done has evolved
> > > over the years - but in the past clustering was the adversary of scaling.
> > >
> > > > If we think RabbitMQ is a growing pain, then why is the community not
> > > > looking at alternative options (Kafka, etc.)?
> > > We have looked at alternatives several times.
> > > RabbitMQ works well enough and scales well enough for most deployments.
> > > There are other AMQP implementations that scale better than Rabbit;
> > > ActiveMQ and Qpid are both reported to scale better, but they perform worse
> > > out of the box and need to be carefully tuned.
> > >
> > > In the past ZeroMQ was supported, but people did not maintain it.
> > >
> > > Kafka I don't think is a good alternative, but NATS https://nats.io/ might be.
> > >
> > > For what it's worth, all Nova deployments have been cells v2 deployments with 1 cell
> > > since around Pike/Rocky, and it's really not that complex. cells_v1 was much more
> > > complex, but part of the redesign for cells_v2 was making sure there is only 1 code
> > > path. Adding a second cell just needs another cell DB and conductor to be deployed,
> > > assuming you started with a super-conductor in the first place. The issue is that
> > > cells is only a Nova feature; no other service has cells, so it does not help
> > > you with Cinder or Neutron. As such, Cinder and Neutron will likely be the services
> > > that hit scaling limits first. Adopting cells in other services is not necessarily
> > > the right approach either, but when we talk about scale
> > > we do need to keep in mind that cells is just for Nova today.
> > >
> > >
> > > >
> > > > On Fri, Aug 14, 2020 at 3:09 PM Sean Mooney <smooney at redhat.com> wrote:
> > > > >
> > > > > On Fri, 2020-08-14 at 18:45 +0200, Fabian Zimmermann wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I read somewhere that Vexxhost's Kubernetes openstack-operator is running
> > > > > > one RabbitMQ container per service. Just the Kubernetes self-healing is
> > > > > > used as "HA" for RabbitMQ.
> > > > > >
> > > > > > That seems to match my finding: run RabbitMQ standalone and use an
> > > > > > external system to restart RabbitMQ if required.
> > > > >
> > > > > That's the design that was originally planned for kolla-kubernetes.
> > > > >
> > > > > Each service was to be deployed with its own RabbitMQ server if it required one,
> > > > > and if it crashed it would just be recreated by k8s. It performs better than a
> > > > > cluster, and if you trust k8s or the external service enough to ensure it is
> > > > > recreated, it should be as effective a solution. You don't even need k8s to do
> > > > > that, but it seems to be a good fit if you are prepared to occasionally lose
> > > > > in-flight RPCs.
> > > > > If you are not, then you can configure Rabbit to persist all messages to disk and
> > > > > mount that on a shared file system like NFS or CephFS, so that when the Rabbit
> > > > > instance is recreated the queue contents are preserved - assuming you can take
> > > > > the performance hit of writing all messages to disk, that is.
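> > > > >
> > > > > As a side note, if I recall correctly the oslo.messaging side of the
> > > > > durable setup is the amqp_durable_queues option, e.g.
> > > > >
> > > > >     [oslo_messaging_rabbit]
> > > > >     amqp_durable_queues = true
> > > > >
> > > > > in each service's config; treat that as a pointer to verify, not gospel.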
> > > > > >
> > > > > >  Fabian
> > > > > >
> > > > > > On Fri, 14 Aug 2020 at 16:59, Satish Patel <satish.txt at gmail.com> wrote:
> > > > > >
> > > > > > > Fabian,
> > > > > > >
> > > > > > > what do you mean?
> > > > > > >
> > > > > > > > > I think vexxhost is running (1) with their openstack-operator - for reasons.
> > > > > > >
> > > > > > > On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz at gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hello again,
> > > > > > > >
> > > > > > > > just a short update about the results of my tests.
> > > > > > > >
> > > > > > > > I currently see 2 ways of running openstack+rabbitmq:
> > > > > > > >
> > > > > > > > 1. without durable-queues and without replication - just one
> > > > > > > >    rabbitmq-process which gets (somehow) restarted if it fails.
> > > > > > > > 2. durable-queues and replication
> > > > > > > >
> > > > > > > > Any other combination of these settings leads to more or less severe issues with
> > > > > > > >
> > > > > > > > * broken / non working bindings
> > > > > > > > * broken queues
> > > > > > > >
> > > > > > > > I think vexxhost is running (1) with their openstack-operator - for reasons.
> > > > > > > >
> > > > > > > > I added [kolla], because kolla-ansible installs rabbitmq with
> > > > > > > > replication but without durable-queues.
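> > > > > > > >
> > > > > > > > For checking what a given deployment actually configured, an
> > > > > > > > untested sketch against the management API (endpoint and
> > > > > > > > credentials are assumptions):
> > > > > > > >
> > > > > > > >     import requests
> > > > > > > >
> > > > > > > >     AUTH = ('guest', 'guest')
> > > > > > > >     API = 'http://localhost:15672/api'
> > > > > > > >
> > > > > > > >     # replication shows up as ha-* policy definitions ...
> > > > > > > >     for p in requests.get(API + '/policies', auth=AUTH).json():
> > > > > > > >         print('policy:', p['name'], p['pattern'], p['definition'])
> > > > > > > >     # ... durability as a per-queue flag
> > > > > > > >     for q in requests.get(API + '/queues', auth=AUTH).json():
> > > > > > > >         print('queue:', q['name'], 'durable =', q['durable'])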
> > > > > > > >
> > > > > > > > Could someone point me to the best way to get these findings into some
> > > > > > > > official doc?
> > > > > > > > I think a lot of installations out there will run into issues if - under
> > > > > > > > load - a node fails.
> > > > > > > >
> > > > > > > >  Fabian
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, 13 Aug 2020 at 15:13, Fabian Zimmermann <dev.faz at gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > just did some short tests today in our test environment (without
> > > > > > > > > durable queues and without replication):
> > > > > > > > >
> > > > > > > > > * started a rally task to generate some load
> > > > > > > > > * kill-9-ed rabbitmq on one node
> > > > > > > > > * rally task immediately stopped and the cloud (mostly) stopped working
> > > > > > > > >
> > > > > > > > > after some debugging I found (again) exchanges which had bindings to
> > > > > > > > > queues, but these bindings didn't forward any messages.
> > > > > > > > > I wrote a small script to detect these broken bindings and will now check
> > > > > > > > > if this is "reproducible".
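> > > > > > > > >
> > > > > > > > > As an untested sketch of one possible check (not the script
> > > > > > > > > mentioned above): the management API's publish endpoint reports
> > > > > > > > > whether a message was routed, which might flag such bindings.
> > > > > > > > > Endpoint and credentials are assumptions, the probe messages
> > > > > > > > > really do end up in the queues, and whether it catches this
> > > > > > > > > exact failure mode is unverified:
> > > > > > > > >
> > > > > > > > >     import requests
> > > > > > > > >
> > > > > > > > >     API = 'http://localhost:15672/api'
> > > > > > > > >     AUTH = ('guest', 'guest')
> > > > > > > > >
> > > > > > > > >     for b in requests.get(API + '/bindings/%2F', auth=AUTH).json():
> > > > > > > > >         if not b['source']:          # skip the default exchange
> > > > > > > > >             continue
> > > > > > > > >         r = requests.post(
> > > > > > > > >             API + '/exchanges/%2F/' + b['source'] + '/publish',
> > > > > > > > >             auth=AUTH,
> > > > > > > > >             json={'properties': {}, 'routing_key': b['routing_key'],
> > > > > > > > >                   'payload': 'probe', 'payload_encoding': 'string'})
> > > > > > > > >         if not r.json().get('routed'):
> > > > > > > > >             print('broken?', b['source'], '->', b['destination'])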
> > > > > > > > >
> > > > > > > > > Then I will try "durable queues" and "durable queues with replication"
> > > > > > > > > to see if this helps, even though I would expect
> > > > > > > > > rabbitmq to be able to handle this without these "hidden broken
> > > > > > > > > bindings".
> > > > > > > > >
> > > > > > > > > This just FYI.
> > > > > > > > >
> > > > > > > > >  Fabian
> > > >
> > > >
> > >
> >


