[nova][neutron][oslo][ops][kolla] rabbit bindings issue
Fabian Zimmermann
dev.faz at gmail.com
Mon Aug 17 14:03:39 UTC 2020
Just to keep the list updated.
If you run with durable_queues and replication, there is still a
possibility that a short-lived queue will *not* yet be replicated
when a node fails, and the failure will mark that queue as
"unreachable". This wouldn't be a problem if OpenStack created a new
queue, but I fear it would just try to reuse the existing one after
reconnecting.
So, after all, it seems the least buggy way would be:
* use durable queues and replication for long-running queues/exchanges
* use non-durable queues without replication for short-lived (fanout, reply_) queues
This should allow the short-lived ones to destroy themselves on node
failure, and the long-lived ones should be as available as
possible.
Absolutely untested - so use with caution, but here is a possible
policy-regex: ^(?!amq\.)(?!reply_)(?!.*fanout).*
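A quick way to sanity-check which names the pattern selects is to run it
against a few sample queue names. This is only a sketch: the queue names
below are made-up examples in the usual oslo.messaging style, and Python's
re module is not the regex engine RabbitMQ itself uses, so it only checks
the logic of the lookaheads, not the broker's behaviour.

```python
import re

# Pattern quoted from the message above (untested against a real broker).
POLICY_PATTERN = re.compile(r"^(?!amq\.)(?!reply_)(?!.*fanout).*")


def policy_applies(queue_name: str) -> bool:
    """Return True if the pattern matches, i.e. the policy would select this name."""
    return POLICY_PATTERN.match(queue_name) is not None


# Hypothetical queue names, for illustration only.
SAMPLE_NAMES = [
    "compute.host1",         # long-running queue
    "notifications.info",    # long-running queue
    "reply_0a1b2c3d",        # short-lived RPC reply queue
    "scheduler_fanout_4e5f", # short-lived fanout queue
    "amq.gen-XyZ",           # broker-generated name
]

if __name__ == "__main__":
    for name in SAMPLE_NAMES:
        print(f"{name}: policy applies = {policy_applies(name)}")
```

Names starting with amq. or reply_, and any name containing fanout, are
excluded; everything else is selected by the policy.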
Fabian
On Sun, Aug 16, 2020 at 15:37, Sean Mooney <smooney at redhat.com> wrote:
>
> On Sat, 2020-08-15 at 20:13 -0400, Satish Patel wrote:
> > Hi Sean,
> >
> > Sounds good, but running RabbitMQ for each service is going to add a little
> > overhead too. Also, how do you scale the cluster? (Yes, we can use cells v2, but it's
> > not something everyone likes to do because of the complexity.)
>
> my understanding is that adding multiple rabbitmq servers to a cluster lowers
> throughput vs just 1 rabbitmq instance for any given exchange. that is because the content of
> the queue needs to be synchronised across the cluster. so if cinder, nova and neutron share
> a 3 node cluster and you compare that to the same services deployed with cinder, nova and neutron
> each having their own rabbitmq service, then the independent deployment will tend to outperform the
> clustered solution. i'm not really sure if that has changed. i know that how clustering is done has evolved
> over the years, but in the past clustering was the adversary of scaling.
>
> > If we think
> > RabbitMQ is a growing pain, then why is the community not looking for
> > an alternative option (kafka etc.)?
> we have looked at alternatives several times.
> rabbitmq works well enough and scales well enough for most deployments.
> there are other amqp implementations that scale better than rabbit:
> activemq and qpid are both reported to scale better, but they perform worse
> out of the box and need to be carefully tuned.
>
> in the past zeromq has been supported but people did not maintain it.
>
> i don't think kafka is a good alternative, but nats https://nats.io/ might be.
>
> for what it's worth, all nova deployments have been cells v2 deployments with 1 cell since around pike/rocky,
> and it's really not that complex. cells_v1 was much more complex, but part of the redesign
> for cells_v2 was making sure there is only 1 code path. adding a second cell just needs another
> cell db and conductor to be deployed, assuming you started with a super conductor in the first
> place. the issue is that cells is only a nova feature; no other service has cells, so it does not help
> you with cinder or neutron. as such, cinder and neutron will likely be the services that hit scaling limits first.
> adopting cells in other services is not necessarily the right approach either, but when we talk about scale
> we do need to keep in mind that cells is just for nova today.
>
>
> >
> > On Fri, Aug 14, 2020 at 3:09 PM Sean Mooney <smooney at redhat.com> wrote:
> > >
> > > On Fri, 2020-08-14 at 18:45 +0200, Fabian Zimmermann wrote:
> > > > Hi,
> > > >
> > > > I read somewhere that vexxhost's Kubernetes openstack-operator is running
> > > > one rabbitmq container per service. Just the Kubernetes self-healing is
> > > > used as "HA" for rabbitmq.
> > > >
> > > > That seems to match my findings: run rabbitmq standalone and use an
> > > > external system to restart rabbitmq if required.
> > >
> > > that's the design that was originally planned for kolla-kubernetes.
> > >
> > > each service was to be deployed with its own rabbitmq server if it required one,
> > > and if it crashed it would just be recreated by k8s. it performs better than a cluster,
> > > and if you trust k8s or the external service enough to ensure it is recreated, it
> > > should be as effective a solution. you don't even need k8s to do that, but it seems to be
> > > a good fit if you're prepared to occasionally lose in-flight rpcs.
> > > if you're not, then you can configure rabbit to persist all messages to disk and mount that on a shared
> > > file system like nfs or cephfs so that when the rabbit instance is recreated the queue contents are
> > > preserved. assuming you can take the performance hit of writing all messages to disk, that is.
> > > >
> > > > Fabian
> > > >
> > > > On Fri, Aug 14, 2020, 16:59, Satish Patel <satish.txt at gmail.com> wrote:
> > > >
> > > > > Fabian,
> > > > >
> > > > > what do you mean?
> > > > >
> > > > > > > > I think vexxhost is running (1) with their openstack-operator - for reasons.
> > > > >
> > > > > On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz at gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Hello again,
> > > > > >
> > > > > > just a short update about the results of my tests.
> > > > > >
> > > > > > I currently see 2 ways of running openstack+rabbitmq
> > > > > >
> > > > > > > 1. without durable-queues and without replication - just one rabbitmq-process which gets (somehow) restarted if it fails.
> > > > > > 2. durable-queues and replication
> > > > > >
> > > > > > Any other combination of these settings leads to more or less issues with
> > > > > >
> > > > > > * broken / non working bindings
> > > > > > * broken queues
> > > > > >
> > > > > > > I think vexxhost is running (1) with their openstack-operator - for reasons.
> > > > > > >
> > > > > > > I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.
> > > > > >
> > > > > > > Could someone point me to the best way to document these findings in some official doc?
> > > > > > > I think a lot of installations out there will run into issues if - under load - a node fails.
> > > > > >
> > > > > > Fabian
> > > > > >
> > > > > >
> > > > > > > On Thu, Aug 13, 2020 at 15:13, Fabian Zimmermann <dev.faz at gmail.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > just did some short tests today in our test-environment (without durable queues and without replication):
> > > > > > >
> > > > > > > * started a rally task to generate some load
> > > > > > > * kill-9-ed rabbitmq on one node
> > > > > > > * rally task immediately stopped and the cloud (mostly) stopped working
> > > > > > >
> > > > > > > > after some debugging I found (again) exchanges which had bindings to queues, but these bindings didn't forward any msgs.
> > > > > > > > Wrote a small script to detect these broken bindings and will now check if this is "reproducible"
> > > > > > > >
> > > > > > > > then I will try "durable queues" and "durable queues with replication" to see if this helps. Although I would expect
> > > > > > > > rabbitmq to be able to handle this without these "hidden broken bindings"
> > > > > > >
> > > > > > > This just FYI.
> > > > > > >
> > > > > > > Fabian
> >
> >
>