Open Stack

Thu Aug 6 14:40:16 UTC 2020

Hey all,

I would like to ask the community about a rabbit issue we have from time
to time.

In our current architecture, we have a cluster of rabbits (3 nodes) for
all our OpenStack services (mostly nova and neutron).

When one node of this cluster is down, the cluster continues working (we
use the pause_minority strategy).
But sometimes the third server is not able to recover automatically
and needs manual intervention.
After this intervention, we restart the rabbitmq-server process, which
is then able to rejoin the cluster.

At this point, the cluster looks ok, everything seems fine.
BUT, nothing works.
Neutron and nova agents are not able to report back to the servers.
They appear dead.
The servers do not seem to be able to consume messages.
The exchanges, queues and bindings look good in rabbit.

What we see is that removing the bindings (using rabbitmqadmin delete
binding or the web interface) and recreating them (using the same
routing key) brings the service back up and running.

Doing this for all queues is really painful. Our next plan is to
automate it, but has anyone in the community already seen this kind
of issue?
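
A minimal sketch of what that automation could look like, assuming the
RabbitMQ management plugin is enabled and its HTTP API is reachable. The
function names (rebind_plan, apply_plan), the base URL and the
credentials are hypothetical placeholders, not something we run today.
The binding dicts are expected in the shape returned by
GET /api/bindings/<vhost>. Note that messages published between the
delete and the recreate of a binding would not be routed to the queue.

```python
import base64
import json
import urllib.parse
import urllib.request


def quote(part):
    # Percent-encode one URL path segment; the default vhost "/" becomes "%2F".
    return urllib.parse.quote(part, safe="")


def rebind_plan(bindings, vhost="/"):
    """Build (method, path, body) steps that delete and recreate each
    exchange-to-queue binding with the same routing key."""
    steps = []
    for b in bindings:
        # Skip exchange-to-exchange bindings and the implicit default-exchange
        # bindings (source == ""), which cannot be deleted.
        if b["destination_type"] != "queue" or b["source"] == "":
            continue
        base = "/api/bindings/{}/e/{}/q/{}".format(
            quote(vhost), quote(b["source"]), quote(b["destination"])
        )
        steps.append(("DELETE", base + "/" + quote(b["properties_key"]), None))
        steps.append(
            (
                "POST",
                base,
                {
                    "routing_key": b["routing_key"],
                    "arguments": b.get("arguments", {}),
                },
            )
        )
    return steps


def apply_plan(steps, base_url, user, password):
    """Send each step to the management API (assumed endpoint, basic auth)."""
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    for method, path, body in steps:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(base_url + path, data=data, method=method)
        req.add_header("Authorization", "Basic " + token)
        req.add_header("Content-Type", "application/json")
        urllib.request.urlopen(req).close()
```

Building the plan separately from sending it makes it possible to print
and review the steps before touching a production cluster, e.g.
apply_plan(rebind_plan(bindings), "http://localhost:15672", "guest", "guest")
with placeholder host and credentials.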

Our bug looks like the one described in [1].
Someone there recommends creating an Alternate Exchange.
Has anyone already tried that?

FYI, we are running rabbit 3.8.2 (with OpenStack Stein).
We had the same kind of issue with older versions of rabbit.

Thanks for your help.

[1] https://groups.google.com/forum/#!newtopic/rabbitmq-users/rabbitmq-users/zFhmpHF2aWk

-- 
Arnaud Morin

[nova][neutron][oslo][ops] rabbit bindings issue
