Open Stack

Fri Jul 9 13:07:15 UTC 2021

Hi,

Well, you should try to avoid these "split brains" as much as
possible, because rabbitmq is really bad at handling them (in my
experience).
So before trying to optimize rabbitmq, try to find the reason for these splits!
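
To help narrow down when the splits happen, here is a minimal sketch that pulls partition-related entries out of a rabbitmq log so the timestamps can be compared with network or host events. It assumes the line layout shown in the log excerpt quoted below (timestamp, level in brackets, Erlang pid, message). The function name partition_events is made up for illustration, and other rabbitmq versions may log in a different format.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log excerpt quoted in this thread:
# 2021-07-02 08:12:55.715 [error] <0.463.0> Partial partition detected:
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[(\w+)\] \S+ (.*)$"
)


def partition_events(lines):
    """Return (timestamp, level, message) for log lines mentioning a partition.

    Hypothetical helper: continuation lines without a timestamp are skipped.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "partition" in match.group(3).lower():
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((stamp, match.group(2), match.group(3)))
    return events
```

You could feed it the lines of the rabbitmq log on each node (the log path depends on your installation) and check whether the timestamps line up with switch, NIC or hypervisor events.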

Nevertheless, it seems (just my experience) that there are two ways of
running rabbitmq as stably as possible.

* Switch off HA-Queues/Replication and enable durable Queues

+ higher performance, less cpu-load as replication takes a lot of power
- you will lose all msgs saved on a broken node (but openstack is
quite good at handling this)

or

* Switch on HA-Queues and disable durable Queues

+ you only lose msgs not yet replicated
- high cpu usage

Neither way is 100% failsafe.
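
To see which of the two variants a service is currently configured for, something like the following sketch may help. It assumes the relevant oslo.messaging options are rabbit_ha_queues and amqp_durable_queues in the [oslo_messaging_rabbit] section and that both default to false when unset. The helper name rabbit_mode and the returned labels are made up for illustration. Mirroring is also governed by the policy on the rabbitmq side (the ha-all policy in the quoted message), which this sketch does not look at.

```python
import configparser

# Assumption: both options live in this section of the service config file.
SECTION = "oslo_messaging_rabbit"


def rabbit_mode(conf_text):
    """Classify a service config by its queue settings (illustrative helper)."""
    # interpolation=None keeps '%' in values literal; strict=False tolerates
    # duplicate options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    ha = parser.getboolean(SECTION, "rabbit_ha_queues", fallback=False)
    durable = parser.getboolean(SECTION, "amqp_durable_queues", fallback=False)
    if durable and not ha:
        return "durable, no HA"   # first variant above
    if ha and not durable:
        return "HA, not durable"  # second variant above
    if ha and durable:
        return "HA and durable"
    return "neither"
```

For example, rabbit_mode(open("/etc/nova/nova.conf").read()) would report the variant for nova, assuming that is where the config file lives on your hosts.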

Further points:

* Maybe you can have a look at the quorum-based queues of the latest
rabbitmq version (afaik nobody has tested these for openstack?)
* https://github.com/devfaz/rabbitmq_debug - I wrote some scripts to
detect "misbehavior" of rabbitmq. You could implement the checks in
your cluster and hopefully, next time, you will just have to clean up
your queues without restarting all services.

 Fabian

On Fri, 9 July 2021 at 10:04, Massimo Sgaravatto
<massimo.sgaravatto at gmail.com> wrote:
>
> Dear all
>
> I am afraid I have something wrong in the setup of rabbit, and I would appreciate your help.
>
>
> We have a single rabbit cluster for all the OpenStack services.
> It is a cluster composed of 3 instances (now running rabbitmq-server-3.8.16 and erlang-24.0.2).
>
> We are using this setting:
>
> cluster_partition_handling = pause_minority
>
> to handle partitions, and we are using this policy:
>
>
> [root at cld-rbt-01 ~]# rabbitmqctl list_policies
> Listing policies for vhost "/" ...
> vhost   name     pattern         apply-to   definition           priority
> /       ha-all   ^(?!amq\.).*    all        {"ha-mode":"all"}    0
>
>
> In the conf files of the OpenStack services we have these settings related to Rabbit:
>
> transport_url = rabbit://openstack_prod:xxx@192.168.60.220:5672,openstack_prod:xxx@192.168.60.221:5672,openstack_prod:xxx@192.168.60.222:5672
> rabbit_ha_queues = true
>
>
>
> From time to time rabbit complains about some network partitions (still not clear why):
>
> 2021-07-02 08:12:55.715 [error] <0.463.0> Partial partition detected:
>  * We saw DOWN from rabbit at cld-rbt-02
>  * We can still see rabbit at cld-rbt-03 which can see rabbit at cld-rbt-02
>  * pause_minority mode enabled
> We will therefore pause until the *entire* cluster recovers
>
>
> and when this happens Openstack services are of course impacted.
>
> As soon as I can check the cluster status, rabbitmqctl cluster_status doesn't complain about any problems (i.e. it doesn't report any network partitions and it reports all nodes running) but the problems on the OpenStack services are still there (e.g. "neutron agent-list" reports many agents down).
>
>
> I need to restart the rabbit cluster in order to have OpenStack services working again
>
> Any hints?
>
> Thanks, Massimo

[ops] Something wrong with rabbit settings
