[ops] Something wrong with rabbit settings

9 Jul 2021

Dear all,

I am afraid I have something wrong in the setup of rabbit, and I would
appreciate your help.

We have a single rabbit cluster for all the OpenStack services.
It is a cluster composed of 3 instances (now running rabbitmq-server-3.8.16
and erlang-24.0.2).

We are using this setting:

cluster_partition_handling = pause_minority

to handle partitions, and we are using this policy:

[root@cld-rbt-01 ~]# rabbitmqctl list_policies
Listing policies for vhost "/" ...
vhost   name     pattern        apply-to   definition          priority
/       ha-all   ^(?!amq\.).*   all        {"ha-mode":"all"}   0
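
For readers less familiar with the pattern in the ha-all policy: it uses a negative lookahead, so it should apply to every queue whose name does not start with "amq.". The snippet below is only an illustration using Python's re module (not RabbitMQ's own regex engine), and the queue names in it are made up for the example.

```python
import re

# Pattern copied verbatim from the ha-all policy above.
HA_ALL_PATTERN = re.compile(r"^(?!amq\.).*")


def is_mirrored(queue_name: str) -> bool:
    """Return True if the policy pattern matches the given queue name."""
    return HA_ALL_PATTERN.match(queue_name) is not None


# Hypothetical queue names, for illustration only.
for name in ["notifications.info", "q-agent-notifier", "amq.gen-abc123", "amqp_reply"]:
    print(name, is_mirrored(name))
```

Note that only the literal prefix "amq." is excluded; a name such as "amqp_reply" still matches.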

In the conf files of the OpenStack services we have these settings related
to Rabbit:

transport_url = rabbit://openstack_prod:xxx@192.168.60.220:5672,openstack_prod:xxx@192.168.60.221:5672,openstack_prod:xxx@192.168.60.222:5672
rabbit_ha_queues = true
...
From time to time rabbit complains about some network partitions (still not
clear why):
2021-07-02 08:12:55.715 [error] <0.463.0> Partial partition detected:
 * We saw DOWN from rabbit@cld-rbt-02
 * We can still see rabbit@cld-rbt-03 which can see rabbit@cld-rbt-02
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers
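
To make the log message above easier to read, here is a very simplified model of the pause_minority decision as I understand it from the message and the RabbitMQ partitions documentation. It is not RabbitMQ code: the function names are hypothetical and the rule is reduced to "pause without a strict majority, and always pause on a partial partition".

```python
def has_majority(visible_nodes: int, cluster_size: int) -> bool:
    """Strict majority: more than half of the cluster, counting the node itself."""
    return 2 * visible_nodes > cluster_size


def should_pause(visible_nodes: int, cluster_size: int, partial_partition: bool) -> bool:
    """Simplified pause_minority rule (an assumption, not the real implementation).

    A node that detects a partial partition pauses regardless of how many
    peers it can still see; otherwise it pauses only when it is in a minority.
    """
    if partial_partition:
        return True
    return not has_majority(visible_nodes, cluster_size)


# Case from the log: 3-node cluster, the node still sees itself and cld-rbt-03,
# but the partition is partial, so it pauses anyway.
print(should_pause(2, 3, partial_partition=True))
# Same visibility without a partial partition: the node keeps running.
print(should_pause(2, 3, partial_partition=False))
# A node isolated from both peers is in the minority and pauses.
print(should_pause(1, 3, partial_partition=False))
```

In this model the node in the log pauses even though it still sees two of the three nodes, which matches the "pause until the *entire* cluster recovers" line.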

and when this happens OpenStack services are of course impacted.

By the time I get to check the cluster status, rabbitmqctl cluster_status
doesn't report any problems (i.e. it shows no network partitions and it
reports all nodes running), but the problems on the OpenStack services are
still there (e.g. "neutron agent-list" reports many agents down).
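
The check described above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes rabbitmqctl accepts "--formatter json" for cluster_status and that the JSON contains the keys "disk_nodes", "ram_nodes", "running_nodes" and "partitions" (please verify against the output of your own version). The helper names and the sample status are hypothetical.

```python
import json
import subprocess


def summarize_cluster_status(status: dict) -> dict:
    """Summarize a parsed cluster_status document (key names are assumptions)."""
    members = set(status.get("disk_nodes", [])) | set(status.get("ram_nodes", []))
    running = set(status.get("running_nodes", []))
    partitions = status.get("partitions") or {}
    return {
        "stopped_nodes": sorted(members - running),
        "partitions": partitions,
        # Healthy here only means: no partitions reported and all members running.
        "healthy": not partitions and members <= running,
    }


def fetch_cluster_status() -> dict:
    """Run rabbitmqctl on the local node and parse its JSON output."""
    result = subprocess.run(
        ["rabbitmqctl", "cluster_status", "--formatter", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hypothetical sample, shaped like the situation described above:
# all three nodes running and no partitions reported.
sample = {
    "disk_nodes": ["rabbit@cld-rbt-01", "rabbit@cld-rbt-02", "rabbit@cld-rbt-03"],
    "ram_nodes": [],
    "running_nodes": ["rabbit@cld-rbt-01", "rabbit@cld-rbt-02", "rabbit@cld-rbt-03"],
    "partitions": {},
}
print(summarize_cluster_status(sample))
```

On a broker node, summarize_cluster_status(fetch_cluster_status()) would give the live view. A "healthy" result only reflects the broker side; as described above, the OpenStack agents can still be down at that point.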

I need to restart the rabbit cluster to get the OpenStack services
working again.

Any hints?

Thanks, Massimo

Massimo Sgaravatto