Dear all,

I am afraid I have something wrong in my RabbitMQ setup, and I would appreciate your help.

We have a single RabbitMQ cluster for all the OpenStack services. The cluster is composed of 3 instances (now running rabbitmq-server-3.8.16 and erlang-24.0.2).

We are using this setting to handle partitions:

    cluster_partition_handling = pause_minority

and we are using this policy:

    [root@cld-rbt-01 ~]# rabbitmqctl list_policies
    Listing policies for vhost "/" ...
    vhost  name    pattern       apply-to  definition         priority
    /      ha-all  ^(?!amq\.).*  all       {"ha-mode":"all"}  0

In the conf files of the OpenStack services we have these settings related to Rabbit:

    transport_url = rabbit://openstack_prod:xxx@192.168.60.220:5672, openstack_prod:xxx@192.168.60.221:5672, openstack_prod:xxx@192.168.60.222:5672
    rabbit_ha_queues = true

From time to time rabbit complains about some network partitions (still not clear why):
    2021-07-02 08:12:55.715 [error] <0.463.0> Partial partition detected:
     * We saw DOWN from rabbit@cld-rbt-02
     * We can still see rabbit@cld-rbt-03 which can see rabbit@cld-rbt-02
     * pause_minority mode enabled
    We will therefore pause until the *entire* cluster recovers

and when this happens OpenStack services are of course impacted.

As soon as I can check the cluster status, rabbitmqctl cluster_status doesn't complain about any problems (i.e. it doesn't report any network partitions and it reports all nodes running), but the problems on the OpenStack services are still there (e.g. "neutron agent-list" reports many agents down).

I need to restart the rabbit cluster in order to have OpenStack services working again.

Any hints?

Thanks,
Massimo