[ops] Something wrong with rabbit settings
Dear all,

I am afraid I have something wrong in the setup of rabbit, and I would appreciate your help.

We have a single rabbit cluster for all the OpenStack services. It is a cluster composed of 3 instances (now running rabbitmq-server-3.8.16 and erlang-24.0.2).

We are using this setting to handle partitions:

cluster_partition_handling = pause_minority

and we are using this policy:

[root@cld-rbt-01 ~]# rabbitmqctl list_policies
Listing policies for vhost "/" ...
vhost   name    pattern        apply-to   definition           priority
/       ha-all  ^(?!amq\.).*   all        {"ha-mode":"all"}    0

In the conf files of the OpenStack services we have these settings related to Rabbit:

transport_url = rabbit://openstack_prod:xxx@192.168.60.220:5672,openstack_prod:xxx@192.168.60.221:5672,openstack_prod:xxx@192.168.60.222:5672
rabbit_ha_queues = true
From time to time rabbit complains about some network partitions (still not clear why):
2021-07-02 08:12:55.715 [error] <0.463.0> Partial partition detected:
 * We saw DOWN from rabbit@cld-rbt-02
 * We can still see rabbit@cld-rbt-03 which can see rabbit@cld-rbt-02
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers

and when this happens the OpenStack services are of course impacted.

As soon as I can check the cluster status, rabbitmqctl cluster_status doesn't complain about any problems (i.e. it doesn't report any network partitions and it reports all nodes running), but the problems on the OpenStack services are still there (e.g. "neutron agent-list" reports many agents down).

I need to restart the rabbit cluster in order to have the OpenStack services working again.

Any hints?

Thanks, Massimo
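P.S. For reference, this is roughly what I check when it happens (only a sketch; the list_queues line and the grep filter are just examples):

# on one of the rabbit nodes
rabbitmqctl cluster_status
rabbitmqctl list_queues name messages consumers | grep -i agent

# on the controller, to see the impact on the OpenStack side
neutron agent-list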
Hi,

well, you should try to avoid these "split brains" as much as possible, because RabbitMQ is really bad at handling them (from my experience). So before trying to optimize RabbitMQ, try to find the reason for these splits!

Nevertheless, in my experience there are two ways of running RabbitMQ as stable as possible:

* Switch off HA queues/replication and enable durable queues
  + higher performance, less CPU load, as replication takes a lot of power
  - you will lose all messages saved on a broken node (but OpenStack is quite good at handling this)

or

* Switch on HA queues and disable durable queues
  + only losing messages not yet replicated
  - high CPU usage

Both ways are not 100% failsafe. (A rough config sketch of both options is at the end of this mail.)

Further points:

* Maybe you can have a look at the quorum-based queues of the latest RabbitMQ version (afaik nobody tested these for OpenStack?)
* https://github.com/devfaz/rabbitmq_debug - I wrote some scripts to detect "misbehavior" of RabbitMQ; you could implement the checks in your cluster and hopefully next time just have to clean up your queues without restarting all services.

Fabian
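P.S. A rough sketch of what the two options above look like in the [oslo_messaging_rabbit] section of the OpenStack service config files (the option names are from oslo.messaging; pick one block, don't combine them):

# Option 1: no replication, durable queues
# (also drop the mirroring policy on the rabbit side: rabbitmqctl clear_policy ha-all)
[oslo_messaging_rabbit]
rabbit_ha_queues = false
amqp_durable_queues = true

# Option 2: mirrored (HA) queues, non-durable queues (keep the ha-all policy)
[oslo_messaging_rabbit]
rabbit_ha_queues = true
amqp_durable_queues = false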
On Fri, 2021-07-09 at 15:07 +0200, Fabian Zimmermann wrote:
Hi,
> well, you should try to avoid these "split brains" as much as possible, because rabbitmq is really bad in handling this (from my experience). So before trying to optimize rabbitmq - try to find the reason for these splits!

Yep, I can't agree more.
> Nevertheless it seems (just my experience) - there are 2 ways of running rabbitmq as stable as possible.
> * Switch off HA-Queues/Replication and enable durable Queues
> + higher performance, less cpu-load as replication takes a lot of power
> - you will lose all msgs saved on a broken node (but openstack is quite good in handling this)

Actually, I don't really think it is.
At least from a nova perspective, if we send a cast (for example from the API) and it's lost, then we won't try to recover. In the case of an RPC call, the timeout will fire and we will fail whatever operation we were doing. So if by "openstack is quite good at handling this" you mean that we will do nothing to recover and just fail, then yes. But losing RPC messages can end up with VMs stuck in building or deleting etc., and will require the operator to go fix that. We have recently started to use things like the mandatory flag to ensure that a message is at least enqueued, but in general we assume the RPC bus is perfect in most cases and don't try to add additional handling for lost requests.
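Just to illustrate what the mandatory flag buys at the AMQP level - a minimal pika sketch, not how oslo.messaging is actually wired up, and the exchange/routing key are made up:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("192.168.60.220"))
ch = conn.channel()
ch.confirm_delivery()  # publisher confirms, so unroutable returns surface as exceptions

try:
    ch.basic_publish(
        exchange="nova",                     # made-up exchange name
        routing_key="compute.cld-host-01",   # made-up routing key
        body=b"rpc payload",
        mandatory=True,  # broker returns the message if no queue is bound, instead of dropping it
    )
except pika.exceptions.UnroutableError:
    print("message was not enqueued anywhere - fail fast instead of waiting for a timeout")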
> or
> * Switch on HA-Queues and disable durable Queues
> + only losing msgs not yet replicated
> - high cpu usage
This sounds like the better solution than using durable queues, which also have a performance hit if you have slow IO on the rabbit host.
> both ways are not 100% failsafe.
>
> Further points:
> * Maybe you can have a look at the quorum-based queues of the latest rabbitmq version (afaik nobody tested these for openstack?)
I had not heard of them until now, but they seem interesting: https://www.rabbitmq.com/quorum-queues.html#feature-matrix. I'm not 100% sure if they would fit our use case, but they may in fact have some features that would be beneficial.
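For reference, quorum queues are selected per queue at declare time, not via a policy like ha-all - a minimal pika sketch with a made-up queue name:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("192.168.60.220"))
ch = conn.channel()
ch.queue_declare(
    queue="test_quorum",                     # made-up queue name
    durable=True,                            # quorum queues must be durable
    arguments={"x-queue-type": "quorum"},    # this is what makes it a quorum queue
)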
> * https://github.com/devfaz/rabbitmq_debug - I wrote some scripts to detect "misbehavior" of rabbitmq, you could implement the checks in your cluster and hopefully - next time just have to cleanup your queues without restarting all services.
>
> Fabian
Hi,

On Fri, 9 Jul 2021 at 16:04, Sean Mooney <smooney@redhat.com> wrote:
> At least from a nova perspective, if we send a cast (for example from the API) and it's lost, then we won't try to recover.
>
> In the case of an RPC call, the timeout will fire and we will fail whatever operation we were doing.
Well, it's a lot better to have consistent state with a limited number of failed requests than having the whole cluster stuck, and it normally affects only a limited number of requests (if any!). So I personally prefer: fail fast and restore :)

Fabian
Thanks a lot for your replies.

I indeed forgot to say that I am using durable queues (i.e. I set amqp_durable_queues = true in the conf files of the OpenStack services).

I'll investigate further the root cause of these network partitions, but I implemented this rabbit cluster exactly to be able to manage such scenarios ... Looks like I can have a much more reliable system with a single rabbit instance ...

Moreover: is it normal/expected that it doesn't recover by itself?

Thanks, Massimo
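P.S. In case it helps, these are the client-side options in [oslo_messaging_rabbit] that control how quickly the services notice a dead broker and reconnect. This is only a sketch: the values below are illustrative examples, not what we actually run.

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 60   # seconds of silence before the broker connection is considered dead
heartbeat_rate = 2                 # how many times per timeout window the heartbeat is checked
kombu_reconnect_delay = 1.0        # seconds to wait before reconnecting (e.g. after a consumer cancel)
rabbit_retry_interval = 1          # seconds between connection retries
rabbit_retry_backoff = 2           # additional backoff added between retries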
On 7/12/21 9:20 AM, Massimo Sgaravatto wrote:
> Thanks a lot for your replies
>
> I indeed forgot to say that I am using durable queues (i.e. I set amqp_durable_queues = true in the conf files of the OpenStack services).
>
> I'll investigate further the root cause of these network partitions, but I implemented this rabbit cluster exactly to be able to manage such scenarios ... Looks like I can have a much more reliable system with a single rabbit instance ...
>
> Moreover: is it normal/expected that it doesn't recover by itself?
There is a pacemaker OCF RA [0] that automatically recovers from network partitions, mostly by resetting the Mnesia DB of failed nodes that cannot join.

[0] https://www.rabbitmq.com/pacemaker.html#auto-pacemaker
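By hand, the recovery it automates is roughly this sequence, run on the node that got stuck (node name taken from the log above):

rabbitmqctl stop_app
rabbitmqctl reset                            # wipes the local Mnesia DB on this node
rabbitmqctl join_cluster rabbit@cld-rbt-01   # rejoin via a healthy node
rabbitmqctl start_app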
--
Best regards,
Bogdan Dobrelya,
IRC #bogdando
participants (4)

- Bogdan Dobrelya
- Fabian Zimmermann
- Massimo Sgaravatto
- Sean Mooney