[kolla] RabbitMQ High Availability
Hello,

I'm trying to figure out how to configure RabbitMQ to make it highly available. I have 3 controller nodes and 2 compute nodes, deployed with kolla with a mostly default configuration. RabbitMQ is set to an ha-all policy for all queues on all nodes, with amqp_durable_queues = True.

My problem is that when I shut down one controller node (or one RabbitMQ container), whether master or slave, the whole cluster becomes unstable: some instances cannot be created and get stuck in Scheduling or Block Device Mapping, volumes are not shown or are stuck in Creating, and compute nodes are randomly reported as dead.

I'm looking for documentation on how OpenStack uses RabbitMQ, how OpenStack behaves when a RabbitMQ node is down, and how to make RabbitMQ HA in a stable way. Do you have any recommendations?

TIA,
Tan
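For reference, the setup described above amounts to roughly the following sketch. The override path and the exact policy invocation are assumptions (kolla-ansible normally templates both for you), so treat this as an illustration of the settings, not a recipe:

```ini
# Hypothetical oslo.messaging override (e.g. dropped into a kolla config
# override directory); kolla merges this into each service's config.
[oslo_messaging_rabbit]
amqp_durable_queues = True
```

The "ha-all" part is a RabbitMQ policy, not an oslo option; applied by hand it would look something like `rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}' --apply-to queues`, which mirrors every queue to every cluster node.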
On 21/07/2022 11:32, Tan Tran Trong wrote:
Would it be possible to compare this with the approach of running a clustered RabbitMQ service, but without mirrored (and durable) queues? https://review.opendev.org/c/openstack/kolla-ansible/+/824994 It won't solve all failure scenarios, but we have seen it help with controlled shutdowns. We'd be interested in any failure scenarios you find with those settings.
The default RMQ config is broken. You're on the right track with setting durable queues, but there's more to do. I'm running kolla Train with mirrored/durable queues and my clusters work fine with a controller down. One issue we faced after enabling durable queues was that we weren't running redis; when we then tried to run it, the network was blocking the port, but eventually we got it working.

Some have recommended not mirroring queues; I haven't tried that. If anyone has successfully set up HA without mirrored queues, I'd be interested to hear how you did it.

Here are some helpful links:
https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
https://lists.openstack.org/pipermail/openstack-discuss/2021-November/026074...
https://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.h...
https://lists.openstack.org/pipermail/openstack-discuss/2020-August/016524.h...
https://review.opendev.org/c/openstack/kolla-ansible/+/822191
https://review.opendev.org/c/openstack/kolla-ansible/+/824994
Hello,

Thank you for the links. I actually moved from no durable queues + no HA policy to durable queues + an ha-all policy; the result is still the same. I tried tuning using https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit but I guess I'm still missing something.

@Albert: Have you tested the case where you shut down one controller -> things work -> power it back on -> shut down another controller? In my case the cluster is not stable after that. And by "work fine" do you mean you don't have to do anything (restart RabbitMQ, restart OpenStack services) while one controller is down?

I know it sounds silly, but we ended up using only the internal keepalived VIP for all transport settings. That removes load balancing but keeps my cluster stable when one node is down; I really don't know whether it will cause trouble later as the cluster grows.

Regards,
Tan
Something is wrong with your OpenStack or RabbitMQ version; make sure you aren't hitting a bug. I have a 3-node cluster and it always survives when I shut down one controller node. It works perfectly fine without issues, with either the HA or the non-HA config.

What versions of OpenStack and RabbitMQ are you running?
Hi,

My RMQ version is 3.8.32. I deployed the Xena release using kolla-ansible on Ubuntu 20.04.

Right now my cluster is running with no HA and amqp_durable_queues = False. When I shut down one controller and create an instance, I get this error in nova-scheduler:

2022-07-25 10:36:41.496 688 ERROR root oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on x.x.x.x:5672 after inf tries: Queue.declare: (404) NOT_FOUND - home node 'rabbit@control02' of durable queue 'scheduler' in vhost '/' is down or inaccessible

Regards,
Tan
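The 404 above says the durable 'scheduler' queue has its home node on control02, and a non-mirrored durable queue cannot be re-declared on another node while its home node is down. A couple of hedged diagnostic commands (the container name and vhost are assumptions; adjust for your deployment):

```shell
# Run on a surviving controller: is control02 still a running cluster member,
# and which queues are durable and under which policy?
docker exec rabbitmq rabbitmqctl cluster_status
docker exec rabbitmq rabbitmqctl list_queues -p / name durable policy
```

If 'scheduler' shows as durable with no mirroring policy, any consumer whose queue lived on the downed node will keep failing until that node returns or the queue is deleted and re-declared.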
Hi,

Try upgrading your RabbitMQ; experience tells us that RabbitMQ gets better with every version.

In the earlier threads on this list we ran some tests, and the results were written up at https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit

To summarize, the most stable config seems to be: durable queues + HA replication *only* for long-lived queues, with the short-lived ones left non-replicated. It's also the config used by the openstack-ansible project. This is mostly achieved with RabbitMQ policies, as documented in the wiki.

If you also have issues while all 3 nodes are running, it may be useful to clear your rabbitmq vhost/mnesia data and start from a clean data dir.

Fabian
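As an illustration of Fabian's point: the wiki's approach hinges on a policy pattern that excludes short-lived queues (oslo.messaging reply_ queues and _fanout_ queues) from mirroring. The exact pattern below is an assumption modelled on that advice; check the wiki page for the current recommendation:

```python
import re

# Sketch of a selective-mirroring pattern: mirror long-lived queues, but skip
# transient reply_ and _fanout_ queues and RabbitMQ's own amq.* queues.
# (The pattern itself is an assumption, not copied from a live deployment.)
HA_PATTERN = re.compile(r"^(?!(amq\.)|(.*_fanout_)|(reply_)).*")

for queue in ["scheduler", "compute.host1", "reply_3f2a",
              "q-agent-notifier_fanout_ab", "amq.gen-XYZ"]:
    mirrored = bool(HA_PATTERN.match(queue))
    print(f"{queue}: {'mirrored' if mirrored else 'not mirrored'}")
```

Applied as a RabbitMQ policy it would look something like `rabbitmqctl set_policy ha '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode":"all"}' --apply-to queues`, so that only the long-lived RPC queues pay the replication cost.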
participants (5)
- Albert Braden
- Doug Szumski
- Fabian Zimmermann
- Satish Patel
- Tan Tran Trong