Hi,
I definitely recommend upgrading to RabbitMQ 3.8.
Also enable durable queues.
This helped us a lot in managing our clusters.
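
For anyone looking for the setting itself, here is a minimal sketch of the durable-queues option as a service config fragment. The option name amqp_durable_queues is the one quoted further down in this thread. Putting it in the [oslo_messaging_rabbit] section is my assumption from the oslo.messaging option layout, and the helper function is purely illustrative.

```python
import configparser
import io


def render_durable_queue_config() -> str:
    """Render a config fragment that enables durable queues."""
    config = configparser.ConfigParser()
    # Assumption: the option lives in the [oslo_messaging_rabbit] section.
    config["oslo_messaging_rabbit"] = {"amqp_durable_queues": "True"}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Prints the fragment to paste into a service configuration file.
    print(render_durable_queue_config())
```

Note that RabbitMQ does not change the durability of a queue that already exists, so queues declared earlier as non-durable would likely have to be deleted and recreated after switching the option.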

We also applied the policy, which was originally taken from openstack ansible [1].

We also collect unroutable messages and send alerts based on that (but we have had no issues recently thanks to all of the above).

Finally we ping all our clients using oslo_ping_endpoint [2] every five minutes so we know when an agent is disconnected.
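
The bookkeeping behind such a check can be sketched as below. This is not our actual tooling: the ping callable is a hypothetical stand-in for whatever RPC call reaches the ping endpoint from [2], the "pong" reply value is an assumption, and the agent names are made up.

```python
from typing import Callable, Iterable, List


def find_disconnected(agents: Iterable[str], ping: Callable[[str], str]) -> List[str]:
    """Return the agents that did not answer the ping."""
    missing = []
    for agent in agents:
        try:
            # Assumption: a healthy agent answers with the string "pong".
            answered = ping(agent) == "pong"
        except Exception:  # e.g. an RPC timeout raised by the messaging layer
            answered = False
        if not answered:
            missing.append(agent)
    return missing


if __name__ == "__main__":
    # Hypothetical stand-in for the real RPC call: only compute-1 answers.
    def fake_ping(agent: str) -> str:
        if agent == "compute-1":
            return "pong"
        raise TimeoutError(agent)

    print(find_disconnected(["compute-1", "compute-2"], fake_ping))
```

Running something like this from a scheduler every 300 seconds and alerting on a non-empty result gives the five-minute check described above.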

Dunno about kolla, sorry.

[1] https://github.com/openstack/openstack-ansible-rabbitmq_server/blob/fc27e735a68b64cb3c67dd8abeaf324803a9845b/defaults/main.yml#L172

[2] https://opendev.org/openstack/oslo.messaging/commit/82492442f3387a0e4f19623ccfda64f8b84d59c3

On 1 December 2021 19:15:41 GMT+01:00, "Braden, Albert" <abraden@verisign.com> wrote:
I read this with great interest because we are seeing this issue. Questions:

1. We are running kolla-ansible Train, and our RMQ version is 3.7.23. Should we be upgrading our Train clusters to use 3.8.x?
2. Document [2] recommends policy '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'. I don't see this in our ansible playbooks, nor in any of the config files in the RMQ container. What would this look like in Ansible, and what should the resulting container config look like?
3. It appears that we are not setting "amqp_durable_queues = True". What does this setting look like in Ansible, and what file does it go into?
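
To see which queues the pattern in question 2 selects, it can be tried against a few names. This is only a sketch: the queue names are hypothetical examples, and it uses Python's regex engine rather than RabbitMQ's own, on the assumption that the negative lookahead behaves the same way in both.

```python
import re

# Pattern quoted in question 2 above.
POLICY_PATTERN = re.compile(r"^(?!(amq\.)|(.*_fanout_)|(reply_)).*")


def policy_applies(queue_name: str) -> bool:
    """Return True if the policy pattern matches the given queue name."""
    return POLICY_PATTERN.match(queue_name) is not None


if __name__ == "__main__":
    # Hypothetical queue names, for illustration only.
    names = [
        "conductor",
        "notifications.info",
        "amq.gen-Xa2",
        "reply_4f3c",
        "scheduler_fanout_9b1d",
    ]
    for name in names:
        print(f"{name}: {policy_applies(name)}")
```

In other words, the policy skips names starting with "amq.", names starting with "reply_", and names containing "_fanout_", and applies to everything else.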

Does anyone have a sample set of RMQ config files that they can share?

It looks like my Outlook has ruined the link; reposting:
[2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit

-----Original Message-----
From: Arnaud Morin <arnaud.morin@gmail.com>
Sent: Monday, November 29, 2021 2:04 PM
To: Bogdan Dobrelya <bdobreli@redhat.com>
Cc: DHilsbos@performair.com; openstack-discuss@lists.openstack.org
Subject: [EXTERNAL] Re: [ops]RabbitMQ High Availability

Hi,

After a discussion on this ml (which starts at [1]), we ended up writing
documentation with the Large Scale group.
The doc is available at [2].

Hope this will help.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.html
[2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit

On 24.11.21 - 11:31, Bogdan Dobrelya wrote:
On 11/24/21 12:34 AM, DHilsbos@performair.com wrote:
All;

In the time I've been part of this mailing list, the subject of RabbitMQ high availability has come up several times, and each time specific recommendations for both Rabbit and OpenStack are provided. I remember it being an A or B kind of recommendation (i.e. configure Rabbit like A1, and OpenStack like A2, OR configure Rabbit like B1, and OpenStack like B2).

There are no special recommendations for the rabbitmq setup for openstack,
but probably a few general ones: for instance, instead of putting it behind
haproxy or the like, list the rabbit cluster nodes directly in the oslo
messaging config settings. Also, it seems that durable queues make very
little sense for highly ephemeral RPC calls, just by design. I would
also add that the raft quorum queues feature of rabbitmq >=3.8 does
not fit well into the oslo messaging design for RPC calls either.
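
Listing the cluster nodes directly can be sketched as building a single transport URL with one entry per node. This is a minimal illustration that assumes the comma-separated multi-host form of the oslo.messaging transport URL. The host names, credentials, and vhost are hypothetical, and the percent-encoding of credentials is my assumption about how special characters should be handled.

```python
from urllib.parse import quote


def build_transport_url(nodes, user, password, vhost, port=5672):
    """Build a transport URL that lists every RabbitMQ node."""
    # Assumption: credentials are percent-encoded so that characters
    # such as '@' or ':' cannot break the URL structure.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    hosts = ",".join(f"{creds}@{node}:{port}" for node in nodes)
    # vhost is expected to be a plain name without a leading slash.
    return f"rabbit://{hosts}/{vhost}"


if __name__ == "__main__":
    # Hypothetical three-node cluster.
    print(build_transport_url(
        ["rabbit1", "rabbit2", "rabbit3"], "openstack", "secret", "openstack"
    ))
```

The resulting string would go into the transport_url option of each service, so that the client itself can fail over between nodes without a load balancer in between.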

Another debatable and highly opinionated topic is how to configure the
ha/mirror queue policy params for queues used for RPC calls vs
broadcast notifications.

And my biased, humble personal recommendation is: use the upstream OCF RA
[0][1] if you manage the rabbitmq cluster with pacemaker.

[0] https://www.rabbitmq.com/pacemaker.html#auto-pacemaker

[1] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-server-ha


Unfortunately, I can't find the previous threads on this topic.

Does anyone have this information, that they would care to share with me?

Thank you,

Dominic L. Hilsbos, MBA
Vice President - Information Technology
Perform Air International Inc.
DHilsbos@PerformAir.com
www.PerformAir.com





--
Best regards,
Bogdan Dobrelya,
Irc #bogdando