Re: Re: [ops] [kolla] RabbitMQ High Availability
Replying from my home email because I've been asked to not email the list from my work email anymore, until I get permission from upper management.

I'm not sure I follow. I was planning to add 2 lines to etc/kolla/config/global.conf:

[oslo_messaging_rabbit]
amqp_durable_queues = False

Is that not sufficient? What is involved in configuring dedicated control exchanges for each service? What would that look like in the config?

From: Herve Beraud <hberaud@redhat.com>
Sent: Thursday, December 9, 2021 2:45 AM
To: Bogdan Dobrelya <bdobreli@redhat.com>
Cc: openstack-discuss@lists.openstack.org
Subject: [EXTERNAL] Re: [ops] [kolla] RabbitMQ High Availability

On Wed, Dec 8, 2021 at 11:48, Bogdan Dobrelya <bdobreli@redhat.com> wrote:

Please see inline
I read this with great interest because we are seeing this issue. Questions:
1. We are running kolla-ansible Train, and our RMQ version is 3.7.23. Should we be upgrading our Train clusters to use 3.8.x?
2. Document [2] recommends the policy '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'. I don't see this in our Ansible playbooks, nor in any of the config files in the RMQ container. What would this look like in Ansible, and what should the resulting container config look like?
3. It appears that we are not setting "amqp_durable_queues = True". What does this setting look like in Ansible, and what file does it go into?
Note that even with the rabbit HA policies adjusted like that and the HA replication factor [0] decreased (e.g. to 2), there might still be high churn caused by a large enough number of replicated durable RPC topic queues. And that might cripple the cloud with the incurred I/O overhead, because a durable queue requires all messages in it to be persisted to disk (on all of the messaging cluster replicas) before they are ack'ed by the broker.
That said, Oslo messaging would likely require more granular control over topic exchanges and the durable-queues flag, to tell it to declare as durable only the most critical paths of a service. A single config setting and a single control exchange per service might not be enough.
Also note that, as a consequence, amqp_durable_queues=True requires dedicated control exchanges configured for each service. Services that use 'openstack' as the default cannot turn the feature on. Changing it to a service-specific exchange might also cause upgrade impact, as described in topic [3]. The same is true for `amqp_auto_delete=True`: it requires dedicated control exchanges, or else it won't work when each service defines its own policy on a shared control exchange (e.g. `openstack`) and the policies differ from each other. [3] https://review.opendev.org/q/topic:scope-config-opts
There are also race conditions with durable queues enabled, like [1]. A solution could be for each service to declare its own dedicated control exchange with its own configuration.
Finally, OpenStack components should perhaps add a *.next CI job to test with durable queues, like [2].
[0] https://www.rabbitmq.com/ha.html#replication-factor
[1] https://zuul.opendev.org/t/openstack/build/aa514dd788f34cc1be3800e6d7dba0e8/...
[2] https://review.opendev.org/c/openstack/nova/+/820523
Does anyone have a sample set of RMQ config files that they can share?
It looks like my Outlook has ruined the link; reposting: [2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
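As a side note for anyone applying document [2]'s recommendation by hand, the policy pattern can be sanity-checked outside RabbitMQ. A small sketch (the queue names are made-up examples, not taken from a real deployment) showing which queues the recommended HA pattern would and would not mirror:

```python
import re

# The HA policy pattern recommended in [2]: mirror everything except
# amq.* broker-generated queues, *_fanout_* queues, and reply_* queues.
pattern = re.compile(r'^(?!(amq\.)|(.*_fanout_)|(reply_)).*')

# Hypothetical queue names, for illustration only.
assert pattern.match('compute.hostname')                      # RPC topic queue: mirrored
assert pattern.match('notifications.info')                    # notification queue: mirrored
assert pattern.match('reply_a1b2c3') is None                  # RPC reply queue: excluded
assert pattern.match('scheduler_fanout_a1b2c3') is None       # fanout queue: excluded
assert pattern.match('amq.gen-XYZ') is None                   # broker-generated queue: excluded
```

This matches the intent described above: only the durable RPC topic and notification queues get HA replication, while the short-lived reply and fanout queues stay unreplicated.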
-- Best regards, Bogdan Dobrelya, Irc #bogdando
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/
https://twitter.com/4383hberaud
If you plan to leave `amqp_durable_queues = False` (i.e. if you plan to keep this config equal to false), then you don't need to add these config lines, as this is already the default value [1].

[1] https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messagin...

On Thu, Dec 9, 2021 at 22:40, Albert Braden <ozzzo@yahoo.com> wrote:
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/
https://twitter.com/4383hberaud
Sorry, that was a transcription error. I thought "True" and my fingers typed "False." The correct lines are:

[oslo_messaging_rabbit]
amqp_durable_queues = True

On Friday, December 10, 2021, 02:55:55 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
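For reference, a minimal sketch of where this override would live in a kolla-ansible deployment, assuming the etc/kolla/config/global.conf path mentioned earlier in this thread:

```ini
; etc/kolla/config/global.conf
; kolla-ansible merges this into the oslo.config files of every service,
; so each service's oslo.messaging client declares its queues as durable.
[oslo_messaging_rabbit]
amqp_durable_queues = True
```

The change only takes effect after the service configs are regenerated and the containers restarted.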
So, your config snippet LGTM.

On Fri, Dec 10, 2021 at 17:50, Albert Braden <ozzzo@yahoo.com> wrote:
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/
https://twitter.com/4383hberaud
Following [1] I successfully set "amqp_durable_queues = True" and restricted HA to the appropriate queues, but I'm having trouble with some of the other settings such as "expires" and "message-ttl". Does anyone have an example of a working kolla config that includes these changes?

[1] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit

On Monday, December 13, 2021, 07:51:32 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
I tried these policies in ansible/roles/rabbitmq/templates/definitions.json.j2:

"policies":[
  {"vhost": "/", "name": "ha-all", "pattern": "^(?!(amq\\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %},
  {"vhost": "/", "name": "notifications-ttl", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"message-ttl":600}, "priority":0},
  {"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"expire":3600}, "priority":0},
  {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}
{% endif %}
]

But I still see unconsumed messages lingering in notifications_extractor.info. From reading the docs I think this setting should cause messages to expire after 600 seconds, and unused queues to be deleted after 3600 seconds. What am I missing?

On Tuesday, December 14, 2021, 04:18:09 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
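One detail worth double-checking against the RabbitMQ policy documentation: the queue-expiry key is spelled "expires" (not "expire"), and both "message-ttl" and "expires" are specified in milliseconds, so 600 seconds would be 600000 and 3600 seconds would be 3600000. A sketch of the two notification policies with those adjustments, keeping the names and patterns from the snippet above:

```json
[
  {"vhost": "/", "name": "notifications-ttl",
   "pattern": "^(notifications|versioned_notifications)\\.",
   "apply-to": "queues",
   "definition": {"message-ttl": 600000},
   "priority": 0},
  {"vhost": "/", "name": "notifications-expire",
   "pattern": "^(notifications|versioned_notifications)\\.",
   "apply-to": "queues",
   "definition": {"expires": 3600000},
   "priority": 0}
]
```

Note that with the values as written in the original snippet, messages would expire after 0.6 seconds and idle queues would be deleted after 3.6 seconds, which may explain unexpected behavior once the key names are fixed.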
Now that the holidays are over I'm trying this one again. Can anyone help me figure out how to set "expires" and "message-ttl"?

On Thursday, December 16, 2021, 01:43:57 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
The correct lines are: [oslo_messaging_rabbit] amqp_durable_queues = True On Friday, December 10, 2021, 02:55:55 AM EST, Herve Beraud <hberaud@redhat.com> wrote: If you plan to let `amqp_durable_queues = False` (i.e if you plan to keep this config equal to false), then you don't need to add these config lines as this is already the default value [1]. [1] https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messagin... Le jeu. 9 déc. 2021 à 22:40, Albert Braden <ozzzo@yahoo.com> a écrit : Replying from my home email because I've been asked to not email the list from my work email anymore, until I get permission from upper management. I'm not sure I follow. I was planning to add 2 lines to etc/kolla/config/global.conf: [oslo_messaging_rabbit] amqp_durable_queues = False Is that not sufficient? What is involved in configuring dedicated control exchanges for each service? What would that look like in the config? From: Herve Beraud <hberaud@redhat.com> Sent: Thursday, December 9, 2021 2:45 AM To: Bogdan Dobrelya <bdobreli@redhat.com> Cc: openstack-discuss@lists.openstack.org Subject: [EXTERNAL] Re: [ops] [kolla] RabbitMQ High Availability Caution: This email originated from outside the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe. Le mer. 8 déc. 2021 à 11:48, Bogdan Dobrelya <bdobreli@redhat.com> a écrit : Please see inline
That said, Oslo messaging would likely require more granular control over topic exchanges and the durable-queues flag, so that it can be told to declare as durable only the most critical paths of a service. A single config setting and a single control exchange per service might not be enough.
Note, therefore, that `amqp_durable_queues=True` requires dedicated control exchanges configured for each service. Services that use the default 'openstack' exchange cannot turn the feature on. Changing it to a service-specific exchange might also cause upgrade impact, as described in the topic [3]. The same is true for `amqp_auto_delete=True`: that also requires dedicated control exchanges, because it won't work if each service defines its own policy on a shared control exchange (e.g. `openstack`) and the policies differ from each other. [3] https://review.opendev.org/q/topic:scope-config-opts
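As an illustrative sketch (not from the original thread): a dedicated control exchange is set per service via the oslo.messaging `control_exchange` option, which many services leave at its 'openstack' default. Assuming nova as the example service, the relevant config fragment might look like:

```ini
# Hypothetical example, e.g. in nova.conf: give the service its own
# control exchange instead of the shared 'openstack' default, so that
# durable/auto-delete policies don't clash across services.
[DEFAULT]
control_exchange = nova

[oslo_messaging_rabbit]
amqp_durable_queues = true
```

Per the discussion above, changing an already-deployed service away from the shared exchange may have upgrade impact, so this is a design sketch rather than a drop-in recommendation.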
There are also race conditions with durable queues enabled, like [1]. A solution could be for each service to declare its own dedicated control exchange with its own configuration.
Finally, OpenStack components should perhaps add a *.next CI job to test with durable queues, like [2].
[0] https://www.rabbitmq.com/ha.html#replication-factor
[1] https://zuul.opendev.org/t/openstack/build/aa514dd788f34cc1be3800e6d7dba0e8/log/controller/logs/screen-n-cpu.txt
[2] https://review.opendev.org/c/openstack/nova/+/820523
Does anyone have a sample set of RMQ config files that they can share?
It looks like my Outlook has ruined the link; reposting: [2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
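For illustration (not part of the original thread), the HA policy pattern recommended in the wiki page above can be sanity-checked against some hypothetical queue names with a quick Python snippet; it shows which queues the negative lookahead excludes from mirroring:

```python
import re

# The HA policy pattern recommended in [2]: a negative lookahead that
# excludes amq.* queues, fanout queues, and reply queues from mirroring.
HA_PATTERN = r'^(?!(amq\.)|(.*_fanout_)|(reply_)).*'

def ha_applies(queue_name):
    """Return True if the ha-all policy pattern matches this queue name."""
    return re.match(HA_PATTERN, queue_name) is not None

# Durable RPC topic queues still match and would be mirrored:
print(ha_applies("compute.hostname01"))      # True
# Transient fanout and reply queues are excluded:
print(ha_applies("scheduler_fanout_4f6d"))   # False
print(ha_applies("reply_a1b2c3"))            # False
print(ha_applies("amq.gen-XyZ"))             # False
```

The queue names here are made-up examples; the point is only that the pattern restricts mirroring to the durable topic queues and leaves the short-lived fanout/reply queues unmirrored.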
-- Best regards, Bogdan Dobrelya, Irc #bogdando
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/
https://twitter.com/4383hberaud
On Tue, 4 Jan 2022 at 14:08, Albert Braden <ozzzo@yahoo.com> wrote:
Now that the holidays are over I'm trying this one again. Can anyone help me figure out how to set "expires" and "message-ttl" ?
John Garbutt proposed a few patches for RabbitMQ in kolla, including this: https://review.opendev.org/c/openstack/kolla-ansible/+/822191 https://review.opendev.org/q/hashtag:%2522rabbitmq%2522+(status:open+OR+status:merged)+project:openstack/kolla-ansible Note that they are currently untested. Mark
On Thursday, December 16, 2021, 01:43:57 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
I tried these policies in ansible/roles/rabbitmq/templates/definitions.json.j2:
"policies":[ {"vhost": "/", "name": "ha-all", "pattern": '^(?!(amq\.)|(.*_fanout_)|(reply_)).*', "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %}, {"vhost": "/", "name": "notifications-ttl", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"message-ttl":600}, "priority":0} {"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"expire":3600}, "priority":0} {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0} {% endif %}
But I still see unconsumed messages lingering in notifications_extractor.info. From reading the docs I think this setting should cause messages to expire after 600 seconds, and unused queues to be deleted after 3600 seconds. What am I missing? On Tuesday, December 14, 2021, 04:18:09 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
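As a hedged sketch (not a known-working kolla config), the policy JSON above appears to have several problems: the first pattern is single-quoted, which is not valid JSON; the policy objects inside the `{% if %}` block are missing comma separators; the queue-expiry key is `expires`, not `expire`; and both `message-ttl` and `expires` take milliseconds, so 600 means 0.6 seconds. Also, since each queue gets at most one matching policy (a point raised later in the thread), `message-ttl` and `expires` for the notification queues need to live in a single policy definition. A corrected version, assuming the intent was a 600 s TTL and a 3600 s expiry, might look like:

```json
"policies":[
  {"vhost": "/", "name": "ha-all", "pattern": "^(?!(amq\\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %},
  {"vhost": "/", "name": "notifications", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"message-ttl":600000, "expires":3600000}, "priority":1},
  {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% endif %}
]
```

Note the trade-off: with priority 1, the notifications policy wins over ha-all for those queues, so they would no longer be mirrored; combining `ha-mode` into the same definition is an option if mirroring is still wanted there.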
Following [1] I successfully set "amqp_durable_queues = True" and restricted HA to the appropriate queues, but I'm having trouble with some of the other settings such as "expires" and "message-ttl". Does anyone have an example of a working kolla config that includes these changes?
[1] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit On Monday, December 13, 2021, 07:51:32 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
So, your config snippet LGTM.
Le ven. 10 déc. 2021 à 17:50, Albert Braden <ozzzo@yahoo.com> a écrit :
Sorry, that was a transcription error. I thought "True" and my fingers typed "False." The correct lines are:
[oslo_messaging_rabbit] amqp_durable_queues = True
This is very helpful. Thank you! It appears that I have successfully set the expire time to 1200, because I no longer see unconsumed messages lingering in my queues, but it's not obvious how to verify. In the web interface, when I look at the queues, I see things like policy, state, features and consumers, but I don't see a timeout or expire value, nor do I find the number 1200 anywhere. Where should I be looking in the web interface to verify that I set the expire time correctly? Or do I need to use the CLI?
In the web interface (RabbitMQ 3.8.23, not using Kolla), when looking at the queue you will see the "Policy" listed by name, and the "Effective policy definition". You can either view the policy definition and the arguments for the definitions applied, or the "effective policy definition" should show you the list.

It may be relevant to note: "Each exchange or queue will have at most one policy matching" - https://www.rabbitmq.com/parameters.html#how-policies-work

I've added a similar comment to the linked patchset.
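For checking the same data outside the web UI, the RabbitMQ management HTTP API exposes per-queue information at `/api/queues/<vhost>/<name>`, including an `effective_policy_definition` field on recent versions. The sketch below is illustrative only: the host, port, and guest/guest credentials are assumptions, not values from this thread.

```python
import base64
import json
import urllib.parse
import urllib.request

def queue_api_url(host, vhost, queue, port=15672):
    # The vhost path segment must be percent-encoded; the default
    # vhost "/" becomes %2F in the management API path.
    return "http://%s:%d/api/queues/%s/%s" % (
        host, port, urllib.parse.quote(vhost, safe=""), queue)

def effective_policy(host, vhost, queue, user="guest", password="guest"):
    """Fetch a queue's effective policy definition via the management API."""
    req = urllib.request.Request(queue_api_url(host, vhost, queue))
    # HTTP basic auth; the credentials here are illustrative defaults.
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data.get("effective_policy_definition", {})

# Example (requires a reachable broker), hypothetical queue name:
# effective_policy("localhost", "/", "notifications.info")
```

A queue whose matching policy sets `expires: 1200` should show that value in the returned definition, mirroring what the web UI displays.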
After digging further I realized that I'm not setting the TTL, only the queue expiration. Here's what I see in the GUI when I look at affected queues:

Policy: notifications-expire
Effective policy definition: expires: 1200

This is what I have in definitions.json.j2:

{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"expires":1200}, "priority":0},

I tried this to set both:

{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl":"{{ rabbitmq_message_ttl | int }}","expires":1200}, "priority":0},

But the RMQ containers restart every 60 seconds and puke this into the log:

[error] <0.322.0> CRASH REPORT Process <0.322.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"600\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138

After reading the doc on TTL (https://www.rabbitmq.com/ttl.html) I realized that the TTL is set in ms, so I tried "rabbitmq_message_ttl: 60000", but that only changes the number in the error:

[error] <0.318.0> CRASH REPORT Process <0.318.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"60000\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138

What am I missing?
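A hedged reading of the crash above: `message-ttl` must be a JSON integer, but quoting the Jinja expression makes the template render the value as the JSON string "600" (then "60000"), which fails RabbitMQ's validation regardless of the number. Dropping the quotes so the value renders as a bare integer, and keeping in mind that both `message-ttl` and `expires` are in milliseconds, might look like:

```json
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl": {{ rabbitmq_message_ttl | int }}, "expires": 1200}, "priority":0},
```

This is a sketch against the snippet quoted above, not a tested kolla config; note also that an `expires` of 1200 means 1.2 seconds, so both values may want revisiting once the template renders cleanly.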
Finally, openstack components should add perhaps a *.next CI job to test it with durable queues, like [2]
[0] https://www.rabbitmq.com/ha.html#replication-factor
[1] https://zuul.opendev.org/t/openstack/build/aa514dd788f34cc1be3800e6d7dba0e8/...
[2] https://review.opendev.org/c/openstack/nova/+/820523
Does anyone have a sample set of RMQ config files that they can share?
It looks like my Outlook has ruined the link; reposting: [2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
-- Best regards, Bogdan Dobrelya, Irc #bogdando
-- Best regards, Bogdan Dobrelya, Irc #bogdando
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://twitter.com/4383hberaud
-- Hervé Beraud Senior Software Engineer at Red Hat irc: hberaud https://github.com/4383/ https://twitter.com/4383hberaud
-- Hervé Beraud Senior Software Engineer at Red Hat irc: hberaud https://github.com/4383/ https://twitter.com/4383hberaud
On Thu, 13 Jan 2022 at 18:55, Albert Braden <ozzzo@yahoo.com> wrote:
After reading more I realize that "expires" is also set in ms. So it looks like the correct settings are:
message-ttl: 60000
expires: 120000

This would expire messages after one minute and delete idle queues after two minutes.
The only remaining question is, how can I specify these in a variable without generating the "not a valid message TTL" error?

On Thursday, January 13, 2022, 01:22:33 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
Update: I googled around and found this: https://tickets.puppetlabs.com/browse/MODULES-2986
Apparently the " | int " isn't working. I tried '60000' and "60000" but that didn't make a difference. In desperation I tried this:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl":60000,"expires":1200}, "priority":0},
That works, but I'd prefer to use a variable. Has anyone done this successfully? Also, am I understanding correctly that "message-ttl" is set in milliseconds and "expires" is set in seconds? Or do I need to use ms for "expires" too?

On Thursday, January 13, 2022, 11:03:11 AM EST, Albert Braden <ozzzo@yahoo.com> wrote:
After digging further I realized that I'm not setting TTL; only queue expiration. Here's what I see in the GUI when I look at affected queues:
Policy: notifications-expire
Effective policy definition: expires: 1200
This is what I have in definitions.json.j2:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"expires":1200}, "priority":0},
I tried this to set both:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl":"{{ rabbitmq_message_ttl | int }}","expires":1200}, "priority":0},
Drop the double quotes around the jinja expression. It's not YAML, so you don't need them. Please update the upstream patches with any fixes.
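Applied to the template quoted above, that suggestion would have the variable render as a bare JSON number rather than a string; a sketch of the corrected line (same hypothetical rabbitmq_message_ttl variable as in the thread):

```jinja
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl": {{ rabbitmq_message_ttl | int }}, "expires": 1200}, "priority": 0},
```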
But the RMQ containers restart every 60 seconds and puke this into the log:
[error] <0.322.0> CRASH REPORT Process <0.322.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"600\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138
After reading the doc on TTL: https://www.rabbitmq.com/ttl.html I realized that the TTL is set in ms, so I tried "rabbitmq_message_ttl: 60000"
but that only changes the number in the error:
[error] <0.318.0> CRASH REPORT Process <0.318.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"60000\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138
What am I missing?
On Wednesday, January 12, 2022, 05:11:41 PM EST, Dale Smith <dale@catalystcloud.nz> wrote:
In the web interface (RabbitMQ 3.8.23, not using Kolla), when looking at the queue you will see the "Policy" listed by name, and the "Effective policy definition".
You can either view the policy definition, and the arguments for the definitions applied, or "effective policy definition" should show you the list.
It may be relevant to note: "Each exchange or queue will have at most one policy matching" - https://www.rabbitmq.com/parameters.html#how-policies-work
I've added a similar comment to the linked patchset.
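One practical consequence of the "at most one policy" rule: the separate notifications-ttl and notifications-expire policies quoted elsewhere in the thread share a pattern, so only one of them would take effect on a matching queue. A workaround (a sketch, not from the thread; values assumed to be milliseconds) is a single policy carrying both keys:

```json
{"vhost": "/", "name": "notifications", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl": 60000, "expires": 120000}, "priority": 0}
```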
On 13/01/22 7:26 am, Albert Braden wrote:
This is very helpful. Thank you! It appears that I have successfully set the expire time to 1200, because I no longer see unconsumed messages lingering in my queues, but it's not obvious how to verify. In the web interface, when I look at the queues, I see things like policy, state, features and consumers, but I don't see a timeout or expire value, nor do I find the number 1200 anywhere. Where should I be looking in the web interface to verify that I set the expire time correctly? Or do I need to use the CLI?

On Wednesday, January 5, 2022, 04:23:29 AM EST, Mark Goddard <mark@stackhpc.com> wrote:
On Tue, 4 Jan 2022 at 14:08, Albert Braden <ozzzo@yahoo.com> wrote:
Now that the holidays are over I'm trying this one again. Can anyone help me figure out how to set "expires" and "message-ttl" ?
John Garbutt proposed a few patches for RabbitMQ in kolla, including this: https://review.opendev.org/c/openstack/kolla-ansible/+/822191
https://review.opendev.org/q/hashtag:%2522rabbitmq%2522+(status:open+OR+stat...
Note that they are currently untested.
Mark
On Thursday, December 16, 2021, 01:43:57 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
I tried these policies in ansible/roles/rabbitmq/templates/definitions.json.j2:
"policies":[ {"vhost": "/", "name": "ha-all", "pattern": '^(?!(amq\.)|(.*_fanout_)|(reply_)).*', "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %}, {"vhost": "/", "name": "notifications-ttl", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"message-ttl":600}, "priority":0} {"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"expire":3600}, "priority":0} {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0} {% endif %}
But I still see unconsumed messages lingering in notifications_extractor.info. From reading the docs I think these settings should cause messages to expire after 600 seconds, and unused queues to be deleted after 3600 seconds. What am I missing?

On Tuesday, December 14, 2021, 04:18:09 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
Following [1] I successfully set "amqp_durable_queues = True" and restricted HA to the appropriate queues, but I'm having trouble with some of the other settings such as "expires" and "message-ttl". Does anyone have an example of a working kolla config that includes these changes?
[1] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit

On Monday, December 13, 2021, 07:51:32 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
So, your config snippet LGTM.
Le ven. 10 déc. 2021 à 17:50, Albert Braden <ozzzo@yahoo.com> a écrit :
Sorry, that was a transcription error. I thought "True" and my fingers typed "False." The correct lines are:
[oslo_messaging_rabbit] amqp_durable_queues = True
On Friday, December 10, 2021, 02:55:55 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
If you plan to leave `amqp_durable_queues = False` (i.e. if you plan to keep this config equal to false), then you don't need to add these config lines, as this is already the default value [1].
[1] https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messagin...
Le jeu. 9 déc. 2021 à 22:40, Albert Braden <ozzzo@yahoo.com> a écrit :
Replying from my home email because I've been asked to not email the list from my work email anymore, until I get permission from upper management.
I'm not sure I follow. I was planning to add 2 lines to etc/kolla/config/global.conf:
[oslo_messaging_rabbit] amqp_durable_queues = False
Is that not sufficient? What is involved in configuring dedicated control exchanges for each service? What would that look like in the config?
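For illustration only (a sketch, not a tested kolla config): oslo.messaging selects the control exchange via the control_exchange option in [DEFAULT], so a dedicated exchange per service might be overridden like this:

```ini
# Hypothetical per-service override file, e.g. etc/kolla/config/nova.conf.
# Naming control_exchange per service gives the dedicated exchange
# discussed below, instead of a shared 'openstack' exchange.
[DEFAULT]
control_exchange = nova

[oslo_messaging_rabbit]
amqp_durable_queues = True
```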
From: Herve Beraud <hberaud@redhat.com>
Sent: Thursday, December 9, 2021 2:45 AM
To: Bogdan Dobrelya <bdobreli@redhat.com>
Cc: openstack-discuss@lists.openstack.org
Subject: [EXTERNAL] Re: [ops] [kolla] RabbitMQ High Availability
Le mer. 8 déc. 2021 à 11:48, Bogdan Dobrelya <bdobreli@redhat.com> a écrit :
Please see inline
I read this with great interest because we are seeing this issue. Questions:
1. We are running kolla-ansible Train, and our RMQ version is 3.7.23. Should we be upgrading our Train clusters to use 3.8.x?
2. Document [2] recommends the policy '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'. I don't see this in our ansible playbooks, nor in any of the config files in the RMQ container. What would this look like in Ansible, and what should the resulting container config look like?
3. It appears that we are not setting "amqp_durable_queues = True". What does this setting look like in Ansible, and what file does it go into?
Note that even with the rabbit HA policies adjusted like that and the HA replication factor [0] decreased (e.g. to 2), there might still be high churn caused by a large enough number of replicated durable RPC topic queues. And that might cripple the cloud with the incurred I/O overhead, because a durable queue requires all messages in it to be persisted to disk (for all the messaging cluster replicas) before they are ack'ed by the broker.
That said, Oslo messaging would likely require more granular control of topic exchanges and the durable queues flag - to tell it to declare as durable only the most critical paths of a service. A single config setting and a single control exchange per service might not be enough.

Note, therefore, that amqp_durable_queues=True requires dedicated control exchanges configured for each service. Services that use 'openstack' as a default cannot turn the feature ON. Changing it to a service-specific exchange might also cause upgrade impact, as described in the topic [3].
The same is true for `amqp_auto_delete=True`. That also requires dedicated control exchanges; it won't work if each service defines its own policy on a shared control exchange (e.g. `openstack`) and the policies differ from each other.
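As a sketch of what "dedicated control exchanges" could look like in config: `control_exchange` ([DEFAULT]) and `amqp_durable_queues`/`amqp_auto_delete` ([oslo_messaging_rabbit]) are real oslo.messaging options, but the exchange name and the idea of setting it per service are illustrative of the proposal above, not a tested recipe - whether a given service tolerates this is exactly the open question in this thread:

```ini
# Hypothetical per-service override, e.g. in nova.conf.
# control_exchange defaults to "openstack"; giving each service its own
# exchange is what "dedicated control exchanges" means here.
[DEFAULT]
control_exchange = nova

[oslo_messaging_rabbit]
# Per the discussion above, only safe once the service no longer shares
# a control exchange with services configured differently:
amqp_durable_queues = true
amqp_auto_delete = true
```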
[3] https://review.opendev.org/q/topic:scope-config-opts
There are also race conditions with durable queues enabled, like [1]. A solution could be for each service to declare its own dedicated control exchange with its own configuration.
Finally, openstack components should perhaps add a *.next CI job to test with durable queues, like [2].
[0] https://www.rabbitmq.com/ha.html#replication-factor
[1] https://zuul.opendev.org/t/openstack/build/aa514dd788f34cc1be3800e6d7dba0e8/...
[2] https://review.opendev.org/c/openstack/nova/+/820523
Does anyone have a sample set of RMQ config files that they can share?
It looks like my Outlook has ruined the link; reposting: [2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
-- Best regards, Bogdan Dobrelya, Irc #bogdando
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/
https://twitter.com/4383hberaud
On 17/01/2022 09:21, Mark Goddard wrote:
On Thu, 13 Jan 2022 at 18:55, Albert Braden <ozzzo@yahoo.com> wrote:
After reading more I realize that "expires" is also set in ms. So it looks like the correct settings are:

message-ttl: 600000
expires: 1200000

This would expire messages in 10 minutes and queues in 20 minutes.
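Since both keys are measured in milliseconds, it's easy to slip a zero; the unit math can be sanity-checked with a couple of lines (the helper name is mine, purely illustrative):

```python
# Both "message-ttl" and "expires" are RabbitMQ policy keys measured in
# milliseconds (https://www.rabbitmq.com/ttl.html).

def minutes_to_ms(minutes: int) -> int:
    """Convert minutes to the millisecond values RabbitMQ policies expect."""
    return minutes * 60 * 1000

# 10-minute message TTL and 20-minute queue expiry:
definition = {
    "message-ttl": minutes_to_ms(10),  # 600000
    "expires": minutes_to_ms(20),      # 1200000
}
```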
The only remaining question is, how can I specify these in a variable without generating the "not a valid message TTL" error?

On Thursday, January 13, 2022, 01:22:33 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
Update: I googled around and found this: https://tickets.puppetlabs.com/browse/MODULES-2986
Apparently the " | int " isn't working. I tried '60000' and "60000" but that didn't make a difference. In desperation I tried this:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl":60000,"expires":1200}, "priority":0},
That works, but I'd prefer to use a variable. Has anyone done this successfully? Also, am I understanding correctly that "message-ttl" is set in milliseconds and "expires" is set in seconds? Or do I need to use ms for "expires" too?

On Thursday, January 13, 2022, 11:03:11 AM EST, Albert Braden <ozzzo@yahoo.com> wrote:
After digging further I realized that I'm not setting TTL; only queue expiration. Here's what I see in the GUI when I look at affected queues:
Policy: notifications-expire
Effective policy definition: expires: 1200
This is what I have in definitions.json.j2:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"expires":1200}, "priority":0},
I tried this to set both:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl":"{{ rabbitmq_message_ttl | int }}","expires":1200}, "priority":0},

Drop the double quotes around the jinja expression. It's not YAML, so you don't need them. Please update the upstream patches with any fixes.
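A minimal stdlib-only illustration of why the quotes matter: with quotes, the rendered policy definition carries a JSON string, which RabbitMQ rejects as "not a valid message TTL"; without them, it carries an integer. (The Jinja rendering is simulated here with plain string formatting.)

```python
import json

ttl = 60000  # milliseconds

# With double quotes around the templated value, the rendered definition
# contains a string:
quoted = '{"message-ttl": "%d"}' % ttl
# Without quotes, it contains a number, which is what RabbitMQ validates:
bare = '{"message-ttl": %d}' % ttl

assert json.loads(quoted)["message-ttl"] == "60000"  # a string -> rejected
assert json.loads(bare)["message-ttl"] == 60000      # an int -> accepted
```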
But the RMQ containers restart every 60 seconds and puke this into the log:
[error] <0.322.0> CRASH REPORT Process <0.322.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"600\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138
After reading the doc on TTL: https://www.rabbitmq.com/ttl.html I realized that the TTL is set in ms, so I tried "rabbitmq_message_ttl: 60000"
but that only changes the number in the error:
[error] <0.318.0> CRASH REPORT Process <0.318.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"60000\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138
What am I missing?
On Wednesday, January 12, 2022, 05:11:41 PM EST, Dale Smith <dale@catalystcloud.nz> wrote:
In the web interface (RabbitMQ 3.8.23, not using Kolla), when looking at the queue you will see the "Policy" listed by name, and the "Effective policy definition".
You can either view the policy definition, and the arguments for the definitions applied, or "effective policy definition" should show you the list.
It may be relevant to note: "Each exchange or queue will have at most one policy matching" - https://www.rabbitmq.com/parameters.html#how-policies-work
I've added a similar comment to the linked patchset.
On 13/01/22 7:26 am, Albert Braden wrote:
This is very helpful. Thank you! It appears that I have successfully set the expire time to 1200, because I no longer see unconsumed messages lingering in my queues, but it's not obvious how to verify. In the web interface, when I look at the queues, I see things like policy, state, features and consumers, but I don't see a timeout or expire value, nor do I find the number 1200 anywhere. Where should I be looking in the web interface to verify that I set the expire time correctly? Or do I need to use the CLI?

On Wednesday, January 5, 2022, 04:23:29 AM EST, Mark Goddard <mark@stackhpc.com> wrote:
On Tue, 4 Jan 2022 at 14:08, Albert Braden <ozzzo@yahoo.com> wrote:

Now that the holidays are over I'm trying this one again. Can anyone help me figure out how to set "expires" and "message-ttl"?

John Garbutt proposed a few patches for RabbitMQ in kolla, including this: https://review.opendev.org/c/openstack/kolla-ansible/+/822191
https://review.opendev.org/q/hashtag:%2522rabbitmq%2522+(status:open+OR+stat...
Note that they are currently untested.
I've proposed one more as an alternative to reducing the number of queue mirrors (disable all mirroring): https://review.opendev.org/c/openstack/kolla-ansible/+/824994

The reasoning behind it is in the commit message. It's partly justified by the fact that, with the current transient mirrored configuration, we quite frequently have to 'reset' RabbitMQ by removing all state anyway.
Mark
On Thursday, December 16, 2021, 01:43:57 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
I tried these policies in ansible/roles/rabbitmq/templates/definitions.json.j2:
"policies":[ {"vhost": "/", "name": "ha-all", "pattern": '^(?!(amq\.)|(.*_fanout_)|(reply_)).*', "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %}, {"vhost": "/", "name": "notifications-ttl", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"message-ttl":600}, "priority":0} {"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"expire":3600}, "priority":0} {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0} {% endif %}
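The fixes that surface later in the thread suggest what's wrong with the policy list above: the ha-all pattern is wrapped in single quotes (not valid JSON), the policy objects aren't separated by commas, the key is "expires" rather than "expire", and both values are milliseconds, so 600/3600 would mean 0.6 s/3.6 s. Any of the syntax errors would likely keep the definitions file from loading at all. A corrected sketch of just the two notification policies, validated as JSON, with millisecond values chosen to match the stated 600 s/3600 s intent:

```python
import json

# Hedged reconstruction of the two notification policies with the likely
# fixes applied: double quotes everywhere, commas between objects,
# "expires" (not "expire"), and millisecond values.
policies = json.loads(r'''
[
  {"vhost": "/", "name": "notifications-ttl",
   "pattern": "^(notifications|versioned_notifications)\\.",
   "apply-to": "queues",
   "definition": {"message-ttl": 600000}, "priority": 0},
  {"vhost": "/", "name": "notifications-expire",
   "pattern": "^(notifications|versioned_notifications)\\.",
   "apply-to": "queues",
   "definition": {"expires": 3600000}, "priority": 0}
]
''')
assert policies[0]["definition"]["message-ttl"] == 600000  # 10 minutes
assert policies[1]["definition"]["expires"] == 3600000     # 1 hour
```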
But I still see unconsumed messages lingering in notifications_extractor.info. From reading the docs I think this setting should cause messages to expire after 600 seconds, and unused queues to be deleted after 3600 seconds. What am I missing?

On Tuesday, December 14, 2021, 04:18:09 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
Following [1] I successfully set "amqp_durable_queues = True" and restricted HA to the appropriate queues, but I'm having trouble with some of the other settings such as "expires" and "message-ttl". Does anyone have an example of a working kolla config that includes these changes?
[1] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit

On Monday, December 13, 2021, 07:51:32 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
So, your config snippet LGTM.
On Fri, Dec 10, 2021 at 17:50, Albert Braden <ozzzo@yahoo.com> wrote:
Sorry, that was a transcription error. I thought "True" and my fingers typed "False." The correct lines are:
[oslo_messaging_rabbit] amqp_durable_queues = True
Hey Doug,

That's a nice piece of information! Would it be possible for you to update the wiki at [1] with your latest data, if you think this could be relevant?

Cheers,
Arnaud

[1] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
That fixed the problem, thank you! I was able to successfully set TTL and expiration. I added 2 comments to [1] and my co-worker updated [2] with the correct ms values.

[1] https://review.opendev.org/c/openstack/kolla-ansible/+/822191
[2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
After reading more I realize that "expires" is also set in ms. So it looks like the correct settings are:
message-ttl: 60000 expires: 120000
This would expire messages in 10 minutes and queues in 20 minutes.
The only remaining question is, how can I specify these in a variable without generating the "not a valid message TTL" error? On Thursday, January 13, 2022, 01:22:33 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
Update: I googled around and found this: https://tickets.puppetlabs.com/browse/MODULES-2986
Apparently the " | int " isn't working. I tried '60000' and "60000" but that didn't make a difference. In desperation I tried this:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl":60000,"expires":1200}, "priority":0},
That works, but I'd prefer to use a variable. Has anyone done this successfully? Also, am I understanding correctly that "message-ttl" is set in milliseconds and "expires" is set in seconds? Or do I need to use ms for "expires" too? On Thursday, January 13, 2022, 11:03:11 AM EST, Albert Braden <ozzzo@yahoo.com> wrote:
After digging further I realized that I'm not setting TTL; only queue expiration. Here's what I see in the GUI when I look at affected queues:
Policy notifications-expire Effective policy definition expires: 1200
This is what I have in definitions.json.j2:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"expires":1200}, "priority":0},
I tried this to set both:
{"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications).*", "apply-to": "queues", "definition": {"message-ttl":"{{ rabbitmq_message_ttl | int }}","expires":1200}, "priority":0},
Drop the double quotes around the jinja expression. It's not YAML, so you don't need them. Please update the upstream patches with any fixes.
But the RMQ containers restart every 60 seconds and puke this into the log:
[error] <0.322.0> CRASH REPORT Process <0.322.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"600\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138
After reading the doc on TTL: https://www.rabbitmq.com/ttl.html I realized that the TTL is set in ms, so I tried "rabbitmq_message_ttl: 60000"
but that only changes the number in the error:
[error] <0.318.0> CRASH REPORT Process <0.318.0> with 0 neighbours exited with reason: {error,<<"<<\"Validation failed\\n\\n<<\\\"60000\\\">> is not a valid message TTL\\n (//notifications-expire)\">>">>} in application_master:init/4 line 138
What am I missing?
On Wednesday, January 12, 2022, 05:11:41 PM EST, Dale Smith <dale@catalystcloud.nz> wrote:
In the web interface(RabbitMQ 3.8.23, not using Kolla), when looking at the queue you will see the "Policy" listed by name, and "Effective policy definition".
You can either view the policy definition, and the arguments for the definitions applied, or "effective policy definition" should show you the list.
It may be relevant to note: "Each exchange or queue will have at most one policy matching" - https://www.rabbitmq.com/parameters.html#how-policies-work
I've added a similar comment to the linked patchset.
On 13/01/22 7:26 am, Albert Braden wrote:
This is very helpful. Thank you! It appears that I have successfully set the expire time to 1200, because I no longer see unconsumed messages lingering in my queues, but it's not obvious how to verify. In the web interface, when I look at the queues, I see things like policy, state, features and consumers, but I don't see a timeout or expire value, nor do I find the number 1200 anywhere. Where should I be looking in the web interface to verify that I set the expire time correctly? Or do I need to use the CLI? On Wednesday, January 5, 2022, 04:23:29 AM EST, Mark Goddard <mark@stackhpc.com> wrote:
On Tue, 4 Jan 2022 at 14:08, Albert Braden <ozzzo@yahoo.com> wrote:
Now that the holidays are over I'm trying this one again. Can anyone help me figure out how to set "expires" and "message-ttl" ?
John Garbutt proposed a few patches for RabbitMQ in kolla, including this: https://review.opendev.org/c/openstack/kolla-ansible/+/822191
https://review.opendev.org/q/hashtag:%2522rabbitmq%2522+(status:open+OR+stat...
Note that they are currently untested.
Mark
On Thursday, December 16, 2021, 01:43:57 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
I tried these policies in ansible/roles/rabbitmq/templates/definitions.json.j2:
"policies":[ {"vhost": "/", "name": "ha-all", "pattern": '^(?!(amq\.)|(.*_fanout_)|(reply_)).*', "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %}, {"vhost": "/", "name": "notifications-ttl", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"message-ttl":600}, "priority":0} {"vhost": "/", "name": "notifications-expire", "pattern": "^(notifications|versioned_notifications)\\.", "apply-to": "queues", "definition": {"expire":3600}, "priority":0} {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0} {% endif %}
But I still see unconsumed messages lingering in notifications_extractor.info. From reading the docs I think this setting should cause messages to expire after 600 seconds, and unused queues to be deleted after 3600 seconds. What am I missing? On Tuesday, December 14, 2021, 04:18:09 PM EST, Albert Braden <ozzzo@yahoo.com> wrote:
Following [1] I successfully set "amqp_durable_queues = True" and restricted HA to the appropriate queues, but I'm having trouble with some of the other settings such as "expires" and "message-ttl". Does anyone have an example of a working kolla config that includes these changes?
[1] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit On Monday, December 13, 2021, 07:51:32 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
So, your config snippet LGTM.
Le ven. 10 déc. 2021 à 17:50, Albert Braden <ozzzo@yahoo.com> a écrit :
Sorry, that was a transcription error. I thought "True" and my fingers typed "False." The correct lines are:
[oslo_messaging_rabbit]
amqp_durable_queues = True
On Friday, December 10, 2021, 02:55:55 AM EST, Herve Beraud <hberaud@redhat.com> wrote:
If you plan to leave `amqp_durable_queues = False` (i.e. if you plan to keep this config equal to false), then you don't need to add these config lines, as this is already the default value [1].
[1] https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messagin...
Le jeu. 9 déc. 2021 à 22:40, Albert Braden <ozzzo@yahoo.com> a écrit :
Replying from my home email because I've been asked to not email the list from my work email anymore, until I get permission from upper management.
I'm not sure I follow. I was planning to add 2 lines to etc/kolla/config/global.conf:
[oslo_messaging_rabbit]
amqp_durable_queues = False
Is that not sufficient? What is involved in configuring dedicated control exchanges for each service? What would that look like in the config?
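[For context on the question above: a sketch of what a dedicated control exchange could look like. oslo.messaging reads the exchange name from the `control_exchange` option in `[DEFAULT]` (the default is `openstack`); the service name and file below are illustrative, and whether kolla-ansible exposes this cleanly for every service is not confirmed in the thread:]

```ini
# Illustrative only: e.g. merged into nova.conf via kolla's per-service
# config overrides, giving nova its own control exchange instead of the
# shared 'openstack' default.
[DEFAULT]
control_exchange = nova

[oslo_messaging_rabbit]
amqp_durable_queues = True
```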
From: Herve Beraud <hberaud@redhat.com>
Sent: Thursday, December 9, 2021 2:45 AM
To: Bogdan Dobrelya <bdobreli@redhat.com>
Cc: openstack-discuss@lists.openstack.org
Subject: [EXTERNAL] Re: [ops] [kolla] RabbitMQ High Availability
Le mer. 8 déc. 2021 à 11:48, Bogdan Dobrelya <bdobreli@redhat.com> a écrit :
Please see inline
I read this with great interest because we are seeing this issue. Questions:
1. We are running kolla-ansible Train, and our RabbitMQ version is 3.7.23. Should we be upgrading our Train clusters to 3.8.x?
2. Document [2] recommends the policy pattern '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'. I don't see this in our Ansible playbooks, nor in any of the config files in the RabbitMQ container. What would this look like in Ansible, and what should the resulting container config look like?
3. It appears that we are not setting "amqp_durable_queues = True". What does this setting look like in Ansible, and which file does it go into?
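[Regarding question 2, the effect of that recommended pattern can be checked locally with Python's `re` module; the queue names below are made-up examples of typical OpenStack queue shapes, not queues read from a real broker:]

```python
import re

# The HA policy pattern recommended in [2]: mirror everything except
# broker-generated amq.* queues, fanout queues, and RPC reply queues.
HA_PATTERN = re.compile(r'^(?!(amq\.)|(.*_fanout_)|(reply_)).*')

queues = [
    "notifications.info",     # durable notification queue -> mirrored
    "compute.hostname",       # RPC topic queue            -> mirrored
    "reply_3f1c2a",           # RPC reply queue            -> excluded
    "scheduler_fanout_9d2e",  # fanout queue               -> excluded
    "amq.gen-abc123",         # broker-generated queue     -> excluded
]

mirrored = [q for q in queues if HA_PATTERN.match(q)]
print(mirrored)  # ['notifications.info', 'compute.hostname']
```

[The negative lookahead means only the short-lived, high-churn queue families are left unmirrored, which is the point of the recommendation.]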
Note that even with the rabbit HA policies adjusted like that and the HA replication factor [0] decreased (e.g. to 2), there still might be high churn caused by a large enough number of replicated durable RPC topic queues. And that might cripple the cloud with the incurred I/O overhead, because a durable queue requires all messages in it to be persisted to disk (on all of the messaging cluster replicas) before they are ack'ed by the broker.
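[For reference, lowering the replication factor as described in [0] is done with a policy; a sketch using `rabbitmqctl set_policy` against a running cluster (the policy name `ha-two` is made up, and this is an ops-time command, not something kolla-ansible applies by itself):]

```shell
# Mirror matching queues to exactly 2 nodes instead of all nodes,
# reusing the exclusion pattern from [2].
rabbitmqctl set_policy ha-two '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' \
  --apply-to all
```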
That said, oslo.messaging would likely require more granular control over topic exchanges and the durable-queues flag, to tell it to declare as durable only the most critical paths of a service. A single config setting and a single control exchange per service might not be enough.
Also note that, for this reason, amqp_durable_queues=True requires dedicated control exchanges configured for each service. Services that use 'openstack' as the default cannot turn the feature ON. Changing it to a service-specific exchange might also cause upgrade impact, as described in the topic [3].
The same is true for `amqp_auto_delete=True`. That also requires dedicated control exchanges; otherwise it won't work when each service defines its own policy on a shared control exchange (e.g. `openstack`) and the policies differ from each other.
[3] https://review.opendev.org/q/topic:scope-config-opts
There are also race conditions with durable queues enabled, like [1]. A solution could be for each service to declare its own dedicated control exchange with its own configuration.
Finally, OpenStack components should perhaps add a *.next CI job to test with durable queues, like [2].
[0] https://www.rabbitmq.com/ha.html#replication-factor
[1] https://zuul.opendev.org/t/openstack/build/aa514dd788f34cc1be3800e6d7dba0e8/...
[2] https://review.opendev.org/c/openstack/nova/+/820523
Does anyone have a sample set of RMQ config files that they can share?
It looks like my Outlook has ruined the link; reposting: [2] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
-- Best regards, Bogdan Dobrelya, Irc #bogdando
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://twitter.com/4383hberaud
participants (6)
- Albert Braden
- Arnaud
- Dale Smith
- Doug Szumski
- Herve Beraud
- Mark Goddard