Re: [openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down
Hi all,

It seems worth noting here that there is a fix ongoing in oslo.messaging which will resolve the issues with HA failing when one node is down. See here: https://review.opendev.org/c/openstack/oslo.messaging/+/866617
In the meantime, we have also found that setting kombu_reconnect_delay = 0.5 does resolve this issue.

As for why om_enable_rabbitmq_high_availability is currently defaulting to false: as Michal said, enabling it in stable releases will impact users. This is because it enables durable queues, and the migration from transient to durable queues is not a seamless procedure. It requires that the state of RabbitMQ is reset and that the OpenStack services which use RabbitMQ are restarted to recreate the queues.

I think that there is some merit in changing this default value. But if we did this, we should either add additional support to automate the migration from transient to durable queues, or at the very least provide some decent docs on the manual procedure.

However, as classic queue mirroring is deprecated in RabbitMQ (to be removed in RabbitMQ 4.0), we should maybe consider switching to quorum queues soon. Then it may be beneficial to leave the classic queue mirroring + durable queues setup as False by default. This is because the migration between queue types (durable or quorum) can take several hours on larger deployments. So it might be worth making sure the default values only require one migration to quorum queues in the future, rather than two (durable queues now and then quorum queues in the future).

We will need to make this switch eventually, but right now RabbitMQ 4.0 does not even have a set release date, so it's not the most urgent change.

Cheers,
Matt
Hi Michal,
Feel free to propose a change of the default in the master branch, but I don't think we can change the default in stable branches without impacting users.
Best regards, Michal
On 11 Apr 2023, at 15:18, Michal Arbet <michal.arbet@ultimum.io> wrote:
Hi,
Btw, why do we have such an option set to false?

Michal Arbet
Openstack Engineer

Ultimum Technologies a.s.
Na Poříčí 1047/26, 11000 Praha 1
Czech Republic

+420 604 228 897
michal.arbet@ultimum.io
https://ultimum.io

LinkedIn <https://www.linkedin.com/company/ultimum-technologies> | Twitter <https://twitter.com/ultimumtech> | Facebook <https://www.facebook.com/ultimumtechnologies/timeline>
On Tue, 11 Apr 2023 at 14:48, Michał Nasiadka <mnasiadka@gmail.com> wrote:
Hello,
RabbitMQ HA has been backported into stable releases, and it's documented here: https://docs.openstack.org/kolla-ansible/yoga/reference/message-queues/rabbi...
Best regards, Michal
On Tue, 11 Apr 2023 at 13:32, Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Yes. But the cluster cannot work properly without it. :(
On Tue, Apr 11, 2023, 6:18 PM Danny Webb <Danny.Webb@thehutgroup.com> wrote:
This commit explains why they largely removed HA queue durability:
https://opendev.org/openstack/kolla-ansible/commit/2764844ee2ff9393a4eebd90a...

From: Satish Patel <satish.txt@gmail.com>
Sent: 09 April 2023 04:16
To: Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com>
Cc: OpenStack Discuss <openstack-discuss@lists.openstack.org>
Subject: Re: [openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down
Are you proposing a solution or just raising an issue?
I did find it strange that kolla-ansible doesn't set up HA queues by default. That is a disaster, because when one of the nodes goes down it makes the whole RabbitMQ cluster unusable. Whenever I deploy Kolla I have to add an HA policy to make the queues highly available, otherwise you will end up with problems.
On Sat, Apr 8, 2023 at 6:40 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:

Hello everyone.
I want to summarize, for anyone who runs into problems with OpenStack when deploying a cluster with 3 controllers using Kolla Ansible.
Scenario: 1 of 3 controllers is down
1. Logging in to Horizon and using APIs such as Nova and Cinder becomes very slow
fix by:
nano:
kolla-ansible/ansible/roles/heat/templates/heat.conf.j2
kolla-ansible/ansible/roles/nova/templates/nova.conf.j2
kolla-ansible/ansible/roles/keystone/templates/keystone.conf.j2
kolla-ansible/ansible/roles/neutron/templates/neutron.conf.j2
kolla-ansible/ansible/roles/cinder/templates/cinder.conf.j2
(or whichever other services need caching)
add as below
[cache]
backend = oslo_cache.memcache_pool
enabled = True
memcache_servers = {{ kolla_internal_vip_address }}:{{ memcached_port }}
memcache_dead_retry = 0.25
memcache_socket_timeout = 900
https://review.opendev.org/c/openstack/kolla-ansible/+/849487
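As an alternative to patching the role templates, a minimal sketch only: the same [cache] options can be supplied as a Kolla Ansible config override, assuming the standard /etc/kolla/config directory is in use and that global.conf there is merged into every service's configuration (the memcached address below is a hypothetical example, not taken from this thread):

# /etc/kolla/config/global.conf (hypothetical override, adjust to your deployment)
[cache]
backend = oslo_cache.memcache_pool
enabled = True
memcache_servers = 192.0.2.10:11211
memcache_dead_retry = 0.25
memcache_socket_timeout = 900

A kolla-ansible reconfigure run would then regenerate the services' config files with these options, without keeping local edits to the role templates.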
but that is not the end of it
2. Cannot launch instances or map block devices (stuck at this step)
nano kolla-ansible/ansible/roles/rabbitmq/templates/definitions.json.j2
"policies":[ {"vhost": "/", "name": "ha-all", "pattern": "^(?!(amq\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %}, {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0} {% endif %} ]
nano /etc/kolla/global.conf
[oslo_messaging_rabbit]
kombu_reconnect_delay=0.5
https://bugs.launchpad.net/oslo.messaging/+bug/1993149 https://docs.openstack.org/large-scale/journey/configure/rabbitmq.html
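As a quick sanity check that the ha-all policy is actually in place after redeploying RabbitMQ, something like the following can be run on a controller (a hypothetical example, assuming the RabbitMQ container is named rabbitmq as in a default Kolla Ansible deployment):

docker exec rabbitmq rabbitmqctl list_policies -p /

The ha-all policy with ha-mode: all should then be listed for the / vhost.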
I used Xena 13.4 and Yoga 14.8.1.
The above bugs are critical, but I see that they have not been fixed. I am just an operator, and I want to share what I encountered for new people coming to OpenStack.
Nguyen Huu Khoi

--
Michał Nasiadka
mnasiadka@gmail.com
Hi Matt,

How do I set the kombu_reconnect_delay=0.5 option? Something like the following in globals.yml?

kombu_reconnect_delay: 0.5

On Wed, Apr 12, 2023 at 4:23 AM Matt Crees <mattc@stackhpc.com> wrote:
Hi. Create a global.conf in /etc/kolla/config/.

On Wed, Apr 12, 2023, 9:42 PM Satish Patel <satish.txt@gmail.com> wrote:
Yes, and the option also needs to be under the oslo_messaging_rabbit heading:

[oslo_messaging_rabbit]
kombu_reconnect_delay=0.5

On Wed, 12 Apr 2023 at 15:45, Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
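Putting the two answers above together, a minimal sketch of the resulting override file, assuming the standard /etc/kolla/config directory (where global.conf is merged into every service's generated configuration):

# /etc/kolla/config/global.conf
[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.5

A reconfigure run (for example, kolla-ansible -i <inventory> reconfigure) would then be needed for the option to land in the services' config files.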
Matt,

For a new deployment, how do I enable quorum queues? Is just adding the following enough?

om_enable_rabbitmq_high_availability: True

On Wed, Apr 12, 2023 at 10:54 AM Matt Crees <mattc@stackhpc.com> wrote:
Hi Satish,

Yes, for a new deployment you will just need to set that variable to true. However, note that this enables high availability of RabbitMQ queues using a combination of classic queue mirroring and durable queues; quorum queues are not yet supported via Kolla Ansible.

Cheers,
Matt

On Wed, 12 Apr 2023 at 16:04, Satish Patel <satish.txt@gmail.com> wrote:
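For reference, a minimal sketch of what that looks like in the main Kolla Ansible configuration, assuming the standard /etc/kolla/globals.yml path and a completely fresh deployment (for an existing cluster, see the migration steps later in the thread):

# /etc/kolla/globals.yml
om_enable_rabbitmq_high_availability: true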
Matt,

After enabling om_enable_rabbitmq_high_availability: True and kombu_reconnect_delay=0.5, all my API services started throwing the following logs, even after I rebuilt my RabbitMQ cluster. What could be wrong here?

2023-04-12 15:53:40.380 391 ERROR oslo_service.service amqp.exceptions.PreconditionFailed: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'neutron' in vhost '/': received 'true' but current is 'false'
2023-04-12 15:53:40.380 391 ERROR oslo_service.service
2023-04-12 15:53:40.380 391 ERROR oslo_service.service During handling of the above exception, another exception occurred:
2023-04-12 15:53:40.380 391 ERROR oslo_service.service
2023-04-12 15:53:40.380 391 ERROR oslo_service.service Traceback (most recent call last):
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/service.py", line 806, in run_service
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     service.start()
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/service.py", line 115, in start
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     servers = getattr(plugin, self.start_listeners_method)()
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_log/helpers.py", line 67, in wrapper
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return method(*args, **kwargs)
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py", line 425, in start_rpc_listeners
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self.conn.consume_in_threads()
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron_lib/rpc.py", line 351, in consume_in_threads
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     server.start()
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 267, in wrapper
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     states[state].run_once(lambda: fn(self, *args, **kwargs),
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 188, in run_once
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     post_fn = fn()
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 267, in <lambda>
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     states[state].run_once(lambda: fn(self, *args, **kwargs),
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 413, in start
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     self.listener = self._create_listener()
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 150, in _create_listener
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self.transport._listen(self._target, 1, None)
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/transport.py", line 142, in _listen
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self._driver.listen(target, batch_size,
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 702, in listen
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     conn.declare_topic_consumer(exchange_name=self._get_exchange(target),
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1295, in declare_topic_consumer
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     self.declare_consumer(consumer)
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1192, in declare_consumer
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self.ensure(_declare_consumer,
2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 977, in ensure
2023-04-12 15:53:40.380 391 ERROR oslo_service.service     raise exceptions.MessageDeliveryFailure(msg)
2023-04-12 15:53:40.380 391 ERROR oslo_service.service oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on 10.30.50.3:5672 after inf tries: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'neutron' in vhost '/': received 'true' but current is 'false'
2023-04-12 15:53:40.380 391 ERROR oslo_service.service

On Wed, Apr 12, 2023 at 11:10 AM Matt Crees <mattc@stackhpc.com> wrote:
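A quick way to see which exchanges were created as transient, which is what the PRECONDITION_FAILED error above is complaining about (a hypothetical check, assuming the RabbitMQ container is named rabbitmq):

docker exec rabbitmq rabbitmqctl list_exchanges name durable

An exchange such as neutron showing durable as false was declared before the change, and it will keep conflicting with the new durable declarations until the broker state is reset, as described in the next reply.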
Hi Satish,

Apologies, I would have mentioned this before, but I thought that by a new deployment you meant starting a completely fresh deploy. As you're now reconfiguring a running deployment, there are some extra steps that need to be taken to migrate to durable queues:

1. Stop the OpenStack services which use RabbitMQ.

2. Reset the state of RabbitMQ on each RabbitMQ node with the following commands. You must run each command on all RabbitMQ nodes before moving on to the next command. This will remove all queues. (See the sketch after this message for running these across all nodes.)

rabbitmqctl stop_app
rabbitmqctl force_reset
rabbitmqctl start_app

3. Start the OpenStack services again, at which point they will recreate the appropriate queues as durable.

On Wed, 12 Apr 2023 at 16:55, Satish Patel <satish.txt@gmail.com> wrote:
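A minimal sketch of step 2, assuming three hypothetical controller hosts reachable over SSH and RabbitMQ running in a container named rabbitmq; the outer loop ensures each rabbitmqctl command completes on every node before the next command starts, as required above:

for cmd in stop_app force_reset start_app; do
    for host in controller01 controller02 controller03; do
        ssh "$host" docker exec rabbitmq rabbitmqctl "$cmd"
    done
done

Steps 1 and 3 (stopping and then starting the OpenStack services themselves) still have to happen around this.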
Matt,
After enabling om_enable_rabbitmq_high_availability: True and kombu_reconnect_delay=0.5 all my api services started throwing the following logs. Even i rebuild my RabbitMQ cluster again. What could be wrong here?
2023-04-12 15:53:40.380 391 ERROR oslo_service.service amqp.exceptions.PreconditionFailed: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'neutron' in vhost '/': received 'true' but current is 'false' 2023-04-12 15:53:40.380 391 ERROR oslo_service.service 2023-04-12 15:53:40.380 391 ERROR oslo_service.service During handling of the above exception, another exception occurred: 2023-04-12 15:53:40.380 391 ERROR oslo_service.service 2023-04-12 15:53:40.380 391 ERROR oslo_service.service Traceback (most recent call last): 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/service.py", line 806, in run_service 2023-04-12 15:53:40.380 391 ERROR oslo_service.service service.start() 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/service.py", line 115, in start 2023-04-12 15:53:40.380 391 ERROR oslo_service.service servers = getattr(plugin, self.start_listeners_method)() 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_log/helpers.py", line 67, in wrapper 2023-04-12 15:53:40.380 391 ERROR oslo_service.service return method(*args, **kwargs) 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py", line 425, in start_rpc_listeners 2023-04-12 15:53:40.380 391 ERROR oslo_service.service return self.conn.consume_in_threads() 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron_lib/rpc.py", line 351, in consume_in_threads 2023-04-12 15:53:40.380 391 ERROR oslo_service.service server.start() 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 267, in wrapper 2023-04-12 15:53:40.380 391 ERROR oslo_service.service states[state].run_once(lambda: fn(self, *args, **kwargs), 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 188, in run_once 2023-04-12 15:53:40.380 391 ERROR oslo_service.service post_fn = fn() 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 267, in <lambda> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service states[state].run_once(lambda: fn(self, *args, **kwargs), 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 413, in start 2023-04-12 15:53:40.380 391 ERROR oslo_service.service self.listener = self._create_listener() 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 150, in _create_listener 2023-04-12 15:53:40.380 391 ERROR oslo_service.service return self.transport._listen(self._target, 1, None) 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/transport.py", line 142, in _listen 2023-04-12 15:53:40.380 391 ERROR oslo_service.service return self._driver.listen(target, batch_size, 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 702, in listen 2023-04-12 15:53:40.380 
391 ERROR oslo_service.service conn.declare_topic_consumer(exchange_name=self._get_exchange(target), 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1295, in declare_topic_consumer 2023-04-12 15:53:40.380 391 ERROR oslo_service.service self.declare_consumer(consumer) 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1192, in declare_consumer 2023-04-12 15:53:40.380 391 ERROR oslo_service.service return self.ensure(_declare_consumer, 2023-04-12 15:53:40.380 391 ERROR oslo_service.service File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 977, in ensure 2023-04-12 15:53:40.380 391 ERROR oslo_service.service raise exceptions.MessageDeliveryFailure(msg) 2023-04-12 15:53:40.380 391 ERROR oslo_service.service oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on 10.30.50.3:5672 after inf tries: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'neutron' in vhost '/': received 'true' but current is 'false' 2023-04-12 15:53:40.380 391 ERROR oslo_service.service
On Wed, Apr 12, 2023 at 11:10 AM Matt Crees <mattc@stackhpc.com> wrote:
Hi Satish,
Yes, for a new deployment you will just need to set that variable to true. However, that will enable high availability of RabbitMQ queues using a combination of classic queue mirroring and durable queues. Quorum queues are not yet supported via Kolla Ansible.
Cheers, Matt
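For reference, a minimal sketch of that setting, assuming the usual Kolla Ansible globals file location (the exact path depends on your installation):

# /etc/kolla/globals.yml (sketch)
# Enables classic queue mirroring plus durable queues; it does not enable quorum queues.
om_enable_rabbitmq_high_availability: true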
On Wed, 12 Apr 2023 at 16:04, Satish Patel <satish.txt@gmail.com> wrote:
Matt,
For a new deployment, how do I enable quorum queues?
Just adding the following should be enough?
om_enable_rabbitmq_high_availability: True
On Wed, Apr 12, 2023 at 10:54 AM Matt Crees <mattc@stackhpc.com> wrote:
Yes, and the option also needs to be under the oslo_messaging_rabbit heading:
[oslo_messaging_rabbit]
kombu_reconnect_delay=0.5
On Wed, 12 Apr 2023 at 15:45, Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Hi. Create global.conf in /etc/kolla/config/
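Putting those two answers together, a minimal sketch of the override file (using the path given above, which Kolla Ansible merges into the oslo.messaging configuration of the services on the next deploy/reconfigure; treat the exact behaviour as something to verify against your release's docs):

# /etc/kolla/config/global.conf (sketch)
[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.5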
On Wed, Apr 12, 2023, 9:42 PM Satish Patel <satish.txt@gmail.com> wrote:
Hi Matt,
How do I set kombu_reconnect_delay=0.5 option?
Something like the following in global.yml?
kombu_reconnect_delay: 0.5
Hi Matt,
As you're now reconfiguring a running deployment, there are some extra steps that need to be taken to migrate to durable queues.
1. You will need to stop the OpenStack services which use RabbitMQ.
2. Reset the state of RabbitMQ on each RabbitMQ node with the following commands. You must run each command on all RabbitMQ nodes before moving on to the next command. This will remove all queues.
rabbitmqctl stop_app
rabbitmqctl force_reset
rabbitmqctl start_app
3. Start the OpenStack services again, at which point they will recreate the appropriate queues as durable.
This sounds like a great new addition-to-be to the Kolla Ansible docs! Could you please propose it as a change? Kindest, Radek
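As a rough illustration of step 2 of the procedure above, run from the deployment host: a hedged sketch assuming three controllers named ctl1, ctl2 and ctl3 (hostnames are illustrative) and the default Kolla container name rabbitmq. The loop order matters: each rabbitmqctl command finishes on every node before the next command starts.

for cmd in stop_app force_reset start_app; do
  for node in ctl1 ctl2 ctl3; do
    # Run the command inside the RabbitMQ container on each controller.
    ssh $node sudo docker exec rabbitmq rabbitmqctl $cmd
  done
done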
This is great, Matt! Documentation would be greatly appreciated. I have a counter question: are durable queues good for large clouds with 1000 compute nodes, or is it better not to use durable queues? This is a private cloud and we don't care about persistent data.

On Wed, Apr 12, 2023 at 12:37 PM Radosław Piliszek <radoslaw.piliszek@gmail.com> wrote:
Hello guys. I did many tests on Xena and Yoga, so I am sure that without ha-queue and kombu_reconnect_delay=0.5 (it can be < 1) you cannot launch instances when 1 of 3 controllers is down. Can somebody verify what I say? I hope we will have a common solution for this problem, because those who use OpenStack for the first time will keep asking questions like that. Nguyen Huu Khoi

On Thu, Apr 13, 2023 at 12:59 AM Satish Patel <satish.txt@gmail.com> wrote:
Update: I use a SAN as the Cinder backend. Nguyen Huu Khoi

On Thu, Apr 13, 2023 at 9:02 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Hi all, I'll reply in turn here:

Radek, I agree it definitely will be a good addition to the KA docs. I've got it on my radar, will aim to get a patch proposed this week.

Satish, I haven't personally been able to test durable queues on a system that large. According to the RabbitMQ docs (https://www.rabbitmq.com/queues.html#durability), "Throughput and latency of a queue is not affected by whether a queue is durable or not in most cases." However, I have anecdotally heard that it can affect some performance in particularly large systems.

Please note that if you are using the classic mirrored queues, you must also have them durable. Transient (i.e. non-durable) mirrored queues are not a supported feature and do cause bugs. (For example the "old incarnation" errors seen here: https://bugs.launchpad.net/kolla-ansible/+bug/1954925)

Nguyen, I can confirm that we've seen the same behaviour. This is caused by a backported change in oslo.messaging (I believe you linked the relevant bug report previously). There is a fix in progress (https://review.opendev.org/c/openstack/oslo.messaging/+/866617), and in the meantime setting kombu_reconnect_delay < 1.0 does resolve it.

Cheers, Matt
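As a quick sanity check after enabling HA, something along these lines (a hedged sketch, again assuming the Kolla container name rabbitmq) shows whether the mirroring policy is applied and whether the queues really came back as durable:

docker exec rabbitmq rabbitmqctl list_policies
docker exec rabbitmq rabbitmqctl list_queues name durable policy | head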
On Thu, 2023-04-13 at 09:07 +0100, Matt Crees wrote:
Satish, I haven't personally been able to test durable queues on a system that large. According to the RabbitMQ docs (https://www.rabbitmq.com/queues.html#durability), "Throughput and latency of a queue is not affected by whether a queue is durable or not in most cases." However, I have anecdotally heard that it can affect some performance in particularly large systems.
The performance of a durable queue is dominated by the disk I/O. Put Rabbit on a PCIe NVMe SSD and it will have little effect; use spinning rust (an HDD, even in RAID 10) and the IOPS/throughput of the storage used to make the queue durable (which just means writing all messages to disk) will be the bottleneck for scalability. Combine that with HA and it is worse, as everything has to be written to multiple servers. I believe that the mirror implementation will wait for all copies to be persisted, but I have not really looked into it. It was raised as a pain point with Rabbit by operators in the past in terms of scaling.
Thanks Sean/Matt,

It is interesting that my only option is to use classic mirroring with durable queues :( because without mirroring the cluster acts up when one of the nodes is down. How are people scaling RabbitMQ at large scale?

Sent from my iPhone
On Apr 13, 2023, at 7:50 AM, Sean Mooney <smooney@redhat.com> wrote:
On 13/04/2023 23:04, Satish Patel wrote:
It is interesting that my only option is to use classic mirroring with durable queues :( because without mirroring the cluster acts up when one of the nodes is down.
If you have planned maintenance and non-mirrored transient queues, you can first try draining the node to be removed, before removing it from the cluster. In my testing at least, this appears to be much more successful than relying on the RabbitMQ clients to do the failover and recreate queues. See [1], or for RMQ <3.8 you can cobble something together with ha-mode nodes [2].

[1] https://www.rabbitmq.com/upgrade.html#maintenance-mode
[2] https://www.rabbitmq.com/ha.html#mirroring-arguments

This obviously doesn't solve the case of when a controller fails unexpectedly.

I also think it's worth making the distinction between a highly available messaging infrastructure, and queue mirroring. In many cases, if a RMQ node hosting a non-mirrored, transient queue goes down, it should be possible for a service to just recreate the queue on another node and retry. This often seems to fail, which leads to queue mirroring getting turned on.
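For reference, a hedged sketch of the drain approach via maintenance mode [1], assuming RabbitMQ 3.8+ and the Kolla container name rabbitmq; run on the controller you plan to take down:

# Put this node into maintenance mode: it stops accepting client connections
# (and transfers queue leaders where it can), so clients fail over to the other nodes.
docker exec rabbitmq rabbitmq-upgrade drain

# ... perform the planned maintenance on this controller ...

# Return the node to normal operation afterwards.
docker exec rabbitmq rabbitmq-upgrade revive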
Hi everyone, an update on that: when a controller fails over, we don't need to use mirrored queues if we use Ceph as the backend. In my case I use a SAN as the Cinder backend and NFS as the Glance backend, so I do need mirrored queues. It is quite weird. Nguyen Huu Khoi

On Mon, Apr 17, 2023 at 5:03 PM Doug Szumski <doug@stackhpc.com> wrote:
Oh wait! What is the relation between RabbitMQ mirrored queues and Ceph/NFS or any shared storage backend?

On Mon, Apr 24, 2023 at 2:17 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Hi Nguyễn, Oh, that makes sense! In your post you did the following, which is why I got confused :) Let me try it in the /etc/kolla/config/global.conf file and run a deploy.
nano /etc/kollla/global.conf
[oslo_messaging_rabbit]
kombu_reconnect_delay=0.5
On Wed, Apr 12, 2023 at 10:45 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Hi. Create global.conf in /etc/kolla/config/
participants (6)
- Doug Szumski
- Matt Crees
- Nguyễn Hữu Khôi
- Radosław Piliszek
- Satish Patel
- Sean Mooney