[openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down

Matt Crees mattc at stackhpc.com
Wed Apr 12 16:24:58 UTC 2023


Hi Satish,

Apologies, I would have mentioned this before but I thought when you
mentioned a new deployment you meant starting a completely fresh
deploy. As you're now reconfiguring a running deployment, there are
some extra steps that need to be taken to migrate to durable queues.
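
As a quick sanity check before and after the migration, you can see
which exchanges are currently non-durable on one of the rabbit nodes
with something like this (assuming the standard "rabbitmq" container
name):

    docker exec rabbitmq rabbitmqctl list_exchanges name durable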

1. You will need to stop the OpenStack services which use RabbitMQ
(see the sketch after this list for a rough end-to-end example).

2. Reset the state of RabbitMQ on each RabbitMQ node with the following
commands. You must run each command on all RabbitMQ nodes before
moving on to the next command. This will remove all queues.

    rabbitmqctl stop_app
    rabbitmqctl force_reset
    rabbitmqctl start_app

3. Start the OpenStack services again, at which point they will
recreate the appropriate queues as durable.
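
If it helps, here is a rough end-to-end sketch of that procedure. The
container names are only illustrative (check "docker ps" on your nodes
and adjust the list to the services you actually run):

    # 1. On each controller, stop the containers that talk to RabbitMQ,
    #    for example:
    docker stop nova_api nova_conductor nova_scheduler neutron_server \
        cinder_scheduler cinder_volume heat_engine

    # 2. Reset RabbitMQ, running each command on ALL rabbit nodes
    #    before moving on to the next one:
    docker exec rabbitmq rabbitmqctl stop_app
    docker exec rabbitmq rabbitmqctl force_reset
    docker exec rabbitmq rabbitmqctl start_app

    # 3. Start the services again (or simply re-run "kolla-ansible
    #    deploy"); they will recreate their queues and exchanges as
    #    durable:
    docker start nova_api nova_conductor nova_scheduler neutron_server \
        cinder_scheduler cinder_volume heat_engine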

On Wed, 12 Apr 2023 at 16:55, Satish Patel <satish.txt at gmail.com> wrote:
>
> Matt,
>
> After enabling om_enable_rabbitmq_high_availability: True and kombu_reconnect_delay=0.5, all my API services started throwing the logs below, even after I rebuilt my RabbitMQ cluster. What could be wrong here?
>
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service amqp.exceptions.PreconditionFailed: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'neutron' in vhost '/': received 'true' but current is 'false'
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service During handling of the above exception, another exception occurred:
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service Traceback (most recent call last):
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/service.py", line 806, in run_service
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     service.start()
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/service.py", line 115, in start
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     servers = getattr(plugin, self.start_listeners_method)()
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_log/helpers.py", line 67, in wrapper
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return method(*args, **kwargs)
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py", line 425, in start_rpc_listeners
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self.conn.consume_in_threads()
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/neutron_lib/rpc.py", line 351, in consume_in_threads
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     server.start()
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 267, in wrapper
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     states[state].run_once(lambda: fn(self, *args, **kwargs),
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 188, in run_once
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     post_fn = fn()
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 267, in <lambda>
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     states[state].run_once(lambda: fn(self, *args, **kwargs),
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/server.py", line 413, in start
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     self.listener = self._create_listener()
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 150, in _create_listener
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self.transport._listen(self._target, 1, None)
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/transport.py", line 142, in _listen
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self._driver.listen(target, batch_size,
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 702, in listen
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     conn.declare_topic_consumer(exchange_name=self._get_exchange(target),
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1295, in declare_topic_consumer
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     self.declare_consumer(consumer)
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1192, in declare_consumer
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     return self.ensure(_declare_consumer,
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 977, in ensure
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service     raise exceptions.MessageDeliveryFailure(msg)
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on 10.30.50.3:5672 after inf tries: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'neutron' in vhost '/': received 'true' but current is 'false'
> 2023-04-12 15:53:40.380 391 ERROR oslo_service.service
>
> On Wed, Apr 12, 2023 at 11:10 AM Matt Crees <mattc at stackhpc.com> wrote:
>>
>> Hi Satish,
>>
>> Yes, for a new deployment you will just need to set that variable to
>> true. However, that will enable the high availability of RabbitMQ
>> queues using a combination of classic queue mirroring and durable
>> queues.
>> Quorum queues are not yet supported via Kolla Ansible.
>>
>> Cheers,
>> Matt
>>
>> On Wed, 12 Apr 2023 at 16:04, Satish Patel <satish.txt at gmail.com> wrote:
>> >
>> > Matt,
>> >
>> > For a new deployment, how do I enable quorum queues?
>> >
>> > Just adding the following should be enough?
>> >
>> > om_enable_rabbitmq_high_availability: True
>> >
>> > On Wed, Apr 12, 2023 at 10:54 AM Matt Crees <mattc at stackhpc.com> wrote:
>> >>
>> >> Yes, and the option also needs to be under the oslo_messaging_rabbit heading:
>> >>
>> >> [oslo_messaging_rabbit]
>> >> kombu_reconnect_delay=0.5
>> >>
>> >>
>> >> On Wed, 12 Apr 2023 at 15:45, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>> >> >
>> >> > Hi.
>> >> > Create global.conf in /etc/kolla/config/
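>> >> > For example, something like this (the inventory path is just a placeholder, use your own):
>> >> >
>> >> >     # /etc/kolla/config/global.conf
>> >> >     [oslo_messaging_rabbit]
>> >> >     kombu_reconnect_delay = 0.5
>> >> >
>> >> > and then apply it with:
>> >> >
>> >> >     kolla-ansible -i <inventory> reconfigure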
>> >> >
>> >> > On Wed, Apr 12, 2023, 9:42 PM Satish Patel <satish.txt at gmail.com> wrote:
>> >> >>
>> >> >> Hi Matt,
>> >> >>
>> >> >> How do I set the kombu_reconnect_delay=0.5 option?
>> >> >>
>> >> >> Something like the following in global.yml?
>> >> >>
>> >> >> kombu_reconnect_delay: 0.5
>> >> >>
>> >> >> On Wed, Apr 12, 2023 at 4:23 AM Matt Crees <mattc at stackhpc.com> wrote:
>> >> >>>
>> >> >>> Hi all,
>> >> >>>
>> >> >>> It seems worth noting here that there is a fix ongoing in
>> >> >>> oslo.messaging which will resolve the issues with HA failing when one
>> >> >>> node is down. See here:
>> >> >>> https://review.opendev.org/c/openstack/oslo.messaging/+/866617
>> >> >>> In the meantime, we have also found that setting kombu_reconnect_delay
>> >> >>> = 0.5 does resolve this issue.
>> >> >>>
>> >> >>> As for why om_enable_rabbitmq_high_availability is currently
>> >> >>> defaulting to false, as Michal said enabling it in stable releases
>> >> >>> will impact users. This is because it enables durable queues, and the
>> >> >>> migration from transient to durable queues is not a seamless
>> >> >>> procedure. It requires that the state of RabbitMQ is reset and that
>> >> >>> the OpenStack services which use RabbitMQ are restarted to recreate
>> >> >>> the queues.
>> >> >>>
>> >> >>> I think that there is some merit in changing this default value. But
>> >> >>> if we did this, we should either add additional support to automate
>> >> >>> the migration from transient to durable queues, or at the very least
>> >> >>> provide some decent docs on the manual procedure.
>> >> >>>
>> >> >>> However, as classic queue mirroring is deprecated in RabbitMQ (to be
>> >> >>> removed in RabbitMQ 4.0) we should maybe consider switching to quorum
>> >> >>> queues soon. Then it may be beneficial to leave the classic queue
>> >> >>> mirroring + durable queues setup as False by default. This is because
>> >> >>> the migration between queue types (durable or quorum) can take several
>> >> >>> hours on larger deployments. So it might be worth making sure the
>> >> >>> default values only require one migration to quorum queues in the
>> >> >>> future, rather than two (durable queues now and then quorum queues in
>> >> >>> the future).
>> >> >>>
>> >> >>> We will need to make this switch eventually, but right now RabbitMQ
>> >> >>> 4.0 does not even have a set release date, so it's not the most urgent
>> >> >>> change.
>> >> >>>
>> >> >>> Cheers,
>> >> >>> Matt
>> >> >>>
>> >> >>> >Hi Michal,
>> >> >>> >
>> >> >>> >Feel free to propose a change of the default in the master branch, but I don't think we can change the default in stable branches without impacting users.
>> >> >>> >
>> >> >>> >Best regards,
>> >> >>> >Michal
>> >> >>> >
>> >> >>> >> On 11 Apr 2023, at 15:18, Michal Arbet <michal.arbet at ultimum.io> wrote:
>> >> >>> >>
>> >> >>> >> Hi,
>> >> >>> >>
>> >> >>> >> Btw, why do we have this option set to false?
>> >> >>> >> Michal Arbet
>> >> >>> >> Openstack Engineer
>> >> >>> >>
>> >> >>> >> Ultimum Technologies a.s.
>> >> >>> >> Na Poříčí 1047/26, 11000 Praha 1
>> >> >>> >> Czech Republic
>> >> >>> >>
>> >> >>> >> +420 604 228 897
>> >> >>> >> michal.arbet at ultimum.io
>> >> >>> >> https://ultimum.io
>> >> >>> >>
>> >> >>> >> LinkedIn <https://www.linkedin.com/company/ultimum-technologies> | Twitter <https://twitter.com/ultimumtech> | Facebook <https://www.facebook.com/ultimumtechnologies/timeline>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Tue, 11 Apr 2023 at 14:48, Michał Nasiadka <mnasiadka at gmail.com> wrote:
>> >> >>> >>> Hello,
>> >> >>> >>>
>> >> >>> >>> RabbitMQ HA has been backported into stable releases, and it's documented here:
>> >> >>> >>> https://docs.openstack.org/kolla-ansible/yoga/reference/message-queues/rabbitmq.html#high-availability
>> >> >>> >>>
>> >> >>> >>> Best regards,
>> >> >>> >>> Michal
>> >> >>> >>>
>> >> >>> >>> On Tue, 11 Apr 2023 at 13:32, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>> >> >>> >>>> Yes.
>> >> >>> >>>> But the cluster cannot work properly without it. :(
>> >> >>> >>>>
>> >> >>> >>>>> On Tue, Apr 11, 2023, 6:18 PM Danny Webb <Danny.Webb at thehutgroup.com> wrote:
>> >> >>> >>>>> This commit explains why they largely removed HA queue durability:
>> >> >>> >>>>>
>> >> >>> >>>>> https://opendev.org/openstack/kolla-ansible/commit/2764844ee2ff9393a4eebd90a9a912588af0a180
>> >> >>> >>>>> From: Satish Patel <satish.txt at gmail.com>
>> >> >>> >>>>> Sent: 09 April 2023 04:16
>> >> >>> >>>>> To: Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>
>> >> >>> >>>>> Cc: OpenStack Discuss <openstack-discuss at lists.openstack.org>
>> >> >>> >>>>> Subject: Re: [openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> CAUTION: This email originates from outside THG
>> >> >>> >>>>>
>> >> >>> >>>>> Are you proposing a solution or just raising an issue?
>> >> >>> >>>>>
>> >> >>> >>>>> I did find it strange that kolla-ansible doesn't support HA queues by default. That is a disaster, because when one of the nodes goes down it makes the whole RabbitMQ cluster unusable. Whenever I deploy Kolla I have to add an HA policy to make the queues HA, otherwise you will end up with problems.
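>> >> >>> >>>>>
>> >> >>> >>>>> For reference, the policy I add is something along these lines, run inside the rabbitmq container (the pattern just excludes fanout and reply queues, same as the definitions.json snippet quoted below; adjust to taste):
>> >> >>> >>>>>
>> >> >>> >>>>> docker exec rabbitmq rabbitmqctl set_policy -p / --apply-to all ha-all '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode":"all"}'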
>> >> >>> >>>>>
>> >> >>> >>>>> On Sat, Apr 8, 2023 at 6:40 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>> >> >>> >>>>> Hello everyone.
>> >> >>> >>>>>
>> >> >>> >>>>> I want to summarize, for anyone who runs into problems with OpenStack when deploying a cluster with 3 controllers using Kolla Ansible.
>> >> >>> >>>>>
>> >> >>> >>>>> Scenario: 1 of 3 controllers is down
>> >> >>> >>>>>
>> >> >>> >>>>> 1. Logging in to Horizon and using APIs such as Nova and Cinder will be very slow
>> >> >>> >>>>>
>> >> >>> >>>>> Fix by editing the following templates:
>> >> >>> >>>>>
>> >> >>> >>>>> kolla-ansible/ansible/roles/heat/templates/heat.conf.j2
>> >> >>> >>>>> kolla-ansible/ansible/roles/nova/templates/nova.conf.j2
>> >> >>> >>>>> kolla-ansible/ansible/roles/keystone/templates/keystone.conf.j2
>> >> >>> >>>>> kolla-ansible/ansible/roles/neutron/templates/neutron.conf.j2
>> >> >>> >>>>> kolla-ansible/ansible/roles/cinder/templates/cinder.conf.j2
>> >> >>> >>>>>
>> >> >>> >>>>> (or whichever other services need caching)
>> >> >>> >>>>>
>> >> >>> >>>>> and add the following:
>> >> >>> >>>>>
>> >> >>> >>>>> [cache]
>> >> >>> >>>>> backend = oslo_cache.memcache_pool
>> >> >>> >>>>> enabled = True
>> >> >>> >>>>> memcache_servers = {{ kolla_internal_vip_address }}:{{ memcached_port }}
>> >> >>> >>>>> memcache_dead_retry = 0.25
>> >> >>> >>>>> memcache_socket_timeout = 900
>> >> >>> >>>>>
>> >> >>> >>>>> https://review.opendev.org/c/openstack/kolla-ansible/+/849487
>> >> >>> >>>>>
>> >> >>> >>>>> But that is not the end of it.
>> >> >>> >>>>>
>> >> >>> >>>>> 2. Cannot launch an instance or map a block device (it gets stuck at this step)
>> >> >>> >>>>>
>> >> >>> >>>>> Edit kolla-ansible/ansible/roles/rabbitmq/templates/definitions.json.j2:
>> >> >>> >>>>>
>> >> >>> >>>>> "policies":[
>> >> >>> >>>>> {"vhost": "/", "name": "ha-all", "pattern": "^(?!(amq\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %},
>> >> >>> >>>>> {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}
>> >> >>> >>>>> {% endif %}
>> >> >>> >>>>> ]
>> >> >>> >>>>>
>> >> >>> >>>>> Edit /etc/kolla/global.conf:
>> >> >>> >>>>>
>> >> >>> >>>>> [oslo_messaging_rabbit]
>> >> >>> >>>>> kombu_reconnect_delay=0.5
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> https://bugs.launchpad.net/oslo.messaging/+bug/1993149
>> >> >>> >>>>> https://docs.openstack.org/large-scale/journey/configure/rabbitmq.html
>> >> >>> >>>>>
>> >> >>> >>>>> I used Xena 13.4 and Yoga 14.8.1.
>> >> >>> >>>>>
>> >> >>> >>>>> The above bugs are critical, but I see that they have not been fixed. I am just an operator, and I want to share what I encountered for new people who come to OpenStack.
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> Nguyen Huu Khoi
>> >> >>> >>> --
>> >> >>> >>> Michał Nasiadka
>> >> >>> >>> mnasiadka at gmail.com
>> >> >>>


