[openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down

Matt Crees mattc at stackhpc.com
Wed Apr 12 08:20:58 UTC 2023


Hi all,

It seems worth noting here that a fix is in progress in
oslo.messaging which will resolve the issues with HA failing when one
node is down. See here:
https://review.opendev.org/c/openstack/oslo.messaging/+/866617
In the meantime, we have also found that setting kombu_reconnect_delay
= 0.5 does resolve this issue.
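
For anyone who wants to try that workaround with Kolla Ansible, a
minimal sketch of the override (assuming the standard custom config
location, /etc/kolla/config/global.conf, so it is merged into every
service's oslo.messaging configuration):

# /etc/kolla/config/global.conf
[oslo_messaging_rabbit]
# Shorten the delay before clients reconnect to a surviving
# RabbitMQ node after one goes down.
kombu_reconnect_delay = 0.5

followed by a "kolla-ansible reconfigure" run so the option is
rendered into the services' config files.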

As for why om_enable_rabbitmq_high_availability currently defaults to
false: as Michal said, enabling it in stable releases will impact
users. This is because it enables durable queues, and the migration
from transient to durable queues is not a seamless procedure. It
requires that the state of RabbitMQ is reset and that the OpenStack
services which use RabbitMQ are restarted to recreate the queues.
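
For reference, turning the feature on is itself just a one-line flag
(a sketch below, assuming the usual /etc/kolla/globals.yml); it is the
queue migration that follows which is the disruptive part:

# /etc/kolla/globals.yml
# Enables classic queue mirroring plus durable queues for the
# OpenStack services that use RabbitMQ.
om_enable_rabbitmq_high_availability: true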

I think that there is some merit in changing this default value. But
if we did this, we should either add additional support to automate
the migration from transient to durable queues, or at the very least
provide some decent docs on the manual procedure.
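
To give a rough idea, the manual procedure currently looks something
like the outline below (only a sketch, not official documentation;
exact steps will vary per deployment):

# 1. Stop the OpenStack services that consume from RabbitMQ.
# 2. On each controller, reset RabbitMQ so the old transient queues
#    and exchanges are dropped:
rabbitmqctl stop_app
rabbitmqctl force_reset
rabbitmqctl start_app
# 3. Reconfigure with om_enable_rabbitmq_high_availability: true and
#    restart the OpenStack services so they recreate their queues as
#    durable ones.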

However, as classic queue mirroring is deprecated in RabbitMQ (to be
removed in RabbitMQ 4.0), we should perhaps consider switching to
quorum queues soon. In that case it may be beneficial to leave the
classic queue mirroring + durable queues setup disabled by default,
because a migration between queue types (to durable or to quorum) can
take several hours on larger deployments. So it might be worth making
sure the default values only require one migration in the future, to
quorum queues, rather than two (durable queues now and quorum queues
later).
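
(For illustration only: on the oslo.messaging side, the eventual
switch is expected to hinge on the rabbit_quorum_queue option (I am
assuming that option name here, so check the oslo.messaging release
notes for the version you run):

[oslo_messaging_rabbit]
# Quorum queues are replicated and durable by design, so this replaces
# both classic mirroring and the separate durable-queue step.
rabbit_quorum_queue = true

Either way the queues still have to be recreated, which is exactly the
migration cost described above.)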

We will need to make this switch eventually, but right now RabbitMQ
4.0 does not even have a set release date, so it's not the most urgent
change.

Cheers,
Matt

>Hi Michal,
>
>Feel free to propose a change of the default in the master branch, but I don't think we can change the default in stable branches without impacting users.
>
>Best regards,
>Michal
>
>> On 11 Apr 2023, at 15:18, Michal Arbet <michal.arbet at ultimum.io> wrote:
>>
>> Hi,
>>
>> Btw, why do we have such an option set to false?
>> Michal Arbet
>> Openstack Engineer
>>
>> Ultimum Technologies a.s.
>> Na Poříčí 1047/26, 11000 Praha 1
>> Czech Republic
>>
>> +420 604 228 897
>> michal.arbet at ultimum.io
>> https://ultimum.io
>>
>> LinkedIn <https://www.linkedin.com/company/ultimum-technologies> | Twitter <https://twitter.com/ultimumtech> | Facebook <https://www.facebook.com/ultimumtechnologies/timeline>
>>
>>
>> On Tue, 11 Apr 2023 at 14:48, Michał Nasiadka <mnasiadka at gmail.com> wrote:
>>> Hello,
>>>
>>> RabbitMQ HA has been backported into stable releases, and it's documented here:
>>> https://docs.openstack.org/kolla-ansible/yoga/reference/message-queues/rabbitmq.html#high-availability
>>>
>>> Best regards,
>>> Michal
>>>
>>> On Tue, 11 Apr 2023 at 13:32, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>>>> Yes.
>>>> But the cluster cannot work properly without it. :(
>>>>
>>>> On Tue, Apr 11, 2023, 6:18 PM Danny Webb <Danny.Webb at thehutgroup.com> wrote:
>>>>> This commit explains why they largely removed HA queue durability:
>>>>>
>>>>> https://opendev.org/openstack/kolla-ansible/commit/2764844ee2ff9393a4eebd90a9a912588af0a180
>>>>> From: Satish Patel <satish.txt at gmail.com>
>>>>> Sent: 09 April 2023 04:16
>>>>> To: Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>
>>>>> Cc: OpenStack Discuss <openstack-discuss at lists.openstack.org>
>>>>> Subject: Re: [openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down
>>>>>
>>>>>
>>>>> Are you proposing a solution or just raising an issue?
>>>>>
>>>>> I did find it strange that kolla-ansible doesn't support HA queues by default. That is a disaster, because when one of the nodes goes down it makes the whole RabbitMQ cluster unusable. Whenever I deploy Kolla I have to add an HA policy to make the queues HA, otherwise you will end up with problems.
>>>>>
>>>>> On Sat, Apr 8, 2023 at 6:40 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>>>>> Hello everyone.
>>>>>
>>>>> I want to summarize, for anyone who runs into problems with OpenStack when deploying a cluster with 3 controllers using Kolla Ansible:
>>>>>
>>>>> Scenario: 1 of 3 controllers is down
>>>>>
>>>>> 1. Logging in to Horizon and using APIs such as nova and cinder will be very slow
>>>>>
>>>>> fix by:
>>>>>
>>>>> nano:
>>>>> kolla-ansible/ansible/roles/heat/templates/heat.conf.j2
>>>>> kolla-ansible/ansible/roles/nova/templates/nova.conf.j2
>>>>> kolla-ansible/ansible/roles/keystone/templates/keystone.conf.j2
>>>>> kolla-ansible/ansible/roles/neutron/templates/neutron.conf.j2
>>>>> kolla-ansible/ansible/roles/cinder/templates/cinder.conf.j2
>>>>>
>>>>> or whichever other services need caching
>>>>>
>>>>> and add the following:
>>>>>
>>>>> [cache]
>>>>> backend = oslo_cache.memcache_pool
>>>>> enabled = True
>>>>> memcache_servers = {{ kolla_internal_vip_address }}:{{ memcached_port }}
>>>>> memcache_dead_retry = 0.25
>>>>> memcache_socket_timeout = 900
>>>>>
>>>>> https://review.opendev.org/c/openstack/kolla-ansible/+/849487
>>>>>
>>>>> but that is not the end of it
>>>>>
>>>>> 2. Cannot launch instances or map block devices (stuck at this step)
>>>>>
>>>>> nano kolla-ansible/ansible/roles/rabbitmq/templates/definitions.json.j2
>>>>>
>>>>> "policies":[
>>>>> {"vhost": "/", "name": "ha-all", "pattern": "^(?!(amq\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %},
>>>>> {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}
>>>>> {% endif %}
>>>>> ]
>>>>>
>>>>> nano /etc/kolla/global.conf
>>>>>
>>>>> [oslo_messaging_rabbit]
>>>>> kombu_reconnect_delay=0.5
>>>>>
>>>>>
>>>>> https://bugs.launchpad.net/oslo.messaging/+bug/1993149
>>>>> https://docs.openstack.org/large-scale/journey/configure/rabbitmq.html
>>>>>
>>>>> I used Xena 13.4 and Yoga 14.8.1.
>>>>>
>>>>> The above bugs are critical, but I see that they have not been fixed. I am just an operator, and I want to share what I encountered for new people coming to OpenStack.
>>>>>
>>>>>
>>>>> Nguyen Huu Khoi
>>> --
>>> Michał Nasiadka
>>> mnasiadka at gmail.com


