[openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down

Matt Crees mattc at stackhpc.com
Wed Apr 12 15:10:05 UTC 2023


Hi Satish,

Yes, for a new deployment you just need to set that variable to
true. Note, however, that this enables high availability of RabbitMQ
queues using a combination of classic queue mirroring and durable
queues, not quorum queues.
Quorum queues are not yet supported via Kolla Ansible.
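
As a minimal sketch, assuming your globals file is in the usual
/etc/kolla/globals.yml location (adjust the path if yours lives
elsewhere), the setting for a fresh deployment would look something
like this:

```yaml
# /etc/kolla/globals.yml (assumed default path)
# Enables classic queue mirroring plus durable queues.
# This does not enable quorum queues.
om_enable_rabbitmq_high_availability: true
```

On an existing deployment this is not enough on its own, because of
the transient-to-durable queue migration described further down.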

Cheers,
Matt

On Wed, 12 Apr 2023 at 16:04, Satish Patel <satish.txt at gmail.com> wrote:
>
> Matt,
>
> For a new deployment, how do I enable quorum queues?
>
> Would just adding the following be enough?
>
> om_enable_rabbitmq_high_availability: True
>
> On Wed, Apr 12, 2023 at 10:54 AM Matt Crees <mattc at stackhpc.com> wrote:
>>
>> Yes, and the option also needs to be under the oslo_messaging_rabbit heading:
>>
>> [oslo_messaging_rabbit]
>> kombu_reconnect_delay=0.5
>>
>>
>> On Wed, 12 Apr 2023 at 15:45, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>> >
>> > Hi.
>> > Create global.conf in /etc/kolla/config/
>> >
>> > On Wed, Apr 12, 2023, 9:42 PM Satish Patel <satish.txt at gmail.com> wrote:
>> >>
>> >> Hi Matt,
>> >>
>> >> How do I set the kombu_reconnect_delay=0.5 option?
>> >>
>> >> Something like the following in globals.yml?
>> >>
>> >> kombu_reconnect_delay: 0.5
>> >>
>> >> On Wed, Apr 12, 2023 at 4:23 AM Matt Crees <mattc at stackhpc.com> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> It seems worth noting here that there is a fix ongoing in
>> >>> oslo.messaging which will resolve the issues with HA failing when one
>> >>> node is down. See here:
>> >>> https://review.opendev.org/c/openstack/oslo.messaging/+/866617
>> >>> In the meantime, we have also found that setting kombu_reconnect_delay
>> >>> = 0.5 does resolve this issue.
>> >>>
>> >>> As for why om_enable_rabbitmq_high_availability currently
>> >>> defaults to false: as Michal said, enabling it in stable releases
>> >>> will impact users. This is because it enables durable queues, and the
>> >>> migration from transient to durable queues is not a seamless
>> >>> procedure. It requires that the state of RabbitMQ is reset and that
>> >>> the OpenStack services which use RabbitMQ are restarted to recreate
>> >>> the queues.
>> >>>
>> >>> I think that there is some merit in changing this default value. But
>> >>> if we did this, we should either add additional support to automate
>> >>> the migration from transient to durable queues, or at the very least
>> >>> provide some decent docs on the manual procedure.
>> >>>
>> >>> However, as classic queue mirroring is deprecated in RabbitMQ (to be
>> >>> removed in RabbitMQ 4.0) we should maybe consider switching to quorum
>> >>> queues soon. Then it may be beneficial to leave the classic queue
>> >>> mirroring + durable queues setup as False by default. This is because
>> >>> the migration between queue types (durable or quorum) can take several
>> >>> hours on larger deployments. So it might be worth making sure the
>> >>> default values only require one migration to quorum queues in the
>> >>> future, rather than two (durable queues now and then quorum queues in
>> >>> the future).
>> >>>
>> >>> We will need to make this switch eventually, but right now RabbitMQ
>> >>> 4.0 does not even have a set release date, so it's not the most urgent
>> >>> change.
>> >>>
>> >>> Cheers,
>> >>> Matt
>> >>>
>> >>> >Hi Michal,
>> >>> >
>> >>> >Feel free to propose a change of the default in the master branch, but I don't think we can change the default in stable branches without impacting users.
>> >>> >
>> >>> >Best regards,
>> >>> >Michal
>> >>> >
>> >>> >> On 11 Apr 2023, at 15:18, Michal Arbet <michal.arbet at ultimum.io> wrote:
>> >>> >>
>> >>> >> Hi,
>> >>> >>
>> >>> >> Btw, why do we have this option set to false?
>> >>> >> Michal Arbet
>> >>> >>
>> >>> >> On Tue, 11 Apr 2023 at 14:48, Michał Nasiadka <mnasiadka at gmail.com> wrote:
>> >>> >>> Hello,
>> >>> >>>
>> >>> >>> RabbitMQ HA has been backported into stable releases, and it's documented here:
>> >>> >>> https://docs.openstack.org/kolla-ansible/yoga/reference/message-queues/rabbitmq.html#high-availability
>> >>> >>>
>> >>> >>> Best regards,
>> >>> >>> Michal
>> >>> >>>
>> >>> >>> On Tue, 11 Apr 2023 at 13:32, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>> >>> >>>> Yes.
>> >>> >>>> But the cluster cannot work properly without it. :(
>> >>> >>>>
>> >>> >>>> On Tue, Apr 11, 2023, 6:18 PM Danny Webb <Danny.Webb at thehutgroup.com> wrote:
>> >>> >>>>> This commit explains why they largely removed HA queue durability:
>> >>> >>>>>
>> >>> >>>>> https://opendev.org/openstack/kolla-ansible/commit/2764844ee2ff9393a4eebd90a9a912588af0a180
>> >>> >>>>> From: Satish Patel <satish.txt at gmail.com>
>> >>> >>>>> Sent: 09 April 2023 04:16
>> >>> >>>>> To: Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>
>> >>> >>>>> Cc: OpenStack Discuss <openstack-discuss at lists.openstack.org>
>> >>> >>>>> Subject: Re: [openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> Are you proposing a solution or just raising an issue?
>> >>> >>>>>
>> >>> >>>>> I did find it strange that kolla-ansible doesn't enable HA queues by default. That is a disaster, because when one of the nodes goes down it makes the whole RabbitMQ cluster unusable. Whenever I deploy kolla I have to add an HA policy to make the queues highly available; otherwise you will end up with problems.
>> >>> >>>>>
>> >>> >>>>> On Sat, Apr 8, 2023 at 6:40 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>> >>> >>>>> Hello everyone.
>> >>> >>>>>
>> >>> >>>>> I want to summarize this for anyone who runs into problems with OpenStack when deploying a cluster with 3 controllers using Kolla Ansible.
>> >>> >>>>>
>> >>> >>>>> Scenario: 1 of 3 controllers is down
>> >>> >>>>>
>> >>> >>>>> 1. Logging in to Horizon and using APIs such as Nova and Cinder becomes very slow
>> >>> >>>>>
>> >>> >>>>> Fix: edit the following templates (for example, with nano):
>> >>> >>>>> kolla-ansible/ansible/roles/heat/templates/heat.conf.j2
>> >>> >>>>> kolla-ansible/ansible/roles/nova/templates/nova.conf.j2
>> >>> >>>>> kolla-ansible/ansible/roles/keystone/templates/keystone.conf.j2
>> >>> >>>>> kolla-ansible/ansible/roles/neutron/templates/neutron.conf.j2
>> >>> >>>>> kolla-ansible/ansible/roles/cinder/templates/cinder.conf.j2
>> >>> >>>>>
>> >>> >>>>> or the template of any other service that needs caching,
>> >>> >>>>>
>> >>> >>>>> and add the following:
>> >>> >>>>>
>> >>> >>>>> [cache]
>> >>> >>>>> backend = oslo_cache.memcache_pool
>> >>> >>>>> enabled = True
>> >>> >>>>> memcache_servers = {{ kolla_internal_vip_address }}:{{ memcached_port }}
>> >>> >>>>> memcache_dead_retry = 0.25
>> >>> >>>>> memcache_socket_timeout = 900
>> >>> >>>>>
>> >>> >>>>> https://review.opendev.org/c/openstack/kolla-ansible/+/849487
>> >>> >>>>>
>> >>> >>>>> But that is not the end of it.
>> >>> >>>>>
>> >>> >>>>> 2. Instances cannot be launched, or they get stuck at the block device mapping step
>> >>> >>>>>
>> >>> >>>>> nano kolla-ansible/ansible/roles/rabbitmq/templates/definitions.json.j2
>> >>> >>>>>
>> >>> >>>>> "policies":[
>> >>> >>>>> {"vhost": "/", "name": "ha-all", "pattern": "^(?!(amq\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %},
>> >>> >>>>> {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}
>> >>> >>>>> {% endif %}
>> >>> >>>>> ]
>> >>> >>>>>
>> >>> >>>>> nano /etc/kolla/config/global.conf
>> >>> >>>>>
>> >>> >>>>> [oslo_messaging_rabbit]
>> >>> >>>>> kombu_reconnect_delay=0.5
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> https://bugs.launchpad.net/oslo.messaging/+bug/1993149
>> >>> >>>>> https://docs.openstack.org/large-scale/journey/configure/rabbitmq.html
>> >>> >>>>>
>> >>> >>>>> I used Xena 13.4 and Yoga 14.8.1.
>> >>> >>>>>
>> >>> >>>>> The bugs above are critical, but I see that they have not been fixed. I am just an operator, and I want to share what I encountered with people who are new to OpenStack.
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> Nguyen Huu Khoi
>> >>> >>> --
>> >>> >>> Michał Nasiadka
>> >>> >>> mnasiadka at gmail.com
>> >>>


