[openstack][sharing][kolla ansible]Problems when 1 of 3 controller was be down

Satish Patel satish.txt at gmail.com
Wed Apr 12 14:34:44 UTC 2023


Hi Matt,

How do I set kombu_reconnect_delay=0.5 option?

Something like the following in global.yml?

kombu_reconnect_delay: 0.5

On Wed, Apr 12, 2023 at 4:23 AM Matt Crees <mattc at stackhpc.com> wrote:

> Hi all,
>
> It seems worth noting here that there is a fix ongoing in
> oslo.messaging which will resolve the issues with HA failing when one
> node is down. See here:
> https://review.opendev.org/c/openstack/oslo.messaging/+/866617
> In the meantime, we have also found that setting kombu_reconnect_delay
> = 0.5 does resolve this issue.
>
> As for why om_enable_rabbitmq_high_availability is currently
> defaulting to false, as Michal said enabling it in stable releases
> will impact users. This is because it enables durable queues, and the
> migration from transient to durable queues is not a seamless
> procedure. It requires that the state of RabbitMQ is reset and that
> the OpenStack services which use RabbitMQ are restarted to recreate
> the queues.
>
> I think that there is some merit in changing this default value. But
> if we did this, we should either add additional support to automate
> the migration from transient to durable queues, or at the very least
> provide some decent docs on the manual procedure.
>
> However, as classic queue mirroring is deprecated in RabbitMQ (to be
> removed in RabbitMQ 4.0) we should maybe consider switching to quorum
> queues soon. Then it may be beneficial to leave the classic queue
> mirroring + durable queues setup as False by default. This is because
> the migration between queue types (durable or quorum) can take several
> hours on larger deployments. So it might be worth making sure the
> default values only require one migration to quorum queues in the
> future, rather than two (durable queues now and then quorum queues in
> the future).
>
> We will need to make this switch eventually, but right now RabbitMQ
> 4.0 does not even have a set release date, so it's not the most urgent
> change.
>
> Cheers,
> Matt
>
> >Hi Michal,
> >
> >Feel free to propose change of default in master branch, but I don?t
> think we can change the default in stable branches without impacting users.
> >
> >Best regards,
> >Michal
> >
> >> On 11 Apr 2023, at 15:18, Michal Arbet <michal.arbet at ultimum.io> wrote:
> >>
> >> Hi,
> >>
> >> Btw, why we have such option set to false ?
> >> Michal Arbet
> >> Openstack Engineer
> >>
> >> Ultimum Technologies a.s.
> >> Na Po???? 1047/26, 11000 Praha 1
> >> Czech Republic
> >>
> >> +420 604 228 897 <>
> >> michal.arbet at ultimum.io <mailto:michal.arbet at ultimum.io>
> >> https://ultimum.io <https://ultimum.io/>
> >>
> >> LinkedIn <https://www.linkedin.com/company/ultimum-technologies> |
> Twitter <https://twitter.com/ultimumtech> | Facebook <
> https://www.facebook.com/ultimumtechnologies/timeline>
> >>
> >>
> >> ?t 11. 4. 2023 v 14:48 odes?latel Micha? Nasiadka <mnasiadka at gmail.com
> <mailto:mnasiadka at gmail.com>> napsal:
> >>> Hello,
> >>>
> >>> RabbitMQ HA has been backported into stable releases, and it?s
> documented here:
> >>>
> https://docs.openstack.org/kolla-ansible/yoga/reference/message-queues/rabbitmq.html#high-availability
> >>>
> >>> Best regards,
> >>> Michal
> >>>
> >>> W dniu wt., 11.04.2023 o 13:32 Nguy?n H?u Kh?i <
> nguyenhuukhoinw at gmail.com <mailto:nguyenhuukhoinw at gmail.com>> napisa?(a):
> >>>> Yes.
> >>>> But cluster cannot work properly without it. :(
> >>>>
> >>>> On Tue, Apr 11, 2023, 6:18 PM Danny Webb <Danny.Webb at thehutgroup.com
> <mailto:Danny.Webb at thehutgroup.com>> wrote:
> >>>>> This commit explains why they largely removed HA queue durability:
> >>>>>
> >>>>>
> https://opendev.org/openstack/kolla-ansible/commit/2764844ee2ff9393a4eebd90a9a912588af0a180
> >>>>> From: Satish Patel <satish.txt at gmail.com <mailto:
> satish.txt at gmail.com>>
> >>>>> Sent: 09 April 2023 04:16
> >>>>> To: Nguy?n H?u Kh?i <nguyenhuukhoinw at gmail.com <mailto:
> nguyenhuukhoinw at gmail.com>>
> >>>>> Cc: OpenStack Discuss <openstack-discuss at lists.openstack.org
> <mailto:openstack-discuss at lists.openstack.org>>
> >>>>> Subject: Re: [openstack][sharing][kolla ansible]Problems when 1 of 3
> controller was be down
> >>>>>
> >>>>>
> >>>>> CAUTION: This email originates from outside THG
> >>>>>
> >>>>> Are you proposing a solution or just raising an issue?
> >>>>>
> >>>>> I did find it strange that kolla-ansible doesn't support HA queue by
> default. That is a disaster because when one of the nodes goes down it will
> make the whole rabbitMQ unacceptable. Whenever i deploy kolla i have to add
> HA policy to make queue HA otherwise you will endup in problem.
> >>>>>
> >>>>> On Sat, Apr 8, 2023 at 6:40?AM Nguy?n H?u Kh?i <
> nguyenhuukhoinw at gmail.com <mailto:nguyenhuukhoinw at gmail.com>> wrote:
> >>>>> Hello everyone.
> >>>>>
> >>>>> I want to summary for who meets problems with Openstack when deploy
> cluster with 3 controller using Kolla Ansible
> >>>>>
> >>>>> Scenario: 1 of 3 controller is down
> >>>>>
> >>>>> 1. Login horizon and use API such as nova, cinder will be very slow
> >>>>>
> >>>>> fix by:
> >>>>>
> >>>>> nano:
> >>>>> kolla-ansible/ansible/roles/heat/templates/heat.conf.j2
> >>>>> kolla-ansible/ansible/roles/nova/templates/nova.conf.j2
> >>>>> kolla-ansible/ansible/roles/keystone/templates/keystone.conf.j2
> >>>>> kolla-ansible/ansible/roles/neutron/templates/neutron.conf.j2
> >>>>> kolla-ansible/ansible/roles/cinder/templates/cinder.conf.j2
> >>>>>
> >>>>> or which service need caches
> >>>>>
> >>>>> add as below
> >>>>>
> >>>>> [cache]
> >>>>> backend = oslo_cache.memcache_pool
> >>>>> enabled = True
> >>>>> memcache_servers = {{ kolla_internal_vip_address }}:{{
> memcached_port }}
> >>>>> memcache_dead_retry = 0.25
> >>>>> memcache_socket_timeout = 900
> >>>>>
> >>>>> https://review.opendev.org/c/openstack/kolla-ansible/+/849487
> >>>>>
> >>>>> but it is not the end
> >>>>>
> >>>>> 2. Cannot launch instance or mapping block device(stuck at this step)
> >>>>>
> >>>>> nano
> kolla-ansible/ansible/roles/rabbitmq/templates/definitions.json.j2
> >>>>>
> >>>>> "policies":[
> >>>>> {"vhost": "/", "name": "ha-all", "pattern":
> "^(?!(amq\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition":
> {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %},
> >>>>> {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all",
> "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"},
> "priority":0}
> >>>>> {% endif %}
> >>>>> ]
> >>>>>
> >>>>> nano /etc/kollla/global.conf
> >>>>>
> >>>>> [oslo_messaging_rabbit]
> >>>>> kombu_reconnect_delay=0.5
> >>>>>
> >>>>>
> >>>>> https://bugs.launchpad.net/oslo.messaging/+bug/1993149
> >>>>>
> https://docs.openstack.org/large-scale/journey/configure/rabbitmq.html
> >>>>>
> >>>>> I used Xena 13.4 and Yoga 14.8.1.
> >>>>>
> >>>>> Above bugs are critical, but I see that it was not fixed. I am just
> an operator and I want to share what I encountered for new people who come
> to Openstack
> >>>>>
> >>>>>
> >>>>> Nguyen Huu Khoi
> >>> --
> >>> Micha? Nasiadka
> >>> mnasiadka at gmail.com <mailto:mnasiadka at gmail.com>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.openstack.org/pipermail/openstack-discuss/attachments/20230412/fdf8bf42/attachment.htm>


More information about the openstack-discuss mailing list