[openstack][sharing][kolla ansible] Problems when 1 of 3 controllers was down

Nguyễn Hữu Khôi nguyenhuukhoinw at gmail.com
Mon Apr 24 06:17:13 UTC 2023


Hi everyone,
An update on this: when failing over a controller, we do not need mirrored
queues if we use Ceph as the backend. In my case, I use a SAN as the Cinder
backend and NFS as the Glance backend, so I do need mirrored queues. It is
quite strange.
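
For anyone interested, the mirrored-queue setup I mean is roughly equivalent
to a RabbitMQ policy like the one below. This is only an illustration: the
policy name and pattern are placeholders, and in practice the policy is
applied by your deployment tooling (Kolla Ansible in my case), so the exact
values will differ.

    # Illustration only: mirror all queues except the amq.* internal ones
    # on the default vhost, and sync mirrors automatically.
    rabbitmqctl set_policy --apply-to queues ha-all '^(?!amq\.).*' \
        '{"ha-mode":"all","ha-sync-mode":"automatic"}'
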
Nguyen Huu Khoi


On Mon, Apr 17, 2023 at 5:03 PM Doug Szumski <doug at stackhpc.com> wrote:

> On 13/04/2023 23:04, Satish Patel wrote:
> > Thank Sean/Matt,
> >
> > Interesting that my only option is to use classic mirrored queues
> > combined with durable queues :( because without mirroring the cluster
> > acts up when one of the nodes is down.
>
> If you have planned maintenance and non-mirrored transient queues, you
> can first try draining the node to be removed, before removing it from
> the cluster. In my testing at least, this appears to be much more
> successful than relying on the RabbitMQ clients to do the failover and
> recreate queues.  See [1], or for RMQ <3.8 you can cobble something
> together with ha-mode nodes [2].
>
> [1] https://www.rabbitmq.com/upgrade.html#maintenance-mode
>
> [2] https://www.rabbitmq.com/ha.html#mirroring-arguments
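>
> As a rough illustration (untested here, and assuming a RabbitMQ release
> new enough to have maintenance mode, i.e. 3.8.x or later), draining the
> node to be removed looks something like this, run on that node:
>
>     # Put the node into maintenance mode: it stops accepting client
>     # connections and hands over queue leadership where it can.
>     rabbitmq-upgrade drain
>
>     # ... perform the maintenance, or remove the node from the cluster ...
>
>     # If the node is coming back, return it to normal operation.
>     rabbitmq-upgrade revive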
>
> This obviously doesn't solve the case of when a controller fails
> unexpectedly.
>
> I also think it's worth making the distinction between a highly
> available messaging infrastructure, and queue mirroring. In many cases,
> if a RMQ node hosting a non-mirrored, transient queue goes down, it
> should be possible for a service to just recreate the queue on another
> node and retry. This often seems to fail, which leads to queue mirroring
> getting turned on.
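>
> For context, the client-side knobs usually involved here are the
> oslo.messaging options below (shown purely as an illustrative snippet;
> defaults and exact behaviour vary between releases):
>
>     [oslo_messaging_rabbit]
>     # Declare exchanges and queues as durable, so they survive a broker
>     # restart.
>     amqp_durable_queues = true
>     # Used together with a server-side ha-mode policy; on RabbitMQ >= 3.0
>     # the policy is what actually enables mirroring.
>     rabbit_ha_queues = true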
>
> >
> > How are people scaling RabbitMQ at large scale?
> >
> > Sent from my iPhone
> >
> >> On Apr 13, 2023, at 7:50 AM, Sean Mooney <smooney at redhat.com> wrote:
> >>
> >> On Thu, 2023-04-13 at 09:07 +0100, Matt Crees wrote:
> >>> Hi all, I'll reply in turn here:
> >>>
> >>> Radek, I agree it definitely will be a good addition to the KA docs.
> >>> I've got it on my radar, will aim to get a patch proposed this week.
> >>>
> >>> Satish, I haven't personally been able to test durable queues on a
> >>> system that large. According to the RabbitMQ docs
> >>> (https://www.rabbitmq.com/queues.html#durability), "Throughput and
> >>> latency of a queue is not affected by whether a queue is durable or
> >>> not in most cases." However, I have anecdotally heard that it can
> >>> have some performance impact in particularly large systems.
> >> The performance of durable queues is dominated by disk IO.
> >> Put RabbitMQ on a PCIe NVMe SSD and durability will have little effect;
> >> use spinning rust (an HDD, even in RAID 10) and the IOPS/throughput of
> >> the storage used to make the queues durable (which just means writing
> >> all messages to disk) will be the bottleneck for scalability.
> >> Combine that with HA and it gets worse, as messages have to be written
> >> to multiple servers.
> >> I believe the mirroring implementation waits for all copies to be
> >> persisted, but I have not really looked into it. It has been raised as a
> >> pain point with RabbitMQ by operators in the past, in terms of scaling.
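> >>
> >> If you want a rough feel for what the storage can sustain for this kind
> >> of workload, something like the fio run below (an illustrative sketch,
> >> not a rigorous benchmark) approximates small fsync-ed writes:
> >>
> >>     # Small random writes with an fsync after every write, similar in
> >>     # spirit to persisting individual messages. Point --directory at
> >>     # whatever filesystem backs RabbitMQ's data dir (path is an example).
> >>     fio --name=msgstore --directory=/var/lib/rabbitmq \
> >>         --rw=randwrite --bs=4k --size=1g --fsync=1 \
> >>         --numjobs=1 --runtime=60 --time_based --group_reporting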
> >>
> >>> Please note that if you are using the classic mirrored queues, you
> >>> must also have them durable. Transient (i.e. non-durable) mirrored
> >>> queues are not a supported feature and do cause bugs. (For example the
> >>> "old incarnation" errors seen here:
> >>> https://bugs.launchpad.net/kolla-ansible/+bug/1954925)
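> >>>
> >>> A quick way to sanity-check an existing deployment (illustrative only;
> >>> run it against your RabbitMQ container/host) is to list each queue's
> >>> durability and the policy applied to it:
> >>>
> >>>     rabbitmqctl list_queues name durable policy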
> >>>
> >>> Nguyen, I can confirm that we've seen the same behaviour. This is
> >>> caused by a backported change in oslo.messaging (I believe you linked
> >>> the relevant bug report previously). A fix for this is in progress
> >>> (https://review.opendev.org/c/openstack/oslo.messaging/+/866617),
> >>> and in the meantime setting kombu_reconnect_delay < 1.0 does resolve
> >>> it.
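> >>>
> >>> For example, the workaround amounts to an override along these lines
> >>> (illustrative snippet; with Kolla Ansible you would put it in a service
> >>> config override file rather than editing config in place):
> >>>
> >>>     [oslo_messaging_rabbit]
> >>>     # Work around the reconnect issue until the oslo.messaging fix
> >>>     # lands; values below 1.0 have been reported to help.
> >>>     kombu_reconnect_delay = 0.5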
> >>>
> >>> Cheers,
> >>> Matt
> >>>
> >>>> On Thu, 13 Apr 2023 at 03:04, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
> >>>>
> >>>> update:
> >>>> I use SAN as Cinder backend.
> >>>> Nguyen Huu Khoi
> >>>>
> >>>>
> >>>> On Thu, Apr 13, 2023 at 9:02 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
> >>>>> Hello guys.
> >>>>>
> >>>>> I have run many tests on Xena and Yoga, and I am sure that without
> >>>>> HA queues and kombu_reconnect_delay=0.5 (it can be any value < 1),
> >>>>> you cannot launch instances when 1 of 3 controllers is down.
> >>>>> Can somebody verify what I am describing? I hope we will find a
> >>>>> common solution for this problem, because people who use OpenStack
> >>>>> for the first time will keep asking questions like this.
> >>>>>
> >>>>> Nguyen Huu Khoi
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 13, 2023 at 12:59 AM Satish Patel <satish.txt at gmail.com> wrote:
> >>>>>> This is great, Matt!
> >>>>>>
> >>>>>> Documentation would be greatly appreciated. I have a follow-up
> >>>>>> question: are durable queues a good fit for large clouds with 1000
> >>>>>> compute nodes, or is it better not to use them? This is a private
> >>>>>> cloud and we don't care about persistent data.
> >>>>>>
> >>>>>> On Wed, Apr 12, 2023 at 12:37 PM Radosław Piliszek <radoslaw.piliszek at gmail.com> wrote:
> >>>>>>> Hi Matt,
> >>>>>>>
> >>>>>>>> As you're now reconfiguring a running deployment, there are some
> >>>>>>>> extra steps that need to be taken to migrate to durable queues.
> >>>>>>>>
> >>>>>>>> 1. You will need to stop the OpenStack services which use
> >>>>>>>> RabbitMQ.
> >>>>>>>>
> >>>>>>>> 2. Reset the state of RabbitMQ on each RabbitMQ node with the
> >>>>>>>> following commands. You must run each command on all RabbitMQ
> >>>>>>>> nodes before moving on to the next command. This will remove all
> >>>>>>>> queues.
> >>>>>>>>
> >>>>>>>>     rabbitmqctl stop_app
> >>>>>>>>     rabbitmqctl force_reset
> >>>>>>>>     rabbitmqctl start_app
> >>>>>>>>
> >>>>>>>> 3. Start the OpenStack services again, at which point they will
> >>>>>>>> recreate the appropriate queues as durable.
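> >>>>>>>>
> >>>>>>>> Optionally, to double-check the result (shown only as an
> >>>>>>>> illustration), you can list the queues and their durability:
> >>>>>>>>
> >>>>>>>>     rabbitmqctl list_queues name durable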
> >>>>>>> This sounds like a great new addition to the Kolla Ansible docs!
> >>>>>>> Could you please propose it as a change?
> >>>>>>>
> >>>>>>> Kindest,
> >>>>>>> Radek
> >>>>>>>
>