[openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down

Satish Patel satish.txt at gmail.com
Mon Apr 24 17:34:53 UTC 2023


Oh wait! What is the relation of RabbitMQ mirrored queues to Ceph/NFS or any
other shared storage backend?

On Mon, Apr 24, 2023 at 2:17 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com>
wrote:

> Hi everyone,
> An update: with controller failover, we don't need to use mirrored queues if
> we use Ceph as the backend. In my case, I use a SAN as the Cinder backend and
> NFS as the Glance backend, so I need mirrored queues. It is quite weird.
> Nguyen Huu Khoi
>
>
> On Mon, Apr 17, 2023 at 5:03 PM Doug Szumski <doug at stackhpc.com> wrote:
>
>> On 13/04/2023 23:04, Satish Patel wrote:
>> > Thanks Sean/Matt,
>> >
>> > It is interesting that my only option is to use classic mirrored queues
>> > with durability :( because without mirroring, the cluster acts up when one
>> > of the nodes is down.
>>
>> If you have planned maintenance and non-mirrored transient queues, you
>> can first try draining the node to be removed, before removing it from
>> the cluster. In my testing at least, this appears to be much more
>> successful than relying on the RabbitMQ clients to do the failover and
>> recreate queues.  See [1], or for RMQ <3.8 you can cobble something
>> together with ha-mode nodes [2].
>>
>> [1] https://www.rabbitmq.com/upgrade.html#maintenance-mode
>>
>> [2] https://www.rabbitmq.com/ha.html#mirroring-arguments
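>>
>> As a minimal sketch of the drain approach (assuming a RabbitMQ release with
>> maintenance mode available; with Kolla Ansible these would be run inside the
>> rabbitmq container, and node names below are placeholders):
>>
>>     rabbitmq-upgrade drain     # on the node being taken down
>>     # ... perform the maintenance ...
>>     rabbitmq-upgrade revive    # bring the node back into service
>>
>> For the older ha-mode "nodes" workaround, a policy pinning mirrors to the
>> surviving nodes looks roughly like:
>>
>>     rabbitmqctl set_policy --apply-to queues ha-surviving "^" \
>>       '{"ha-mode":"nodes","ha-params":["rabbit@ctrl2","rabbit@ctrl3"]}'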
>>
>> This obviously doesn't solve the case where a controller fails
>> unexpectedly.
>>
>> I also think it's worth making the distinction between a highly
>> available messaging infrastructure, and queue mirroring. In many cases,
>> if a RMQ node hosting a non-mirrored, transient queue goes down, it
>> should be possible for a service to just recreate the queue on another
>> node and retry. This often seems to fail, which leads to queue mirroring
>> getting turned on.
>>
>> >
>> > How are people scaling RabbitMQ at large scale?
>> >
>> > Sent from my iPhone
>> >
>> >> On Apr 13, 2023, at 7:50 AM, Sean Mooney <smooney at redhat.com> wrote:
>> >>
>> >> On Thu, 2023-04-13 at 09:07 +0100, Matt Crees wrote:
>> >>> Hi all, I'll reply in turn here:
>> >>>
>> >>> Radek, I agree it definitely will be a good addition to the KA docs.
>> >>> I've got it on my radar, will aim to get a patch proposed this week.
>> >>>
>> >>> Satish, I haven't personally been able to test durable queues on a
>> >>> system that large. According to the RabbitMQ docs
>> >>> (https://www.rabbitmq.com/queues.html#durability), "Throughput and
>> >>> latency of a queue is not affected by whether a queue is durable or
>> >>> not in most cases." However, I have anecdotally heard that it can
>> >>> affect some performance in particularly large systems.
>> >> The performance of durable queues is dominated by disk IO.
>> >> Put RabbitMQ on a PCIe NVMe SSD and it will have little effect;
>> >> use spinning rust (an HDD, even in RAID 10) and the IOPS/throughput of
>> >> the storage used to make the queues durable (which just means writing
>> >> all messages to disk) will be the bottleneck for scalability.
>> >> Combine that with HA and it is worse, as everything has to be written
>> >> to multiple servers. I believe the mirroring implementation waits for
>> >> all copies to be persisted, but I have not really looked into it.
>> >> It has been raised as a pain point with RabbitMQ by operators in the
>> >> past in terms of scaling.
>> >>
>> >>> Please note that if you are using the classic mirrored queues, you
>> >>> must also have them durable. Transient (i.e. non-durable) mirrored
>> >>> queues are not a supported feature and do cause bugs. (For example the
>> >>> "old incarnation" errors seen here:
>> >>> https://bugs.launchpad.net/kolla-ansible/+bug/1954925)
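>> >>>
>> >>> For reference (a sketch; if you deploy with Kolla Ansible these may be
>> >>> templated for you), the oslo.messaging side of this lives under
>> >>> [oslo_messaging_rabbit] in each service's config:
>> >>>
>> >>>     [oslo_messaging_rabbit]
>> >>>     amqp_durable_queues = true
>> >>>     rabbit_ha_queues = true
>> >>>
>> >>> with a matching classic-mirroring policy on the RabbitMQ side, e.g.:
>> >>>
>> >>>     rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode":"all"}'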
>> >>>
>> >>> Nguyen, I can confirm that we've seen the same behaviour. This is
>> >>> caused by a backported change in oslo.messaging (I believe you linked
>> >>> the relevant bug report previously). There is a fix in progress
>> >>> (https://review.opendev.org/c/openstack/oslo.messaging/+/866617),
>> >>> and in the meantime setting kombu_reconnect_delay < 1.0 does resolve
>> >>> it.
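>> >>>
>> >>> As a concrete sketch (the override path is an assumption; with Kolla
>> >>> Ansible one way is a global config override such as
>> >>> /etc/kolla/config/global.conf, which is merged into all services):
>> >>>
>> >>>     [oslo_messaging_rabbit]
>> >>>     kombu_reconnect_delay = 0.5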
>> >>>
>> >>> Cheers,
>> >>> Matt
>> >>>
>> >>>> On Thu, 13 Apr 2023 at 03:04, Nguyễn Hữu Khôi <
>> nguyenhuukhoinw at gmail.com> wrote:
>> >>>>
>> >>>> Update:
>> >>>> I use a SAN as the Cinder backend.
>> >>>> Nguyen Huu Khoi
>> >>>>
>> >>>>
>> >>>> On Thu, Apr 13, 2023 at 9:02 AM Nguyễn Hữu Khôi <
>> nguyenhuukhoinw at gmail.com> wrote:
>> >>>>> Hello guys.
>> >>>>>
>> >>>>> I have done many tests on Xena and Yoga, and I am sure that without
>> >>>>> HA queues and kombu_reconnect_delay=0.5 (it can be any value < 1),
>> >>>>> you cannot launch instances when 1 of 3 controllers is down.
>> >>>>> Can somebody verify what I say? I hope we will have a common
>> >>>>> solution for this problem, because those who use OpenStack for the
>> >>>>> first time will continue to ask questions like this.
>> >>>>>
>> >>>>> Nguyen Huu Khoi
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 13, 2023 at 12:59 AM Satish Patel <satish.txt at gmail.com>
>> wrote:
>> >>>>>> This is great, Matt!
>> >>>>>>
>> >>>>>> Documentation would be greatly appreciated. I have a follow-up
>> >>>>>> question: are durable queues good for large clouds with 1000 compute
>> >>>>>> nodes, or is it better not to use durable queues? This is a private
>> >>>>>> cloud and we don't care about persistent data.
>> >>>>>>
>> >>>>>> On Wed, Apr 12, 2023 at 12:37 PM Radosław Piliszek <
>> radoslaw.piliszek at gmail.com> wrote:
>> >>>>>>> Hi Matt,
>> >>>>>>>
>> >>>>>>>> As you're now reconfiguring a running deployment, there are
>> >>>>>>>> some extra steps that need to be taken to migrate to durable
>> queues.
>> >>>>>>>>
>> >>>>>>>> 1. You will need to stop the OpenStack services which use
>> RabbitMQ.
>> >>>>>>>>
>> >>>>>>>> 2. Reset the state of RabbitMQ on each RabbitMQ node with the following
>> >>>>>>>> commands. You must run each command on all RabbitMQ nodes before
>> >>>>>>>> moving on to the next command. This will remove all queues. (See the
>> >>>>>>>> containerised sketch after step 3.)
>> >>>>>>>>
>> >>>>>>>>     rabbitmqctl stop_app
>> >>>>>>>>     rabbitmqctl force_reset
>> >>>>>>>>     rabbitmqctl start_app
>> >>>>>>>>
>> >>>>>>>> 3. Start the OpenStack services again, at which point they will
>> >>>>>>>> recreate the appropriate queues as durable.
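>> >>>>>>>>
>> >>>>>>>> If RabbitMQ runs in containers (as it does by default with Kolla
>> >>>>>>>> Ansible), step 2 becomes a docker exec on each controller; a
>> >>>>>>>> sketch, assuming the container is named "rabbitmq":
>> >>>>>>>>
>> >>>>>>>>     docker exec rabbitmq rabbitmqctl stop_app      # on every node first
>> >>>>>>>>     docker exec rabbitmq rabbitmqctl force_reset   # then on every node
>> >>>>>>>>     docker exec rabbitmq rabbitmqctl start_app     # finally on every node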
>> >>>>>>> This sounds like a great new addition-to-be to the Kolla Ansible
>> docs!
>> >>>>>>> Could you please propose it as a change?
>> >>>>>>>
>> >>>>>>> Kindest,
>> >>>>>>> Radek
>> >>>>>>>
>>
>