[openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down

Satish Patel satish.txt at gmail.com
Thu Apr 13 22:04:56 UTC 2023


Thank Sean/Matt,

It is interesting that my only option is to use classic mirroring with durable queues :( because without mirroring the cluster acts up when one of the nodes is down.

How are people scaling RabbitMQ at large scale?

Sent from my iPhone

> On Apr 13, 2023, at 7:50 AM, Sean Mooney <smooney at redhat.com> wrote:
> 
> On Thu, 2023-04-13 at 09:07 +0100, Matt Crees wrote:
>> Hi all, I'll reply in turn here:
>> 
>> Radek, I agree it definitely will be a good addition to the KA docs.
>> I've got it on my radar, will aim to get a patch proposed this week.
>> 
>> Satish, I haven't personally been able to test durable queues on a
>> system that large. According to the RabbitMQ docs
>> (https://www.rabbitmq.com/queues.html#durability), "Throughput and
>> latency of a queue is not affected by whether a queue is durable or
>> not in most cases." However, I have anecdotally heard that it can
>> affect some performance in particularly large systems.
> 
> the performance of durable queues is dominated by disk IO.
> put rabbit on a PCIe NVMe SSD and it will have little effect;
> use spinning rust (an HDD, even in RAID 10) and the IOPS/throughput of the
> storage used to make the queue durable (which just means writing all messages to disk)
> will be the bottleneck for scalability.
> combine that with HA and it is worse, as messages have to be written to multiple servers.
> i believe the mirror implementation waits for all copies to be persisted,
> but i have not really looked into it. it was raised as a pain point with rabbit by
> operators in the past in terms of scaling.
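>
> a rough way to gauge whether the storage can keep up (assuming the default
> data directory /var/lib/rabbitmq) is an fsync-heavy fio run, e.g.:
>
>    fio --name=rabbitmq-disk --directory=/var/lib/rabbitmq --rw=randwrite \
>        --bs=4k --size=256m --ioengine=libaio --direct=1 --fsync=1 \
>        --runtime=30 --time_based
>
> spinning disks will typically show orders of magnitude fewer IOPS here than NVMe.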
> 
>> 
>> Please note that if you are using classic mirrored queues, you
>> must also make them durable. Transient (i.e. non-durable) mirrored
>> queues are not a supported feature and do cause bugs. (For example the
>> "old incarnation" errors seen here:
>> https://bugs.launchpad.net/kolla-ansible/+bug/1954925)
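>>
>> For illustration, pairing the two looks something like the following (treat
>> the override path and the policy pattern as assumptions; kolla-ansible
>> manages its own policy definitions):
>>
>>     # e.g. /etc/kolla/config/global.conf, merged into every service's config
>>     [oslo_messaging_rabbit]
>>     amqp_durable_queues = true
>>
>>     # classic queue mirroring policy on the RabbitMQ side (pattern illustrative)
>>     rabbitmqctl set_policy --apply-to queues ha-all '^(?!amq\.).*' '{"ha-mode":"all"}'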
>> 
>> Nguyen, I can confirm that we've seen the same behaviour. This is
>> caused by a backported change in oslo.messaging (I believe you linked
>> the relevant bug report previously). There is a fix in progress
>> (https://review.opendev.org/c/openstack/oslo.messaging/+/866617),
>> and in the meantime setting kombu_reconnect_delay < 1.0 does resolve
>> it.
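>>
>> With kolla-ansible that workaround is an oslo.messaging override along
>> these lines (the global override file is one common place to put it):
>>
>>     # /etc/kolla/config/global.conf
>>     [oslo_messaging_rabbit]
>>     kombu_reconnect_delay = 0.5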
>> 
>> Cheers,
>> Matt
>> 
>>> On Thu, 13 Apr 2023 at 03:04, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>>> 
>>> update:
>>> I use SAN as Cinder backend.
>>> Nguyen Huu Khoi
>>> 
>>> 
>>> On Thu, Apr 13, 2023 at 9:02 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>>>> 
>>>> Hello guys.
>>>> 
>>>> I have done many tests on Xena and Yoga, and I am sure that without HA queues and kombu_reconnect_delay=0.5 (any value < 1 works)
>>>> you cannot launch instances when 1 of 3 controllers is down.
>>>> Can somebody verify what I am saying? I hope we will have a common solution for this problem, because those who use OpenStack for the first time will keep asking questions like this.
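>>>>
>>>> A quick way to check what the queues actually ended up as after such a
>>>> change is, for example:
>>>>
>>>>     rabbitmqctl list_policies
>>>>     rabbitmqctl list_queues name durable policy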
>>>> 
>>>> Nguyen Huu Khoi
>>>> 
>>>> 
>>>> On Thu, Apr 13, 2023 at 12:59 AM Satish Patel <satish.txt at gmail.com> wrote:
>>>>> 
>>>>> This is great, Matt!
>>>>> 
>>>>> Documentation would be greatly appreciated. I have a counter question: are durable queues a good fit for a large cloud with 1000 compute nodes, or is it better not to use them? This is a private cloud and we don't care about persistent data.
>>>>> 
>>>>> On Wed, Apr 12, 2023 at 12:37 PM Radosław Piliszek <radoslaw.piliszek at gmail.com> wrote:
>>>>>> 
>>>>>> Hi Matt,
>>>>>> 
>>>>>>> As you're now reconfiguring a running deployment, there are
>>>>>>> some extra steps that need to be taken to migrate to durable queues.
>>>>>>> 
>>>>>>> 1. You will need to stop the OpenStack services which use RabbitMQ.
>>>>>>> 
>>>>>>> 2. Reset the state of RabbitMQ on each RabbitMQ node with the following
>>>>>>> commands. You must run each command on all RabbitMQ nodes before
>>>>>>> moving on to the next command (see the sketch after these steps). This will remove all queues.
>>>>>>> 
>>>>>>>    rabbitmqctl stop_app
>>>>>>>    rabbitmqctl force_reset
>>>>>>>    rabbitmqctl start_app
>>>>>>> 
>>>>>>> 3. Start the OpenStack services again, at which point they will
>>>>>>> recreate the appropriate queues as durable.
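>>>>>>>
>>>>>>> A rough sketch of that ordering (hostnames are placeholders; in a Kolla
>>>>>>> deployment the rabbitmqctl calls typically go through the rabbitmq
>>>>>>> container, e.g. docker exec rabbitmq rabbitmqctl ...):
>>>>>>>
>>>>>>>    # run each command on every node before moving on to the next command
>>>>>>>    for cmd in stop_app force_reset start_app; do
>>>>>>>        for node in controller1 controller2 controller3; do
>>>>>>>            ssh "$node" "rabbitmqctl $cmd"
>>>>>>>        done
>>>>>>>    done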
>>>>>> 
>>>>>> This sounds like a great new addition-to-be to the Kolla Ansible docs!
>>>>>> Could you please propose it as a change?
>>>>>> 
>>>>>> Kindest,
>>>>>> Radek
>>>>>> 
>> 
> 


