<div dir="ltr">Hi everyone,<div>An update: with controller failover, we don't need mirrored queues if we use Ceph as the backend. In my case, I use a SAN as the Cinder backend and NFS as the Glance backend, so I do need mirrored queues. It is quite weird.<br><div><div><div dir="ltr" data-smartmail="gmail_signature"><div dir="ltr">Nguyen Huu Khoi<br></div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 17, 2023 at 5:03 PM Doug Szumski <<a href="mailto:doug@stackhpc.com" target="_blank">doug@stackhpc.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 13/04/2023 23:04, Satish Patel wrote:<br>

> Thanks Sean/Matt,<br>

><br>

> It is interesting that my only option is to use classic mirrored queues with durability :( because without mirroring the cluster acts up when one of the nodes is down.<br>

<br>

If you have planned maintenance and non-mirrored transient queues, you <br>

can first try draining the node to be removed, before removing it from <br>

the cluster. In my testing at least, this appears to be much more <br>

successful than relying on the RabbitMQ clients to do the failover and <br>

recreate queues.  See [1], or for RMQ <3.8 you can cobble something <br>

together with ha-mode nodes [2].<br>

<br>

[1] <a href="https://www.rabbitmq.com/upgrade.html#maintenance-mode" rel="noreferrer" target="_blank">https://www.rabbitmq.com/upgrade.html#maintenance-mode</a><br>

<br>

[2] <a href="https://www.rabbitmq.com/ha.html#mirroring-arguments" rel="noreferrer" target="_blank">https://www.rabbitmq.com/ha.html#mirroring-arguments</a><br>
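The drain step described above could be scripted roughly as follows. This is only a sketch, assuming RabbitMQ 3.8 or later where `rabbitmq-upgrade drain` and `rabbitmq-upgrade revive` are available; the node name `rabbit@controller1` and the helper names are hypothetical, and the script defaults to a dry run so nothing is executed by accident.

```python
import subprocess

# Maintenance-mode actions offered by the rabbitmq-upgrade CLI (RabbitMQ >= 3.8).
ALLOWED_ACTIONS = ("drain", "revive")


def build_command(node, action):
    """Return the CLI invocation that puts a node into or out of maintenance mode."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # "-n" selects the target node; the node name is supplied by the caller.
    return ["rabbitmq-upgrade", "-n", node, action]


def run_action(node, action, dry_run=True):
    """Run (or, by default, only describe) a maintenance-mode action."""
    command = build_command(node, action)
    if not dry_run:
        # check=True raises CalledProcessError if the CLI reports a failure.
        subprocess.run(command, check=True)
    return " ".join(command)


if __name__ == "__main__":
    # Hypothetical node name, for illustration only.
    print(run_action("rabbit@controller1", "drain"))
    print(run_action("rabbit@controller1", "revive"))
```

Draining first and reviving only after the maintenance is finished mirrors the order suggested above; how the command reaches the node (SSH, a container exec, and so on) is left to the deployment.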

<br>

This obviously doesn't solve the case where a controller fails <br>

unexpectedly.<br>

<br>

I also think it's worth making the distinction between a highly <br>

available messaging infrastructure, and queue mirroring. In many cases, <br>

if an RMQ node hosting a non-mirrored, transient queue goes down, it <br>

should be possible for a service to just recreate the queue on another <br>

node and retry. This often seems to fail, which leads to queue mirroring <br>

getting turned on.<br>
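The "recreate the queue on another node and retry" behaviour can be sketched generically as below. This is not how oslo.messaging or kombu implement it; `declare` is a hypothetical callable standing in for whatever client call declares the queue on a given node, and the retry count and delay are illustrative assumptions.

```python
import time


def declare_with_failover(nodes, declare, retries=3, delay=0.5, sleep=time.sleep):
    """Try each node in turn until one accepts the queue declaration.

    `declare(node)` is a hypothetical callable that declares the queue on
    `node` and raises ConnectionError when that node is unreachable.
    Returns a (node, result) tuple for the first node that succeeds.
    """
    last_error = None
    for attempt in range(retries):
        for node in nodes:
            try:
                return node, declare(node)
            except ConnectionError as exc:
                # Remember the failure and move on to the next node.
                last_error = exc
        if attempt < retries - 1:
            # Back off before walking the node list again.
            sleep(delay)
    raise RuntimeError("no RabbitMQ node accepted the queue declaration") from last_error
```

With a non-mirrored, transient queue there is no state to preserve, so declaring it afresh on a surviving node is all that is needed; the hard part in practice is the client noticing the failure promptly.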

<br>

><br>

> How are people scaling RabbitMQ for large deployments?<br>

><br>

> Sent from my iPhone<br>

><br>

>> On Apr 13, 2023, at 7:50 AM, Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank">smooney@redhat.com</a>> wrote:<br>

>><br>

>> On Thu, 2023-04-13 at 09:07 +0100, Matt Crees wrote:<br>

>>> Hi all, I'll reply in turn here:<br>

>>><br>

>>> Radek, I agree it definitely will be a good addition to the KA docs.<br>

>>> I've got it on my radar, will aim to get a patch proposed this week.<br>

>>><br>

>>> Satish, I haven't personally been able to test durable queues on a<br>

>>> system that large. According to the RabbitMQ docs<br>

>>> (<a href="https://www.rabbitmq.com/queues.html#durability" rel="noreferrer" target="_blank">https://www.rabbitmq.com/queues.html#durability</a>), "Throughput and<br>

>>> latency of a queue is not affected by whether a queue is durable or<br>

>>> not in most cases." However, I have anecdotally heard that it can<br>

>>> affect some performance in particularly large systems.<br>

>> The performance of durable queues is dominated by disk I/O.<br>

>> Put RabbitMQ on a PCIe NVMe SSD and it will have little effect.<br>

>> Use spinning rust (an HDD, even in RAID 10) and the IOPS/throughput of the<br>

>> storage used to make the queue durable (that just means writing all messages to disk)<br>

>> will be the bottleneck for scalability.<br>

>> Combine that with HA and it is worse, as messages have to be written to multiple servers.<br>

>> I believe that the mirror implementation will wait for all copies to be persisted,<br>

>> but I have not really looked into it. It was raised as a pain point with RabbitMQ by<br>

>> operators in the past in terms of scaling.<br>

>><br>

>>> Please note that if you are using the classic mirrored queues, you<br>

>>> must also have them durable. Transient (i.e. non-durable) mirrored<br>

>>> queues are not a supported feature and do cause bugs. (For example the<br>

>>> "old incarnation" errors seen here:<br>

>>> <a href="https://bugs.launchpad.net/kolla-ansible/+bug/1954925" rel="noreferrer" target="_blank">https://bugs.launchpad.net/kolla-ansible/+bug/1954925</a>)<br>

>>><br>

>>> Nguyen, I can confirm that we've seen the same behaviour. This is<br>

>>> caused by a backported change in oslo.messaging (I believe you linked<br>

>>> the relevant bug report previously). A fix for this is in progress<br>

>>> this (<a href="https://review.opendev.org/c/openstack/oslo.messaging/+/866617" rel="noreferrer" target="_blank">https://review.opendev.org/c/openstack/oslo.messaging/+/866617</a>),<br>

>>> and in the meantime setting kombu_reconnect_delay < 1.0 does resolve<br>

>>> it.<br>
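A quick way to check whether a service configuration already carries this workaround might look like the sketch below. It assumes the option lives in the `[oslo_messaging_rabbit]` section and that the default is 1.0 when the option is unset; both should be confirmed against the oslo.messaging release in use. The function name and sample text are hypothetical.

```python
import configparser

# Sample configuration text, for illustration only.
SAMPLE_CONF = """
[DEFAULT]
debug = false

[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.5
"""


def reconnect_delay_ok(conf_text, limit=1.0):
    """Return True if kombu_reconnect_delay is set below `limit`."""
    # interpolation=None keeps literal '%' characters in other options harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # Assumed default of 1.0 applies when the section or option is missing.
    delay = parser.getfloat(
        "oslo_messaging_rabbit", "kombu_reconnect_delay", fallback=1.0
    )
    return delay < limit


if __name__ == "__main__":
    print(reconnect_delay_ok(SAMPLE_CONF))
```

The same check can be pointed at the contents of a real service configuration file by reading it into a string first.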

>>><br>

>>> Cheers,<br>

>>> Matt<br>

>>><br>

>>>> On Thu, 13 Apr 2023 at 03:04, Nguyễn Hữu Khôi <<a href="mailto:nguyenhuukhoinw@gmail.com" target="_blank">nguyenhuukhoinw@gmail.com</a>> wrote:<br>

>>>><br>

>>>> update:<br>

>>>> I use SAN as Cinder backend.<br>

>>>> Nguyen Huu Khoi<br>

>>>><br>

>>>><br>

>>>> On Thu, Apr 13, 2023 at 9:02 AM Nguyễn Hữu Khôi <<a href="mailto:nguyenhuukhoinw@gmail.com" target="_blank">nguyenhuukhoinw@gmail.com</a>> wrote:<br>

>>>>> Hello guys.<br>

>>>>><br>

>>>>> I ran many tests on Xena and Yoga, and I am sure that without HA queues and kombu_reconnect_delay=0.5 (any value < 1 works)<br>

>>>>> you cannot launch instances when 1 of 3 controllers is down.<br>

>>>>> Could somebody verify this? I hope we can find a common solution for this problem, because people who use OpenStack for the first time will keep asking questions like this.<br>

>>>>><br>

>>>>> Nguyen Huu Khoi<br>

>>>>><br>

>>>>><br>

>>>>> On Thu, Apr 13, 2023 at 12:59 AM Satish Patel <<a href="mailto:satish.txt@gmail.com" target="_blank">satish.txt@gmail.com</a>> wrote:<br>

>>>>>> This is great! Matt,<br>

>>>>>><br>

>>>>>> Documentation would be greatly appreciated. I have a follow-up question: are durable queues a good fit for large clouds with 1000 compute nodes, or is it better not to use them? This is a private cloud and we don't care about persistent data.<br>

>>>>>><br>

>>>>>> On Wed, Apr 12, 2023 at 12:37 PM Radosław Piliszek <<a href="mailto:radoslaw.piliszek@gmail.com" target="_blank">radoslaw.piliszek@gmail.com</a>> wrote:<br>

>>>>>>> Hi Matt,<br>

>>>>>>><br>

>>>>>>>> As you're now reconfiguring a running deployment, there are<br>

>>>>>>>> some extra steps that need to be taken to migrate to durable queues.<br>

>>>>>>>><br>

>>>>>>>> 1. You will need to stop the OpenStack services which use RabbitMQ.<br>

>>>>>>>><br>

>>>>>>>> 2. Reset the state of RabbitMQ on each RabbitMQ node with the following<br>

>>>>>>>> commands. You must run each command on all RabbitMQ nodes before<br>

>>>>>>>> moving on to the next command. This will remove all queues.<br>

>>>>>>>><br>

>>>>>>>>     rabbitmqctl stop_app<br>

>>>>>>>>     rabbitmqctl force_reset<br>

>>>>>>>>     rabbitmqctl start_app<br>

>>>>>>>><br>

>>>>>>>> 3. Start the OpenStack services again, at which point they will<br>

>>>>>>>> recreate the appropriate queues as durable.<br>
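The ordering constraint in step 2 (run each command on every node before moving on to the next command) can be made explicit with a small planning helper. This is a sketch only: the node names are hypothetical, and it just produces the ordered plan rather than executing anything, since how the commands reach each node depends on the deployment.

```python
# The three rabbitmqctl subcommands from step 2, in the order given.
RESET_COMMANDS = ("stop_app", "force_reset", "start_app")


def reset_plan(nodes):
    """Return (node, command) pairs ordered command-by-command.

    Every node receives `stop_app` before any node receives `force_reset`,
    and so on, matching the instruction to finish one command on all
    RabbitMQ nodes before starting the next.
    """
    return [
        (node, ["rabbitmqctl", command])
        for command in RESET_COMMANDS  # outer loop: one command at a time
        for node in nodes              # inner loop: across all nodes
    ]


if __name__ == "__main__":
    # Hypothetical controller names, for illustration only.
    for node, command in reset_plan(["controller1", "controller2", "controller3"]):
        print(f"{node}: {' '.join(command)}")
```

Keeping the command loop on the outside is the whole point: looping node-by-node instead would reset one node fully while the others are still running, which is not what the steps describe. Remember that `force_reset` removes all queues, so the OpenStack services should already be stopped as in step 1.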

>>>>>>> This sounds like a great new addition-to-be to the Kolla Ansible docs!<br>

>>>>>>> Could you please propose it as a change?<br>

>>>>>>><br>

>>>>>>> Kindest,<br>

>>>>>>> Radek<br>

>>>>>>><br>

</blockquote></div>