Oh wait! What is the relation between RabbitMQ mirrored queues and Ceph/NFS or any other shared storage backend? On Mon, Apr 24, 2023 at 2:17 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
Hi everyone, an update on this. With a failover controller, we don't need to use mirrored queues if we use Ceph as the backend. In my case, I use a SAN as the Cinder backend and NFS as the Glance backend, so I do need mirrored queues. It is quite weird. Nguyen Huu Khoi
On Mon, Apr 17, 2023 at 5:03 PM Doug Szumski <doug@stackhpc.com> wrote:
On 13/04/2023 23:04, Satish Patel wrote:
Thanks Sean/Matt,
It is interesting that my only option is to use classic mirrored queues with durability :( because without mirroring the cluster acts up when one of the nodes is down.
If you have planned maintenance and non-mirrored transient queues, you can first try draining the node to be removed, before removing it from the cluster. In my testing at least, this appears to be much more successful than relying on the RabbitMQ clients to do the failover and recreate queues. See [1], or for RMQ <3.8 you can cobble something together with ha-mode nodes [2].
[1] https://www.rabbitmq.com/upgrade.html#maintenance-mode
[2] https://www.rabbitmq.com/ha.html#mirroring-arguments
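For anyone wanting to try the draining approach: on RabbitMQ 3.8+ with maintenance mode available, it looks roughly like this (a sketch based on [1]; verify against the docs for your version):

    # On the node about to go down: transfers queue leadership away from
    # this node and closes client connections on it.
    rabbitmq-upgrade drain

    # ...perform the maintenance, or remove the node from the cluster...

    # If the node stays in the cluster, bring it back into service.
    rabbitmq-upgrade revive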
This obviously doesn't solve the case where a controller fails unexpectedly.
I also think it's worth making the distinction between a highly available messaging infrastructure, and queue mirroring. In many cases, if a RMQ node hosting a non-mirrored, transient queue goes down, it should be possible for a service to just recreate the queue on another node and retry. This often seems to fail, which leads to queue mirroring getting turned on.
How are people scaling RabbitMQ for large-scale deployments?
On Apr 13, 2023, at 7:50 AM, Sean Mooney <smooney@redhat.com> wrote:
Hi all, I'll reply in turn here:
Radek, I agree it definitely will be a good addition to the KA docs. I've got it on my radar, will aim to get a patch proposed this week.
Satish, I haven't personally been able to test durable queues on a system that large. According to the RabbitMQ docs (https://www.rabbitmq.com/queues.html#durability), "Throughput and latency of a queue is not affected by whether a queue is durable or not in most cases." However, I have anecdotally heard that it can affect performance in particularly large systems.
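As an aside, you can check how your queues are currently declared with rabbitmqctl (default vhost assumed; add -p <vhost> otherwise):

    # List each queue with its durability flag and current message count.
    rabbitmqctl list_queues name durable messages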
On Thu, 2023-04-13 at 09:07 +0100, Matt Crees wrote:
The performance of durable queues is dominated by disk IO. Put Rabbit on a PCIe NVMe SSD and it will have little effect; use spinning rust (an HDD, even in RAID 10) and the IOPS/throughput of the storage used to make the queue durable (which just means writing all messages to disk) will be the bottleneck for scalability. Combine that with HA and it gets worse, as everything has to be written to multiple servers. I believe the mirror implementation will wait for all copies to be persisted, but I have not really looked into it. It was raised as a pain point with Rabbit by operators in the past in terms of scaling.
Please note that if you are using classic mirrored queues, they must also be durable. Transient (i.e. non-durable) mirrored queues are not a supported feature and do cause bugs (for example, the "old incarnation" errors seen here: https://bugs.launchpad.net/kolla-ansible/+bug/1954925).
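For context, classic mirroring itself is enabled server-side by a policy, roughly like the sketch below (the pattern and ha-mode here are illustrative and must match your deployment; the clients must additionally declare the queues as durable, e.g. via amqp_durable_queues in oslo.messaging):

    # Mirror all queues except auto-generated amq.* ones across all nodes.
    rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode":"all"}' --apply-to queues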
Nguyen, I can confirm that we've seen the same behaviour. This is caused by a backported change in oslo.messaging (I believe you linked the relevant bug report previously). A fix is in progress (https://review.opendev.org/c/openstack/oslo.messaging/+/866617), and in the meantime setting kombu_reconnect_delay < 1.0 does resolve it.
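For anyone applying the interim workaround, the option goes in the [oslo_messaging_rabbit] section of each affected service's config file (nova.conf shown purely as an example):

    [oslo_messaging_rabbit]
    kombu_reconnect_delay = 0.5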
Cheers, Matt
On Thu, 13 Apr 2023 at 03:04, Nguyễn Hữu Khôi < nguyenhuukhoinw@gmail.com> wrote:
Update: I use a SAN as the Cinder backend. Nguyen Huu Khoi
On Thu, Apr 13, 2023 at 9:02 AM Nguyễn Hữu Khôi <nguyenhuukhoinw@gmail.com> wrote:
> Hello guys.
>
> I have done many tests on Xena and Yoga, and I am sure that without HA queues and kombu_reconnect_delay=0.5 (it can be anything < 1) you cannot launch instances when 1 of 3 controllers is down.
> Can somebody verify what I am describing? I hope we will have a common solution for this problem, because those who use OpenStack for the first time will continue to ask questions like this.
>
> Nguyen Huu Khoi
>
> On Thu, Apr 13, 2023 at 12:59 AM Satish Patel <satish.txt@gmail.com> wrote:
>> This is great, Matt!
>>
>> Documentation would be greatly appreciated. I have a counter question: are durable queues good for large clouds with 1000 compute nodes, or is it better not to use them? This is a private cloud and we don't care about persistent data.
>>
>> On Wed, Apr 12, 2023 at 12:37 PM Radosław Piliszek <radoslaw.piliszek@gmail.com> wrote:
>>> Hi Matt,
>>>
>>>> As you're now reconfiguring a running deployment, there are
>>>> some extra steps that need to be taken to migrate to durable queues.
>>>>
>>>> 1. You will need to stop the OpenStack services which use RabbitMQ.
>>>>
>>>> 2. Reset the state of RabbitMQ on each rabbit node with the following
>>>> commands. You must run each command on all RabbitMQ nodes before
>>>> moving on to the next command. This will remove all queues.
>>>>
>>>> rabbitmqctl stop_app
>>>> rabbitmqctl force_reset
>>>> rabbitmqctl start_app
>>>>
>>>> 3. Start the OpenStack services again, at which point they will
>>>> recreate the appropriate queues as durable.
>>>
>>> This sounds like a great new addition-to-be to the Kolla Ansible docs!
>>> Could you please propose it as a change?
>>>
>>> Kindest,
>>> Radek
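A minimal sketch of driving step 2 of the migration above across a three-node cluster (controller1/2/3 are hypothetical hostnames; the loop order matters, because each command must finish on every node before the next command starts anywhere):

    for cmd in stop_app force_reset start_app; do
        for host in controller1 controller2 controller3; do
            # Run the current command on every node before moving on
            # to the next command, as the steps above require.
            ssh "$host" "rabbitmqctl $cmd"
        done
    done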