[openstack][sharing][kolla ansible] Problems when 1 of 3 controllers is down

Matt Crees mattc at stackhpc.com
Thu Apr 13 08:07:19 UTC 2023


Hi all, I'll reply in turn here:

Radek, I agree this would be a good addition to the Kolla Ansible
docs. It's on my radar, and I'll aim to propose a patch this week.
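
In case it helps anyone before a docs patch lands, here is a rough
sketch of the ordering in step 2 of the procedure quoted further down:
each rabbitmqctl command runs on every node before the next command
starts anywhere. The node names, the RUN_ON_NODE wrapper, and the
"docker exec rabbitmq" prefix (assuming a containerised deployment with
a container named rabbitmq) are my assumptions for illustration, not
something to copy blindly.

```shell
#!/bin/sh
# Sketch only. RABBIT_NODES and RUN_ON_NODE are hypothetical placeholders.
# RUN_ON_NODE defaults to "echo", so by default this only prints the
# commands it would run (a dry run). Set RUN_ON_NODE=ssh to execute them.
RABBIT_NODES="${RABBIT_NODES:-controller1 controller2 controller3}"
RUN_ON_NODE="${RUN_ON_NODE:-echo}"

reset_rabbitmq() {
    # Outer loop over commands, inner loop over nodes: a command finishes
    # on all nodes before the next command is started on any node.
    for cmd in stop_app force_reset start_app; do
        for node in $RABBIT_NODES; do
            $RUN_ON_NODE "$node" docker exec rabbitmq rabbitmqctl "$cmd" || return 1
        done
    done
}
```

Calling reset_rabbitmq as-is prints the six commands in order. Only run
it for real after the OpenStack services that use RabbitMQ have been
stopped (step 1), since force_reset removes all queues.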

Satish, I haven't personally been able to test durable queues on a
system that large. According to the RabbitMQ docs
(https://www.rabbitmq.com/queues.html#durability), "Throughput and
latency of a queue is not affected by whether a queue is durable or
not in most cases." However, I have heard anecdotally that durability
can affect performance in particularly large systems.

Please note that if you are using classic mirrored queues, they must
also be durable. Transient (i.e. non-durable) mirrored queues are not
a supported feature and do cause bugs, for example the "old
incarnation" errors seen here:
https://bugs.launchpad.net/kolla-ansible/+bug/1954925

Nguyen, I can confirm that we've seen the same behaviour. This is
caused by a backported change in oslo.messaging (I believe you linked
the relevant bug report previously). A fix is in progress
(https://review.opendev.org/c/openstack/oslo.messaging/+/866617), and
in the meantime setting kombu_reconnect_delay to a value below 1.0
does resolve it.
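
As a minimal sketch of that workaround: the option belongs in the
[oslo_messaging_rabbit] section of each service's configuration. With
Kolla Ansible, one way to apply it to all services is a global.conf
override in the custom config directory. The /etc/kolla/config default
path is my assumption about your setup, the function name is made up
for this example, and 0.5 is simply the value Nguyen tested.

```shell
#!/bin/sh
# Sketch only: append the override to a Kolla Ansible global.conf.
# write_reconnect_override is a hypothetical helper; its first argument
# is the custom config directory (assumed default: /etc/kolla/config).
write_reconnect_override() {
    config_dir="${1:-/etc/kolla/config}"
    mkdir -p "$config_dir" || return 1
    # Appends rather than overwrites; check for an existing
    # [oslo_messaging_rabbit] section first to avoid duplicating it.
    cat >> "$config_dir/global.conf" <<'EOF'
[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.5
EOF
}
```

After writing the override, the services need to pick it up, typically
via "kolla-ansible -i <inventory> reconfigure".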

Cheers,
Matt

On Thu, 13 Apr 2023 at 03:04, Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>
> update:
> I use SAN as Cinder backend.
> Nguyen Huu Khoi
>
>
> On Thu, Apr 13, 2023 at 9:02 AM Nguyễn Hữu Khôi <nguyenhuukhoinw at gmail.com> wrote:
>>
>> Hello guys.
>>
>> I have run many tests on Xena and Yoga, and I am now sure that without ha-queue and kombu_reconnect_delay=0.5 (any value < 1 works)
>> you cannot launch instances when 1 of 3 controllers is down.
>> Can somebody verify this? I hope we can agree on a common solution for this problem, because people using OpenStack for the first time will keep asking questions like this.
>>
>> Nguyen Huu Khoi
>>
>>
>> On Thu, Apr 13, 2023 at 12:59 AM Satish Patel <satish.txt at gmail.com> wrote:
>>>
>>> This is great, Matt!
>>>
>>> Documentation would be greatly appreciated. I have a counter question: are durable queues a good choice for large clouds with 1000 compute nodes, or is it better not to use them? This is a private cloud and we don't care about persistent data.
>>>
>>> On Wed, Apr 12, 2023 at 12:37 PM Radosław Piliszek <radoslaw.piliszek at gmail.com> wrote:
>>>>
>>>> Hi Matt,
>>>>
>>>> > As you're now reconfiguring a running deployment, there are
>>>> > some extra steps that need to be taken to migrate to durable queues.
>>>> >
>>>> > 1. You will need to stop the OpenStack services which use RabbitMQ.
>>>> >
>>>> > 2. Reset the state of RabbitMQ on each RabbitMQ node with the following
>>>> > commands. You must run each command on all RabbitMQ nodes before
>>>> > moving on to the next command. This will remove all queues.
>>>> >
>>>> >     rabbitmqctl stop_app
>>>> >     rabbitmqctl force_reset
>>>> >     rabbitmqctl start_app
>>>> >
>>>> > 3. Start the OpenStack services again, at which point they will
>>>> > recreate the appropriate queues as durable.
>>>>
>>>> This sounds like a great new addition-to-be to the Kolla Ansible docs!
>>>> Could you please propose it as a change?
>>>>
>>>> Kindest,
>>>> Radek
>>>>


