New subject: [kolla][oslo] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster

27 Nov 2024

      For everyone’s reference

From: Ishan Shanware (ishanwar) <ishanwar@cisco.com>
Date: Wednesday, 27 November 2024 at 10:01 PM
To: Sven Kieske <kieske@osism.tech>
Subject: Re: [kolla][oslo] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster
Hi Sven,

Thanks for getting back to me. We have configured kolla to use 3 controllers in each cluster. Each controller has one RabbitMQ container created by kolla.

The process we follow to move a controller out of rotation is as follows:

1. We first fail over all the L3 and DHCP agents running on the controller to the 2 other controllers.
2. We then turn off the server and delete any neutron agents on it.
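
For readers who want to reproduce the two steps above, here is a minimal sketch that only builds the `openstack network agent` CLI commands as strings, so the plan can be reviewed before anything is run. The helper name `drain_plan`, the round-robin placement, and the remove-then-add ordering are assumptions made for illustration; they are not taken from the procedure described in this thread.

```python
from itertools import cycle


def drain_plan(old_agent, hosted, targets, kind="router"):
    """Build CLI commands that move resources off one neutron agent.

    old_agent: ID of the agent on the controller being removed.
    hosted:    router IDs (kind="router") or network IDs (kind="network").
    targets:   IDs of the agents on the remaining controllers.
    """
    if not targets:
        raise ValueError("at least one target agent is required")
    flag = {"router": "--l3", "network": "--dhcp"}[kind]
    commands = []
    # Spread the resources over the remaining agents round-robin.
    for resource, target in zip(hosted, cycle(targets)):
        commands.append(
            f"openstack network agent remove {kind} {flag} {old_agent} {resource}"
        )
        commands.append(
            f"openstack network agent add {kind} {flag} {target} {resource}"
        )
    return commands


# Hypothetical IDs, for illustration only.
for line in drain_plan("l3-old", ["r1", "r2", "r3"], ["l3-a", "l3-b"]):
    print(line)
```

Once the server is powered off, each leftover agent record can be removed with `openstack network agent delete <agent-id>`.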

Currently the RabbitMQ cluster consists of only 2 containers. The problem occurs during the reinstall of the RabbitMQ container in the kolla deploy stage, specifically after it executes the "checking rabbitmq containers" role, which triggers a restart of the containers on all the blades.

After this stage completes, we observe that the masakari-engine disables all the nova-computes. Let me know if you need any other information for the discussion.

Thanks
Ishan

From: Sven Kieske <kieske@osism.tech>
Date: Tuesday, 26 November 2024 at 11:36 PM
To: Ishan Shanware (ishanwar) <ishanwar@cisco.com>, openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [kolla][oslo] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster
Hi Ishan,

my first question would be: how many controller nodes, and specifically
RabbitMQ nodes, are you running inside your OpenStack cluster?

You should be following the general guidelines for running any
Raft-consensus-based distributed system and only run an odd number of
systems, e.g. 3 or 5 control nodes.

Can you confirm that this is the case in your setup?
If you run, e.g., a two-node setup, such errors are indeed expected.
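
As a rough illustration of why an odd node count is recommended, the sketch below works through generic majority arithmetic. It is not RabbitMQ-specific, and the function names are made up for this example.

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a cluster with `nodes` members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Members that can be lost while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in (2, 3, 4, 5):
    print(
        f"{n} nodes: quorum={quorum_size(n)}, "
        f"tolerated failures={tolerated_failures(n)}"
    )
```

A two-node cluster needs both members for a majority, so it tolerates no failures, and a fourth node tolerates no more failures than three do.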

See also our production architecture guide:

https://docs.openstack.org/kolla-ansible/latest/admin/production-architectur...
...
Control - Cloud controller nodes which host control services like
APIs and databases. This group should have odd number of nodes for
quorum.

If you are running an odd number of control nodes and you're still
facing this issue, I would be curious to know the rabbitmq cluster
state before you add or remove a node, because this should
theoretically just work. But maybe there is another issue with your
cluster.
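
One possible way to capture the cluster state before and after a node change is sketched below. It assumes the container is named `rabbitmq`, that `docker` is the container runtime, and that the JSON report from `rabbitmqctl cluster_status --formatter json` contains the keys `disk_nodes`, `ram_nodes`, `running_nodes` and `partitions`; please verify these against your RabbitMQ version. The helper names are hypothetical.

```python
import json
import subprocess


def fetch_cluster_status(container: str = "rabbitmq") -> dict:
    """Run rabbitmqctl inside the container and return its JSON report."""
    result = subprocess.run(
        ["docker", "exec", container,
         "rabbitmqctl", "cluster_status", "--formatter", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def summarize(status: dict) -> dict:
    """Reduce the report to the facts relevant to a node change."""
    # Key names below are assumptions about the report layout.
    known = set(status.get("disk_nodes", [])) | set(status.get("ram_nodes", []))
    running = set(status.get("running_nodes", []))
    return {
        "known": sorted(known),
        "down": sorted(known - running),
        "partitioned": bool(status.get("partitions")),
        "has_majority": len(running) > len(known) // 2,
    }
```

Running `summarize(fetch_cluster_status())` on a controller before and after the change would give two snapshots to compare.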

HTH

--
Sven Kieske
Senior Cloud Engineer

Mail: kieske@osism.tech
Web: https://osism.tech

OSISM GmbH / Talweg 8 / 75417 Mühlacker / Germany

Managing Director: Christian Berendt
Registered office: Mühlacker
District court: Stuttgart, HRB 756139
