[kolla] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster
Hi Kolla Community, I would like to understand the current guidelines for adding or removing a controller node from an OpenStack cluster. With the introduction of quorum queues, we’ve observed that adding a new controller node causes a loss of quorum within the cluster. Consequently, during the reinstallation of a controller node, Masakari disables all the nova-compute services. Is this behaviour expected with quorum queues, or could this be addressed differently? Additionally, I’ve noticed that as soon as the Kolla containers are restarted in the cluster, clients are unable to connect, causing the cluster to go down. Could you please share insights or recommendations on how to handle this scenario effectively? Looking forward to hearing your thoughts. Thanks, Ishan Cisco Systems
Hi Stackers, Requesting your assistance with this issue. Thanks in advance, Ishan From: Ishan Shanware (ishanwar) <ishanwar@cisco.com> Date: Thursday, 21 November 2024 at 10:26 PM To: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org> Subject: [kolla] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster
Hi Ishan, my first question would be: how many controller nodes, and specifically how many RabbitMQ nodes, are you running in your OpenStack cluster? You should follow the general guideline for any Raft-consensus-based distributed system and run only an odd number of nodes, e.g. 3 or 5 control nodes. Can you confirm that this is the case in your setup? If you run, for example, a two-node setup, such errors are indeed expected. See also our production architecture guide: https://docs.openstack.org/kolla-ansible/latest/admin/production-architectur...
Control - Cloud controller nodes which host control services like APIs and databases. This group should have odd number of nodes for quorum.
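To illustrate why the guide asks for an odd number of nodes, here is a minimal sketch of the majority arithmetic behind Raft-style quorum. It is not taken from Kolla or RabbitMQ; the function names are hypothetical, and it assumes every node in the cluster is a member of the quorum queue in question.

```python
# Illustrative only: majority arithmetic for a Raft-style quorum.
# Assumption: all `nodes` are voting members of the same quorum queue.

def quorum_size(nodes: int) -> int:
    """Smallest strict majority of `nodes` members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many members can be offline while a majority remains."""
    return nodes - quorum_size(nodes)


def has_quorum(nodes: int, online: int) -> bool:
    """True if `online` members out of `nodes` still form a majority."""
    return online >= quorum_size(nodes)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"nodes={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under these assumptions a two-node setup tolerates no failures at all, so taking one controller down for reinstallation loses quorum, while a three-node setup tolerates one. Going from three to four members raises the required majority from two to three without raising the number of tolerated failures, which is why even member counts bring no benefit.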
If you are running an odd number of control nodes and are still facing this issue, I would be curious to know the RabbitMQ cluster state before you add or remove a node, because in theory this should just work. But maybe there is another issue with your cluster. HTH -- Sven Kieske Senior Cloud Engineer Mail: kieske@osism.tech Web: https://osism.tech OSISM GmbH / Talweg 8 / 75417 Mühlacker / Deutschland Geschäftsführer: Christian Berendt Unternehmenssitz: Mühlacker Amtsgericht: Stuttgart, HRB 756139
Participants (2): Ishan Shanware (ishanwar), Sven Kieske