Re: [kolla][oslo] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster
For everyone’s reference

From: Ishan Shanware (ishanwar) <ishanwar@cisco.com>
Date: Wednesday, 27 November 2024 at 10:01 PM
To: Sven Kieske <kieske@osism.tech>
Subject: Re: [kolla][oslo] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster

Hi Sven,

Thanks for getting back to me.

We have configured kolla to use 3 controllers in each cluster. Each controller has one rabbitmq container created by kolla. The process we follow to move a controller out of rotation is as follows:

1. We first fail over all the l3 and dhcp agents running on the controller to the 2 other controllers.
2. We then turn off the server and delete any neutron agents on it.

Currently the rabbitmq cluster consists of only 2 containers. The problem occurs during the reinstall of the rabbitmq container in the kolla deploy stage, specifically after it executes the checking rabbitmq containers role. It triggers a restart of the containers on all the blades. After this stage completes, we observe that the masakari-engine disables all the nova-computes.

Let me know if you need any other information for discussion.

Thanks,
Ishan

From: Sven Kieske <kieske@osism.tech>
Date: Tuesday, 26 November 2024 at 11:36 PM
To: Ishan Shanware (ishanwar) <ishanwar@cisco.com>, openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [kolla][oslo] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster

Hi Ishan,

my first question would be: how many controller nodes, and specifically rabbitmq nodes, are you running inside your OpenStack cluster?

You should be following the general guidelines for running any Raft consensus based distributed system and only run an odd number of systems, e.g. 3 or 5 control nodes. Can you confirm that this is the case in your setup? If you run e.g. a two-node setup, such errors are indeed expected.

See also our production architecture guide: https://docs.openstack.org/kolla-ansible/latest/admin/production-architectur...
Control - Cloud controller nodes which host control services like APIs and databases. This group should have odd number of nodes for quorum.
If you are running an odd number of control nodes and you're still facing this issue, I would be curious to know the rabbitmq cluster state before you add or remove a node, because this should theoretically just work. But maybe there is another issue with your cluster.

HTH

--
Sven Kieske
Senior Cloud Engineer
Mail: kieske@osism.tech
Web: https://osism.tech
OSISM GmbH / Talweg 8 / 75417 Mühlacker / Deutschland
Geschäftsführer: Christian Berendt
Unternehmenssitz: Mühlacker
Amtsgericht: Stuttgart, HRB 756139
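
To make the odd-number guideline above concrete, here is a minimal sketch of the majority arithmetic behind it. It assumes a plain majority quorum, as used by Raft-based systems (RabbitMQ quorum queues are one example). The function names are illustrative only and are not part of kolla-ansible or RabbitMQ.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be down while a majority is still reachable."""
    return members - quorum_size(members)


def has_quorum(members: int, running: int) -> bool:
    """True if `running` out of `members` nodes still form a majority."""
    return running >= quorum_size(members)


if __name__ == "__main__":
    # Compare cluster sizes: an even size adds a node without adding tolerance.
    for n in range(1, 6):
        print(
            f"{n} members: quorum={quorum_size(n)}, "
            f"tolerated failures={tolerated_failures(n)}"
        )
```

Under this assumption a 2-member cluster tolerates no failures, while 3 members tolerate one. With 3 members and one controller out of rotation, the 2 remaining nodes still form a majority, but taking either of them down at the same time drops the cluster below it.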
Hi Sven,

Thanks again for following up on my message. If you need any other information on the issue for our discussion, please let me know.

Thanks,
Ishan

From: Ishan Shanware (ishanwar) <ishanwar@cisco.com>
Date: Wednesday, 27 November 2024 at 10:03 PM
To: Ishan Shanware (ishanwar) <ishanwar@cisco.com>
Subject: Re: [kolla][oslo] RabbitMQ issue Adding/Removing a Controller Node in an OpenStack Cluster

For everyone’s reference