I’ve created a Heat stack and installed OpenStack Train to test the CentOS 7 -> CentOS 8 upgrade, following the document here:

 

https://docs.openstack.org/kolla-ansible/train/user/centos8.html#migrating-from-centos-7-to-centos-8

 

I used the instructions here to remove control0 and replace it with a CentOS 8 box:

 

https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html#removing-existing-controllers
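For reference, the flow I followed from that page was roughly the following (paraphrasing the doc, which has the full list of steps; "control0-replace" is just the inventory name I gave the new CentOS 8 host):

# stop services on the old CentOS 7 control0, then remove it from the multinode inventory
kolla-ansible -i ../multinode stop --yes-i-really-really-mean-it --limit control0
# add the new CentOS 8 host to the inventory, then bootstrap, pull and deploy it
kolla-ansible -i ../multinode bootstrap-servers --limit control0-replace
kolla-ansible -i ../multinode pull --limit control0-replace
kolla-ansible -i ../multinode deploy --limit control0-replace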

 

After this my RMQ admin page shows all three nodes up, including the new control0 (control-0-replace). The cluster name is rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com:

 

(rabbitmq)[root@chrnc-void-testupgrade-control-2 /]# rabbitmqctl cluster_status
Cluster status of node rabbit@chrnc-void-testupgrade-control-2 ...
[{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
                'rabbit@chrnc-void-testupgrade-control-0-replace',
                'rabbit@chrnc-void-testupgrade-control-1',
                'rabbit@chrnc-void-testupgrade-control-2']}]},
{running_nodes,['rabbit@chrnc-void-testupgrade-control-0-replace',
                 'rabbit@chrnc-void-testupgrade-control-1',
                 'rabbit@chrnc-void-testupgrade-control-2']},
{cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
{partitions,[]},
{alarms,[{'rabbit@chrnc-void-testupgrade-control-0-replace',[]},
          {'rabbit@chrnc-void-testupgrade-control-1',[]},
          {'rabbit@chrnc-void-testupgrade-control-2',[]}]}]

 

After that I create a new VM to verify that the cluster is still working, then perform the same procedure on control1. When I shut down the services on control1, the Ansible playbook finishes successfully:

 

kolla-ansible -i ../multinode stop --yes-i-really-really-mean-it --limit control1
control1                   : ok=45   changed=22   unreachable=0    failed=0    skipped=105  rescued=0    ignored=0
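(The smoke-test VM above was just a quick boot, something along these lines; the image/flavor/network names here are placeholders from my environment:)

openstack server create --image cirros --flavor m1.tiny --network demo-net rmq-smoke-test
openstack server show rmq-smoke-test -f value -c status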

 

After this my RMQ admin page stops responding. When I check RMQ on the new control0 and on the existing control2, the rabbitmq container is still up but the rabbit app is not running:

 

(rabbitmq)[root@chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
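(The container itself is still up; I'm confirming that and getting the shell above with roughly the following, container name per kolla's defaults:)

docker ps | grep rabbitmq
docker exec -it rabbitmq /bin/bash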

 

If I start the rabbit app (rabbitmqctl start_app) on control0 and control2, the admin page starts working again and the cluster status looks normal:

 

(rabbitmq)[root@chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
Cluster status of node rabbit@chrnc-void-testupgrade-control-0-replace ...
[{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
                'rabbit@chrnc-void-testupgrade-control-0-replace',
                'rabbit@chrnc-void-testupgrade-control-1',
                'rabbit@chrnc-void-testupgrade-control-2']}]},
{running_nodes,['rabbit@chrnc-void-testupgrade-control-2',
                 'rabbit@chrnc-void-testupgrade-control-0-replace']},
{cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
{partitions,[]},
{alarms,[{'rabbit@chrnc-void-testupgrade-control-2',[]},
          {'rabbit@chrnc-void-testupgrade-control-0-replace',[]}]}]

 

But my hypervisors are down:

 

(openstack) [root@chrnc-void-testupgrade-build kolla-ansible]# ohll
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
| ID | Hypervisor Hostname                             | Hypervisor Type | Host IP      | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
|  3 | chrnc-void-testupgrade-compute-2.dev.chtrse.com | QEMU            | 172.16.2.106 | down  |          5 |     8 |           2560 |     30719 |
|  6 | chrnc-void-testupgrade-compute-0.dev.chtrse.com | QEMU            | 172.16.2.31  | down  |          5 |     8 |           2560 |     30719 |
|  9 | chrnc-void-testupgrade-compute-1.dev.chtrse.com | QEMU            | 172.16.0.30  | down  |          5 |     8 |           2560 |     30719 |
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+

 

When I look at the nova-compute.log on a compute node, I see RMQ failures every 10 seconds:

 

172.16.2.31 compute0
2021-03-30 03:07:54.893 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
2021-03-30 03:07:55.905 7 INFO oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] Reconnected to AMQP server on 172.16.1.132:5672 via [amqp] client with port 56422.
2021-03-30 03:08:05.915 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
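(Those lines are from the nova-compute log on compute0, tailed the same way as the RMQ log below; the path assumes kolla's default log location, e.g.:)

tail -f /var/log/kolla/nova/nova-compute.log | grep impl_rabbit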

 

In the RMQ logs I see this every 10 seconds:

 

172.16.1.132 control2
[root@chrnc-void-testupgrade-control-2 ~]# tail -f /var/log/kolla/rabbitmq/rabbit\@chrnc-void-testupgrade-control-2.log |grep 172.16.2.31
2021-03-30 03:07:54.895 [warning] <0.13247.35> closing AMQP connection <0.13247.35> (172.16.2.31:56420 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2021-03-30 03:07:55.901 [info] <0.15288.35> accepting AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672)
2021-03-30 03:07:55.903 [info] <0.15288.35> Connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672) has a client-provided name: nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e
2021-03-30 03:07:55.904 [info] <0.15288.35> connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e): user 'openstack' authenticated and granted access to vhost '/'
2021-03-30 03:08:05.916 [warning] <0.15288.35> closing AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):

 

Why does RMQ fail on the remaining nodes when I shut down the second controller (control1), after the first one was replaced successfully?

 
