[kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller

Braden, Albert C <Albert.Braden@charter.com>
Tue Mar 30 12:40:39 UTC 2021


I've created a Heat stack and installed OpenStack Train to test the CentOS 7 -> 8 upgrade, following the document here:

https://docs.openstack.org/kolla-ansible/train/user/centos8.html#migrating-from-centos-7-to-centos-8

I successfully removed control0 and replaced it with a CentOS 8 box, using the instructions here:

https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html#removing-existing-controllers
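
Roughly the sequence I ran for control0, reconstructed from my shell history (inventory path and host names are from my environment; the authoritative steps are in the two documents above):

```shell
# Stop all services on the controller being replaced (per the docs above),
# then remove it from the inventory, reimage the host on CentOS 8, add it
# back to the inventory, and redeploy it. Reconstructed from memory.
kolla-ansible -i ../multinode stop --yes-i-really-really-mean-it --limit control0
# (edit inventory: remove control0, later re-add it as control0-replace)
kolla-ansible -i ../multinode bootstrap-servers --limit control0-replace
kolla-ansible -i ../multinode pull --limit control0-replace
kolla-ansible -i ../multinode deploy --limit control0-replace
```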

After this my RMQ admin page shows all 3 nodes up, including the new control0. The cluster name is rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com

(rabbitmq)[root@chrnc-void-testupgrade-control-2 /]# rabbitmqctl cluster_status
Cluster status of node rabbit@chrnc-void-testupgrade-control-2 ...
[{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
                'rabbit@chrnc-void-testupgrade-control-0-replace',
                'rabbit@chrnc-void-testupgrade-control-1',
                'rabbit@chrnc-void-testupgrade-control-2']}]},
{running_nodes,['rabbit@chrnc-void-testupgrade-control-0-replace',
                 'rabbit@chrnc-void-testupgrade-control-1',
                 'rabbit@chrnc-void-testupgrade-control-2']},
{cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
{partitions,[]},
{alarms,[{'rabbit@chrnc-void-testupgrade-control-0-replace',[]},
          {'rabbit@chrnc-void-testupgrade-control-1',[]},
          {'rabbit@chrnc-void-testupgrade-control-2',[]}]}]
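
One thing I notice in that output: the old rabbit@chrnc-void-testupgrade-control-0 is still listed as a disc node even though the host is gone. I did not run anything to remove it; if that stale entry matters, I assume it would be dropped with forget_cluster_node (this is a hypothetical cleanup step, not something I ran -- "rabbitmq" is kolla's container name and the node name comes from the status output above):

```shell
# Hypothetical cleanup (NOT run in my test): drop the stale node from the
# cluster metadata, executed on any running member. "rabbitmq" is the kolla
# container name; the node name is from the cluster_status output above.
docker exec rabbitmq rabbitmqctl forget_cluster_node \
    rabbit@chrnc-void-testupgrade-control-0
```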

After that I create a new VM to verify that the cluster is still working, and then perform the same procedure on control1. When I shut down services on control1, the ansible playbook finishes successfully:

kolla-ansible -i ../multinode stop --yes-i-really-really-mean-it --limit control1
...
control1                   : ok=45   changed=22   unreachable=0    failed=0    skipped=105  rescued=0    ignored=0

After this my RMQ admin page stops responding. When I check RMQ on the new control0 and the existing control2, the container is still up but RMQ is not running:

(rabbitmq)[root@chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
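"Starting it" below means running start_app by hand inside the kolla rabbitmq container on each affected node:

```shell
# The beam VM (and container) is still up; only the "rabbit" app has
# stopped. Restart it manually on control0-replace and on control2.
# "rabbitmq" is kolla's default container name.
docker exec rabbitmq rabbitmqctl start_app
```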

If I start it on control0 and control2, the admin page starts responding again and cluster status looks normal:

(rabbitmq)[root@chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
Cluster status of node rabbit@chrnc-void-testupgrade-control-0-replace ...
[{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
                'rabbit@chrnc-void-testupgrade-control-0-replace',
                'rabbit@chrnc-void-testupgrade-control-1',
                'rabbit@chrnc-void-testupgrade-control-2']}]},
{running_nodes,['rabbit@chrnc-void-testupgrade-control-2',
                 'rabbit@chrnc-void-testupgrade-control-0-replace']},
{cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
{partitions,[]},
{alarms,[{'rabbit@chrnc-void-testupgrade-control-2',[]},
          {'rabbit@chrnc-void-testupgrade-control-0-replace',[]}]}]
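
Note that the stale control-0 entry makes this a 4-member cluster with only 2 members running once control1 stops. If the partition handling mode is pause_minority (an assumption on my part -- I have not checked what kolla-ansible Train actually configures), a node pauses the rabbit app unless it can see a strict majority of the cluster. A quick sanity check of that arithmetic:

```shell
# Strict-majority check: a node stays up only if the members it can see
# (including itself) exceed half the cluster size. Integer division works
# here because "visible > size/2" is the strict-majority condition.
majority() { [ "$2" -gt $(( $1 / 2 )) ] && echo yes || echo no; }

majority 4 3   # before stopping control1: 3 of 4 -> yes
majority 4 2   # after stopping control1:  2 of 4 -> no (rabbit pauses)
majority 3 2   # if the stale node were forgotten: 2 of 3 -> yes
```

2 of 4 is not a majority, which would be consistent with the rabbit app stopping on both surviving nodes; with the stale node forgotten, 2 of 3 would be.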

But my hypervisors are down (ohll is an alias for openstack hypervisor list --long):

(openstack) [root@chrnc-void-testupgrade-build kolla-ansible]# ohll
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
| ID | Hypervisor Hostname                             | Hypervisor Type | Host IP      | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
|  3 | chrnc-void-testupgrade-compute-2.dev.chtrse.com | QEMU            | 172.16.2.106 | down  |          5 |     8 |           2560 |     30719 |
|  6 | chrnc-void-testupgrade-compute-0.dev.chtrse.com | QEMU            | 172.16.2.31  | down  |          5 |     8 |           2560 |     30719 |
|  9 | chrnc-void-testupgrade-compute-1.dev.chtrse.com | QEMU            | 172.16.0.30  | down  |          5 |     8 |           2560 |     30719 |
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+

When I look at the nova-compute.log on a compute node, I see RMQ failures every 10 seconds:

172.16.2.31 compute0
2021-03-30 03:07:54.893 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
2021-03-30 03:07:55.905 7 INFO oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] Reconnected to AMQP server on 172.16.1.132:5672 via [amqp] client with port 56422.
2021-03-30 03:08:05.915 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
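
To quantify the loop I just count the errors. Shown here against a pasted sample so the command is self-contained; on the compute node I run the same grep against /var/log/kolla/nova/nova-compute.log:

```shell
# Count the AMQP "unreachable" errors to confirm the reconnect cycle.
# The sample file holds shortened copies of the log lines quoted above;
# on compute0 the real target is /var/log/kolla/nova/nova-compute.log.
cat > /tmp/nova-compute.sample.log <<'EOF'
2021-03-30 03:07:54.893 7 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.
2021-03-30 03:07:55.905 7 INFO oslo.messaging._drivers.impl_rabbit Reconnected to AMQP server on 172.16.1.132:5672 via [amqp] client with port 56422.
2021-03-30 03:08:05.915 7 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.
EOF
grep -c 'is unreachable' /tmp/nova-compute.sample.log   # prints 2
```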

In the RMQ logs I see this every 10 seconds:

172.16.1.132 control2
[root@chrnc-void-testupgrade-control-2 ~]# tail -f /var/log/kolla/rabbitmq/rabbit\@chrnc-void-testupgrade-control-2.log |grep 172.16.2.31
2021-03-30 03:07:54.895 [warning] <0.13247.35> closing AMQP connection <0.13247.35> (172.16.2.31:56420 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2021-03-30 03:07:55.901 [info] <0.15288.35> accepting AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672)
2021-03-30 03:07:55.903 [info] <0.15288.35> Connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672) has a client-provided name: nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e
2021-03-30 03:07:55.904 [info] <0.15288.35> connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e): user 'openstack' authenticated and granted access to vhost '/'
2021-03-30 03:08:05.916 [warning] <0.15288.35> closing AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):

Why does RMQ fail when I shut down the 2nd controller after successfully replacing the first one?
