[kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller

Mark Goddard mark at stackhpc.com
Wed Mar 31 08:14:13 UTC 2021


On Tue, 30 Mar 2021 at 13:41, Braden, Albert
<C-Albert.Braden at charter.com> wrote:
>
> I’ve created a Heat stack and installed OpenStack Train to test the CentOS 7 -> CentOS 8 upgrade, following the document here:
>
>
>
> https://docs.openstack.org/kolla-ansible/train/user/centos8.html#migrating-from-centos-7-to-centos-8
>
>
>
> I used the instructions here to successfully remove control0 and replace it with a CentOS 8 box:
>
>
>
> https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html#removing-existing-controllers
>
>
>
> After this my RMQ admin page shows all 3 nodes up, including the new control0. The name of the cluster is rabbit at chrnc-void-testupgrade-control-0.dev.chtrse.com
>
>
>
> (rabbitmq)[root at chrnc-void-testupgrade-control-2 /]# rabbitmqctl cluster_status
> Cluster status of node rabbit at chrnc-void-testupgrade-control-2 ...
> [{nodes,[{disc,['rabbit at chrnc-void-testupgrade-control-0',
>                 'rabbit at chrnc-void-testupgrade-control-0-replace',
>                 'rabbit at chrnc-void-testupgrade-control-1',
>                 'rabbit at chrnc-void-testupgrade-control-2']}]},
>  {running_nodes,['rabbit at chrnc-void-testupgrade-control-0-replace',
>                  'rabbit at chrnc-void-testupgrade-control-1',
>                  'rabbit at chrnc-void-testupgrade-control-2']},
>  {cluster_name,<<"rabbit at chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
>  {partitions,[]},
>  {alarms,[{'rabbit at chrnc-void-testupgrade-control-0-replace',[]},
>           {'rabbit at chrnc-void-testupgrade-control-1',[]},
>           {'rabbit at chrnc-void-testupgrade-control-2',[]}]}]
>
>
>
> After that I create a new VM to verify that the cluster is still working, and then perform the same procedure on control1. When I shut down services on control1, the Ansible playbook finishes successfully:
>
>
>
> kolla-ansible -i ../multinode stop --yes-i-really-really-mean-it --limit control1
>
> control1                   : ok=45   changed=22   unreachable=0    failed=0    skipped=105  rescued=0    ignored=0
>
>
>
> After this my RMQ admin page stops responding. When I check RMQ on the new control0 and the existing control2, the container is still up but RMQ is not running:
>
>
>
> (rabbitmq)[root at chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
> Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
>
>
>
> If I start it with 'rabbitmqctl start_app' on control0 and control2, the admin page starts responding again and cluster status looks normal:
>
>
>
> (rabbitmq)[root at chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
> Cluster status of node rabbit at chrnc-void-testupgrade-control-0-replace ...
> [{nodes,[{disc,['rabbit at chrnc-void-testupgrade-control-0',
>                 'rabbit at chrnc-void-testupgrade-control-0-replace',
>                 'rabbit at chrnc-void-testupgrade-control-1',
>                 'rabbit at chrnc-void-testupgrade-control-2']}]},
>  {running_nodes,['rabbit at chrnc-void-testupgrade-control-2',
>                  'rabbit at chrnc-void-testupgrade-control-0-replace']},
>  {cluster_name,<<"rabbit at chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
>  {partitions,[]},
>  {alarms,[{'rabbit at chrnc-void-testupgrade-control-2',[]},
>           {'rabbit at chrnc-void-testupgrade-control-0-replace',[]}]}]
>
>
>
> But my hypervisors are down:
>
>
>
> (openstack) [root at chrnc-void-testupgrade-build kolla-ansible]# ohll
> +----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
> | ID | Hypervisor Hostname                             | Hypervisor Type | Host IP      | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
> +----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
> |  3 | chrnc-void-testupgrade-compute-2.dev.chtrse.com | QEMU            | 172.16.2.106 | down  |          5 |     8 |           2560 |     30719 |
> |  6 | chrnc-void-testupgrade-compute-0.dev.chtrse.com | QEMU            | 172.16.2.31  | down  |          5 |     8 |           2560 |     30719 |
> |  9 | chrnc-void-testupgrade-compute-1.dev.chtrse.com | QEMU            | 172.16.0.30  | down  |          5 |     8 |           2560 |     30719 |
> +----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
>
>
>
> When I look at the nova-compute.log on a compute node, I see RMQ failures every 10 seconds:
>
>
>
> 172.16.2.31 compute0
> 2021-03-30 03:07:54.893 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
> 2021-03-30 03:07:55.905 7 INFO oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] Reconnected to AMQP server on 172.16.1.132:5672 via [amqp] client with port 56422.
> 2021-03-30 03:08:05.915 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
>
>
>
> In the RMQ logs I see this every 10 seconds:
>
>
>
> 172.16.1.132 control2
> [root at chrnc-void-testupgrade-control-2 ~]# tail -f /var/log/kolla/rabbitmq/rabbit\@chrnc-void-testupgrade-control-2.log |grep 172.16.2.31
> 2021-03-30 03:07:54.895 [warning] <0.13247.35> closing AMQP connection <0.13247.35> (172.16.2.31:56420 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
> client unexpectedly closed TCP connection
> 2021-03-30 03:07:55.901 [info] <0.15288.35> accepting AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672)
> 2021-03-30 03:07:55.903 [info] <0.15288.35> Connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672) has a client-provided name: nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e
> 2021-03-30 03:07:55.904 [info] <0.15288.35> connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e): user 'openstack' authenticated and granted access to vhost '/'
> 2021-03-30 03:08:05.916 [warning] <0.15288.35> closing AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
>
>
>
> Why does RMQ fail when I shut down the second controller, after successfully replacing the first one?

Hi Albert,

Could you share the RabbitMQ and Erlang versions in both variants of the
container (CentOS 7 and CentOS 8)? When we initially tested this setup, I
think we had 3.7.24 on both sides. Perhaps the CentOS 8 version has moved
on sufficiently to become incompatible?
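For reference, something like the following should print both versions from the deployment host; this is a sketch that assumes the container is named "rabbitmq", as in a default kolla-ansible deployment:

```shell
# Run on each controller. 'rabbitmqctl status' prints the running
# RabbitMQ version ({rabbit,"RabbitMQ","x.y.z"}) and the Erlang
# version ({erlang_version,...}) in its Erlang-term output.
docker exec rabbitmq rabbitmqctl status | grep -E 'RabbitMQ|erlang_version'
```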

Mark


