[kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller

Braden, Albert C-Albert.Braden at charter.com
Mon Apr 5 14:32:48 UTC 2021


It looks like the problem may be caused by incompatible versions of RabbitMQ (RMQ). How can I work around that?

-----Original Message-----
From: Braden, Albert 
Sent: Friday, April 2, 2021 8:34 AM
To: 'openstack-discuss at lists.openstack.org' <openstack-discuss at lists.openstack.org>
Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller

I opened a bug for this issue:

https://bugs.launchpad.net/kolla-ansible/+bug/1922269

-----Original Message-----
From: Braden, Albert 
Sent: Thursday, April 1, 2021 11:34 AM
To: 'openstack-discuss at lists.openstack.org' <openstack-discuss at lists.openstack.org>
Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller

Sorry that was a typo. Stopping RMQ during the removal of the *second* controller is what causes the problem.

Is there a way to tell Centos 8 Train to use RMQ 3.7.24 instead of 3.7.28?
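Thinking out loud: kolla-ansible's rabbitmq role defines per-service image and tag variables, so one possible workaround might be pointing the CentOS 8 deployment at an image that still ships 3.7.24. This is a sketch only; the variable names should be checked against ansible/roles/rabbitmq/defaults/main.yml in the kolla-ansible tree, and the tag here is hypothetical:

```yaml
# /etc/kolla/globals.yml -- hypothetical sketch, not a tested config.
# Verify the variable names against the rabbitmq role defaults.
rabbitmq_image: "{{ docker_registry }}/kolla/centos-source-rabbitmq"
rabbitmq_tag: "train-rmq-3.7.24"   # hypothetical tag of an image built with RMQ 3.7.24
```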

-----Original Message-----
From: Braden, Albert 
Sent: Thursday, April 1, 2021 9:34 AM
To: 'openstack-discuss at lists.openstack.org' <openstack-discuss at lists.openstack.org>
Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller

I did some experimenting and it looks like stopping RMQ during the removal of the first controller is what causes the problem. After deploying the first controller, stopping the RMQ container on any controller, including the new CentOS 8 controller, causes the entire cluster to stop. This crash dump appears on the controllers that stopped in sympathy:

https://paste.ubuntu.com/p/ZDgFgKtQTB/

This appears in the RMQ log:

https://paste.ubuntu.com/p/5D2Qjv3H8c/
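One thing worth checking: since the cluster still lists the old control-0 as a disc node, only 3 of the 4 registered nodes are running, and if partition handling is set to pause_minority, stopping one more node could explain the survivors pausing. A sketch for checking the configured strategy (assumes kolla's default container name `rabbitmq`; `rabbitmqctl eval` runs an Erlang expression on a live node):

```shell
# Print the configured partition-handling strategy on a live node.
# pause_minority would pause any node that finds itself in a cluster minority.
docker exec rabbitmq rabbitmqctl eval \
  'application:get_env(rabbit, cluster_partition_handling).'
```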

-----Original Message-----
From: Braden, Albert 
Sent: Wednesday, March 31, 2021 8:31 AM
To: openstack-discuss at lists.openstack.org
Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller

Centos7:

      {rabbit,"RabbitMQ","3.7.24"},
     "Erlang/OTP 22 [erts-10.7.2.8] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:128] [hipe]\n"},

Centos8:

      {rabbit,"RabbitMQ","3.7.28"},
     "Erlang/OTP 22 [erts-10.7.2.8] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:128] [hipe]\n"},

When I deploy the first Centos8 controller, RMQ comes up with all 3 nodes active and seems to work fine until I shut down the 2nd controller. The only hint of trouble when I replace the 1st node is this error message, which appears the first time I run the deployment:

https://paste.ubuntu.com/p/h9HWdfwmrK/

and the crash dump that appears on control2:

https://paste.ubuntu.com/p/MpZ8SwTJ2T/

First 1500 lines of the dump:

https://paste.ubuntu.com/p/xkCyp2B8j8/

If I wait a few minutes, RMQ recovers on control2 and the 2nd run of the deployment seems to work, and there is no further trouble until I shut down control1.
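Rather than waiting an arbitrary few minutes before the 2nd deployment run, the recovery wait could be scripted; a sketch, assuming kolla's default `rabbitmq` container name (`node_health_check` is the 3.7-era health probe in rabbitmqctl):

```shell
# Poll until the rabbit app on this node passes its health check.
until docker exec rabbitmq rabbitmqctl node_health_check >/dev/null 2>&1; do
    echo "waiting for RMQ to recover..."
    sleep 10
done
echo "RMQ is back; safe to re-run the deployment"
```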

-----Original Message-----
From: Mark Goddard <mark at stackhpc.com> 
Sent: Wednesday, March 31, 2021 4:14 AM
To: Braden, Albert <C-Albert.Braden at charter.com>
Cc: openstack-discuss at lists.openstack.org
Subject: [EXTERNAL] Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller


On Tue, 30 Mar 2021 at 13:41, Braden, Albert
<C-Albert.Braden at charter.com> wrote:
>
> I’ve created a heat stack and installed Openstack Train to test the Centos7->8 upgrade following the document here:
>
>
>
> https://docs.openstack.org/kolla-ansible/train/user/centos8.html#migrating-from-centos-7-to-centos-8
>
>
>
> I used the instructions here to successfully remove and replace control0 with a Centos8 box
>
>
>
> https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html#removing-existing-controllers
>
>
>
> After this my RMQ admin page shows all 3 nodes up, including the new control0. The name of the cluster is rabbit at chrnc-void-testupgrade-control-0.dev.chtrse.com
>
>
>
> (rabbitmq)[root at chrnc-void-testupgrade-control-2 /]# rabbitmqctl cluster_status
>
> Cluster status of node rabbit at chrnc-void-testupgrade-control-2 ...
>
> [{nodes,[{disc,['rabbit at chrnc-void-testupgrade-control-0',
>
>                 'rabbit at chrnc-void-testupgrade-control-0-replace',
>
>                 'rabbit at chrnc-void-testupgrade-control-1',
>
>                 'rabbit at chrnc-void-testupgrade-control-2']}]},
>
> {running_nodes,['rabbit at chrnc-void-testupgrade-control-0-replace',
>
>                  'rabbit at chrnc-void-testupgrade-control-1',
>
>                  'rabbit at chrnc-void-testupgrade-control-2']},
>
> {cluster_name,<<"rabbit at chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
>
> {partitions,[]},
>
> {alarms,[{'rabbit at chrnc-void-testupgrade-control-0-replace',[]},
>
>           {'rabbit at chrnc-void-testupgrade-control-1',[]},
>
>           {'rabbit at chrnc-void-testupgrade-control-2',[]}]}]
>
>
>
> After that I create a new VM to verify that the cluster is still working, and then perform the same procedure on control1. When I shut down services on control1, the ansible playbook finishes successfully:
>
>
>
> kolla-ansible -i ../multinode stop --yes-i-really-really-mean-it --limit control1
>
>
> control1                   : ok=45   changed=22   unreachable=0    failed=0    skipped=105  rescued=0    ignored=0
>
>
>
> After this my RMQ admin page stops responding. When I check RMQ on the new control0 and the existing control2, the container is still up but RMQ is not running:
>
>
>
> (rabbitmq)[root at chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
>
> Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
>
>
>
> If I start it on control0 and control2, then the cluster seems normal and the admin page starts working again, and cluster status looks normal:
>
>
>
> (rabbitmq)[root at chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
>
> Cluster status of node rabbit at chrnc-void-testupgrade-control-0-replace ...
>
> [{nodes,[{disc,['rabbit at chrnc-void-testupgrade-control-0',
>
>                 'rabbit at chrnc-void-testupgrade-control-0-replace',
>
>                 'rabbit at chrnc-void-testupgrade-control-1',
>
>                 'rabbit at chrnc-void-testupgrade-control-2']}]},
>
> {running_nodes,['rabbit at chrnc-void-testupgrade-control-2',
>
>                  'rabbit at chrnc-void-testupgrade-control-0-replace']},
>
> {cluster_name,<<"rabbit at chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
>
> {partitions,[]},
>
> {alarms,[{'rabbit at chrnc-void-testupgrade-control-2',[]},
>
>           {'rabbit at chrnc-void-testupgrade-control-0-replace',[]}]}]
>
>
>
> But my hypervisors are down:
>
>
>
> (openstack) [root at chrnc-void-testupgrade-build kolla-ansible]# ohll
>
> +----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
>
> | ID | Hypervisor Hostname                             | Hypervisor Type | Host IP      | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
>
> +----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
>
> |  3 | chrnc-void-testupgrade-compute-2.dev.chtrse.com | QEMU            | 172.16.2.106 | down  |          5 |     8 |           2560 |     30719 |
>
> |  6 | chrnc-void-testupgrade-compute-0.dev.chtrse.com | QEMU            | 172.16.2.31  | down  |          5 |     8 |           2560 |     30719 |
>
> |  9 | chrnc-void-testupgrade-compute-1.dev.chtrse.com | QEMU            | 172.16.0.30  | down  |          5 |     8 |           2560 |     30719 |
>
> +----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
>
>
>
> When I look at the nova-compute.log on a compute node, I see RMQ failures every 10 seconds:
>
>
>
> 172.16.2.31 compute0
>
> 2021-03-30 03:07:54.893 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
>
> 2021-03-30 03:07:55.905 7 INFO oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] Reconnected to AMQP server on 172.16.1.132:5672 via [amqp] client with port 56422.
>
> 2021-03-30 03:08:05.915 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
>
>
>
> In the RMQ logs I see this every 10 seconds:
>
>
>
> 172.16.1.132 control2
>
> [root at chrnc-void-testupgrade-control-2 ~]# tail -f /var/log/kolla/rabbitmq/rabbit\@chrnc-void-testupgrade-control-2.log |grep 172.16.2.31
>
> 2021-03-30 03:07:54.895 [warning] <0.13247.35> closing AMQP connection <0.13247.35> (172.16.2.31:56420 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
>
> client unexpectedly closed TCP connection
>
> 2021-03-30 03:07:55.901 [info] <0.15288.35> accepting AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672)
>
> 2021-03-30 03:07:55.903 [info] <0.15288.35> Connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672) has a client-provided name: nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e
>
> 2021-03-30 03:07:55.904 [info] <0.15288.35> connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e): user 'openstack' authenticated and granted access to vhost '/'
>
> 2021-03-30 03:08:05.916 [warning] <0.15288.35> closing AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
>
>
>
> Why does RMQ fail when I shut down the 2nd controller after successfully replacing the first one?

Hi Albert,

Could you share the versions of RabbitMQ and Erlang in both versions
of the container? When we initially tested this setup, I think we had
3.7.24 on both sides. Perhaps the CentOS 8 version has moved on
sufficiently to become incompatible?

Mark
>
>
>
> I apologize for the nonsense below. I have not been able to stop it from being attached to my external emails.
>
>
>
> The contents of this e-mail message and
> any attachments are intended solely for the
> addressee(s) and may contain confidential
> and/or legally privileged information. If you
> are not the intended recipient of this message
> or if this message has been addressed to you
> in error, please immediately alert the sender
> by reply e-mail and then delete this message
> and any attachments. If you are not the
> intended recipient, you are notified that
> any use, dissemination, distribution, copying,
> or storage of this message or any attachment
> is strictly prohibited.

