RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller
It looks like the problem may be caused by incompatible versions of RMQ. How can I work around that?
-----Original Message----- From: Braden, Albert Sent: Friday, April 2, 2021 8:34 AM To: 'openstack-discuss@lists.openstack.org' <openstack-discuss@lists.openstack.org> Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller
I opened a bug for this issue:
https://bugs.launchpad.net/kolla-ansible/+bug/1922269
-----Original Message----- From: Braden, Albert Sent: Thursday, April 1, 2021 11:34 AM To: 'openstack-discuss@lists.openstack.org' <openstack-discuss@lists.openstack.org> Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller
Sorry that was a typo. Stopping RMQ during the removal of the *second* controller is what causes the problem.
Is there a way to tell Centos 8 Train to use RMQ 3.7.24 instead of 3.7.28?
-----Original Message----- From: Braden, Albert Sent: Thursday, April 1, 2021 9:34 AM To: 'openstack-discuss@lists.openstack.org' <openstack-discuss@lists.openstack.org> Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller
I did some experimenting and it looks like stopping RMQ during the removal of the first controller is what causes the problem. After deploying the first controller, stopping the RMQ container on any controller including the new centos8 controller will cause the entire cluster to stop. This crash dump appears on the controllers that stopped in sympathy:
https://paste.ubuntu.com/p/ZDgFgKtQTB/
This appears in the RMQ log:
https://paste.ubuntu.com/p/5D2Qjv3H8c/
-----Original Message----- From: Braden, Albert Sent: Wednesday, March 31, 2021 8:31 AM To: openstack-discuss@lists.openstack.org Subject: RE: Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller
Centos7:
{rabbit,"RabbitMQ","3.7.24"}, "Erlang/OTP 22 [erts-10.7.2.8] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:128] [hipe]\n"},
Centos8:
{rabbit,"RabbitMQ","3.7.28"}, "Erlang/OTP 22 [erts-10.7.2.8] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:128] [hipe]\n"},
When I deploy the first Centos8 controller, RMQ comes up with all 3 nodes active and seems to be working fine until I shut down the 2nd controller. The only hint of trouble when I replace the 1st node is this error message the first time I run the deployment:
https://paste.ubuntu.com/p/h9HWdfwmrK/
and the crash dump that appears on control2:
crash dump log:
https://paste.ubuntu.com/p/MpZ8SwTJ2T/
First 1500 lines of the dump:
https://paste.ubuntu.com/p/xkCyp2B8j8/
If I wait for a few minutes then RMQ recovers on control2 and the 2nd run of the deployment seems to work, and there is no trouble until I shut down control1.
-----Original Message----- From: Mark Goddard <mark@stackhpc.com> Sent: Wednesday, March 31, 2021 4:14 AM To: Braden, Albert <C-Albert.Braden@charter.com> Cc: openstack-discuss@lists.openstack.org Subject: [EXTERNAL] Re: [kolla] Train Centos7 -> Centos8 upgrade fails on 2nd controller
CAUTION: The e-mail below is from an external source. Please exercise caution before opening attachments, clicking links, or following guidance.
On Tue, 30 Mar 2021 at 13:41, Braden, Albert <C-Albert.Braden@charter.com> wrote:
I’ve created a heat stack and installed Openstack Train to test the Centos7->8 upgrade following the document here:
https://docs.openstack.org/kolla-ansible/train/user/centos8.html#migrating-f...
I used the instructions here to successfully remove and replace control0 with a Centos8 box:
https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-host...
After this my RMQ admin page shows all 3 nodes up, including the new control0. The name of the cluster is rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com
(rabbitmq)[root@chrnc-void-testupgrade-control-2 /]# rabbitmqctl cluster_status
Cluster status of node rabbit@chrnc-void-testupgrade-control-2 ...
[{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
'rabbit@chrnc-void-testupgrade-control-0-replace',
'rabbit@chrnc-void-testupgrade-control-1',
'rabbit@chrnc-void-testupgrade-control-2']}]},
{running_nodes,['rabbit@chrnc-void-testupgrade-control-0-replace',
'rabbit@chrnc-void-testupgrade-control-1',
'rabbit@chrnc-void-testupgrade-control-2']},
{cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
{partitions,[]},
{alarms,[{'rabbit@chrnc-void-testupgrade-control-0-replace',[]},
{'rabbit@chrnc-void-testupgrade-control-1',[]},
{'rabbit@chrnc-void-testupgrade-control-2',[]}]}]
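One detail in the listing above is easy to miss by eye: four disc nodes are configured but only three are running, because the original control-0 is still registered as a cluster member. A minimal Python sketch for spotting that difference follows. It assumes the Erlang-term output format of RabbitMQ 3.7 shown here and only looks at disc nodes; the function name `parse_cluster_status` is made up for illustration. Whether the leftover member is related to the failure is not established in this thread.

```python
import re


def parse_cluster_status(text):
    """Return (configured, running) node-name sets from 3.7-style output.

    Only disc nodes are considered; RAM nodes are ignored in this sketch.
    """
    def atoms(pattern):
        match = re.search(pattern, text, re.DOTALL)
        # Node names are printed as single-quoted Erlang atoms.
        return set(re.findall(r"'([^']+)'", match.group(1))) if match else set()

    configured = atoms(r"\{nodes,\[\{disc,\[(.*?)\]\}\]\}")
    running = atoms(r"\{running_nodes,\[(.*?)\]\}")
    return configured, running


# Output copied from the cluster_status listing in this message.
SAMPLE = """\
[{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
'rabbit@chrnc-void-testupgrade-control-0-replace',
'rabbit@chrnc-void-testupgrade-control-1',
'rabbit@chrnc-void-testupgrade-control-2']}]},
{running_nodes,['rabbit@chrnc-void-testupgrade-control-0-replace',
'rabbit@chrnc-void-testupgrade-control-1',
'rabbit@chrnc-void-testupgrade-control-2']},
{cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
"""

configured, running = parse_cluster_status(SAMPLE)
stale = sorted(configured - running)
print(stale)  # ['rabbit@chrnc-void-testupgrade-control-0']
```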
After that I create a new VM to verify that the cluster is still working, and then perform the same procedure on control1. When I shut down services on control1, the ansible playbook finishes successfully:
kolla-ansible -i ../multinode stop --yes-i-really-really-mean-it --limit control1
…
control1 : ok=45 changed=22 unreachable=0 failed=0 skipped=105 rescued=0 ignored=0
After this my RMQ admin page stops responding. When I check RMQ on the new control0 and the existing control2, the container is still up but RMQ is not running:
(rabbitmq)[root@chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
If I start it on control0 and control2, then the cluster seems normal and the admin page starts working again, and cluster status looks normal:
(rabbitmq)[root@chrnc-void-testupgrade-control-0-replace /]# rabbitmqctl cluster_status
Cluster status of node rabbit@chrnc-void-testupgrade-control-0-replace ...
[{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
'rabbit@chrnc-void-testupgrade-control-0-replace',
'rabbit@chrnc-void-testupgrade-control-1',
'rabbit@chrnc-void-testupgrade-control-2']}]},
{running_nodes,['rabbit@chrnc-void-testupgrade-control-2',
'rabbit@chrnc-void-testupgrade-control-0-replace']},
{cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
{partitions,[]},
{alarms,[{'rabbit@chrnc-void-testupgrade-control-2',[]},
{'rabbit@chrnc-void-testupgrade-control-0-replace',[]}]}]
But my hypervisors are down:
(openstack) [root@chrnc-void-testupgrade-build kolla-ansible]# ohll
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
| 3 | chrnc-void-testupgrade-compute-2.dev.chtrse.com | QEMU | 172.16.2.106 | down | 5 | 8 | 2560 | 30719 |
| 6 | chrnc-void-testupgrade-compute-0.dev.chtrse.com | QEMU | 172.16.2.31 | down | 5 | 8 | 2560 | 30719 |
| 9 | chrnc-void-testupgrade-compute-1.dev.chtrse.com | QEMU | 172.16.0.30 | down | 5 | 8 | 2560 | 30719 |
+----+-------------------------------------------------+-----------------+--------------+-------+------------+-------+----------------+-----------+
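When repeating this test it can help to check the hypervisor state mechanically rather than by reading the table. The Python sketch below parses the ASCII table printed above and lists the hosts whose State is `down`. The helper name `parse_cli_table` is hypothetical, and parsing the text table is only a stopgap: the OpenStack client can also emit JSON with `-f json`, which is the more robust input if it is available.

```python
def parse_cli_table(text):
    """Turn a '|'-delimited CLI table into a list of dicts keyed by header."""
    rows = [line.strip() for line in text.splitlines()
            if line.strip().startswith("|")]
    cells = [[c.strip() for c in row.strip("|").split("|")] for row in rows]
    header, body = cells[0], cells[1:]
    return [dict(zip(header, r)) for r in body]


# Rows copied from the hypervisor listing in this message.
SAMPLE = """\
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
| 3 | chrnc-void-testupgrade-compute-2.dev.chtrse.com | QEMU | 172.16.2.106 | down | 5 | 8 | 2560 | 30719 |
| 6 | chrnc-void-testupgrade-compute-0.dev.chtrse.com | QEMU | 172.16.2.31 | down | 5 | 8 | 2560 | 30719 |
| 9 | chrnc-void-testupgrade-compute-1.dev.chtrse.com | QEMU | 172.16.0.30 | down | 5 | 8 | 2560 | 30719 |
"""

hypervisors = parse_cli_table(SAMPLE)
down = [h["Hypervisor Hostname"] for h in hypervisors if h["State"] == "down"]
print(len(down), "of", len(hypervisors), "hypervisors are down")
```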
When I look at the nova-compute.log on a compute node, I see RMQ failures every 10 seconds:
172.16.2.31 compute0
2021-03-30 03:07:54.893 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
2021-03-30 03:07:55.905 7 INFO oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] Reconnected to AMQP server on 172.16.1.132:5672 via [amqp] client with port 56422.
2021-03-30 03:08:05.915 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
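The "every 10 seconds" pattern can be measured from the log lines above. The Python sketch below pairs each "Reconnected" line with the next "is unreachable" error and reports the gap between them. The function name `reconnect_to_timeout_gaps` is made up for illustration, and the timestamp layout is assumed to match the nova-compute lines quoted here. On these three lines it yields a single gap of about 10 seconds, which suggests each new connection is accepted and then times out, rather than the broker being unreachable outright.

```python
import re
from datetime import datetime

# Leading timestamp of an oslo.log line, e.g. "2021-03-30 03:07:54.893".
STAMP = re.compile(r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}) ")


def reconnect_to_timeout_gaps(lines):
    """Seconds between each reconnect and the following 'unreachable' error."""
    gaps, last_reconnect = [], None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if "Reconnected to AMQP server" in line:
            last_reconnect = ts
        elif "is unreachable" in line and last_reconnect is not None:
            gaps.append((ts - last_reconnect).total_seconds())
            last_reconnect = None
    return gaps


# Lines copied from the nova-compute.log excerpt in this message.
LOG = [
    "2021-03-30 03:07:54.893 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out",
    "2021-03-30 03:07:55.905 7 INFO oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] Reconnected to AMQP server on 172.16.1.132:5672 via [amqp] client with port 56422.",
    "2021-03-30 03:08:05.915 7 ERROR oslo.messaging._drivers.impl_rabbit [req-70d69b45-c3a7-4fbc-b709-4d7d757e09e7 - - - - -] [aeb317a8-873f-49be-a2a0-c6d6e0891a3e] AMQP server on 172.16.1.132:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out",
]

gaps = reconnect_to_timeout_gaps(LOG)
print(gaps)
```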
In the RMQ logs I see this every 10 seconds:
172.16.1.132 control2
[root@chrnc-void-testupgrade-control-2 ~]# tail -f /var/log/kolla/rabbitmq/rabbit\@chrnc-void-testupgrade-control-2.log |grep 172.16.2.31
2021-03-30 03:07:54.895 [warning] <0.13247.35> closing AMQP connection <0.13247.35> (172.16.2.31:56420 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2021-03-30 03:07:55.901 [info] <0.15288.35> accepting AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672)
2021-03-30 03:07:55.903 [info] <0.15288.35> Connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672) has a client-provided name: nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e
2021-03-30 03:07:55.904 [info] <0.15288.35> connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e): user 'openstack' authenticated and granted access to vhost '/'
2021-03-30 03:08:05.916 [warning] <0.15288.35> closing AMQP connection <0.15288.35> (172.16.2.31:56422 -> 172.16.1.132:5672 - nova-compute:7:aeb317a8-873f-49be-a2a0-c6d6e0891a3e, vhost: '/', user: 'openstack'):
Why does RMQ fail when I shut down the 2nd controller after successfully replacing the first one?
Hi Albert,
Could you share the versions of RabbitMQ and erlang in both versions of the container? When initially testing this setup, I think we had 3.7.24 on both sides. Perhaps the CentOS 8 version has moved on sufficiently to become incompatible?
Mark
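The versions reported elsewhere in this thread (3.7.24 in the CentOS 7 container, 3.7.28 in the CentOS 8 container) can be compared with a short Python sketch. It assumes the `{rabbit,"RabbitMQ","X.Y.Z"}` tuple format quoted in the thread; the helper name `rabbit_version` is hypothetical. The check only shows that both sides are in the same 3.7 series and differ in patch level. It does not by itself prove or rule out an incompatibility.

```python
import re

# Matches the application tuple as quoted in this thread.
VERSION = re.compile(r'\{rabbit,"RabbitMQ","(\d+)\.(\d+)\.(\d+)"\}')


def rabbit_version(text):
    """Return the RabbitMQ version as a (major, minor, patch) tuple, or None."""
    match = VERSION.search(text)
    return tuple(int(part) for part in match.groups()) if match else None


centos7 = rabbit_version('{rabbit,"RabbitMQ","3.7.24"},')
centos8 = rabbit_version('{rabbit,"RabbitMQ","3.7.28"},')

same_series = centos7[:2] == centos8[:2]
patch_delta = centos8[2] - centos7[2]
print(centos7, centos8, same_series, patch_delta)
# (3, 7, 24) (3, 7, 28) True 4
```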
On Mon, 5 Apr 2021 at 15:33, Braden, Albert <C-Albert.Braden@charter.com> wrote:
It looks like the problem may be caused by incompatible versions of RMQ. How can I work around that?
Hi Albert, thanks for testing this procedure and reporting issues. I suggest we continue the discussion on the bug report.
https://bugs.launchpad.net/kolla-ansible/+bug/1922269
Mark
participants (2)
- Braden, Albert
- Mark Goddard