[EXTERNAL] Re: [kolla] Train Centos7 -> Centos8 upgrade fails on masakari

Braden, Albert C-Albert.Braden at charter.com
Thu Mar 25 21:01:09 UTC 2021


After about 2 hours the CPU settles down, control0 joins the RMQ cluster, and the RMQ admin display looks normal. The mariadb container on control0 is stopped and the elasticsearch container is restarting every 60 seconds:

acf52c003292   kolla/centos-source-mariadb:train-centos8                             "dumb-init -- kolla_…"   5 hours ago   Exited (128) 3 hours ago                  mariadb
9bc064cf9b2b   kolla/centos-source-elasticsearch6:train-centos8                      "dumb-init --single-…"   5 hours ago   Restarting (1) 20 seconds ago             elasticsearch
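
Before digging further I'll grab the elasticsearch logs to see why it keeps restarting; a quick check (a sketch, using the container name from the listing above):

# Last output from the restarting container, plus the restart count and error Docker recorded
docker logs --tail 50 elasticsearch
docker inspect elasticsearch --format '{{.RestartCount}} {{.State.Error}}'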

The mariadb container refuses to start:

[root@chrnc-void-testupgrade-control-0-replace keystone]# docker start mariadb
Error response from daemon: OCI runtime create failed: container with id exists: acf52c003292e4841af15bb3c2894b983e37de5a65fc726ae2db2049f0e6774c: unknown
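
If I have to recover it, my plan is roughly this (a sketch only; it assumes the data lives in the named "mariadb" docker volume so removing the dead container record is safe, and that --tags is honored by my kolla-ansible version):

# Confirm the stale container record and that the data volume still exists
docker inspect mariadb --format '{{.State.Status}} {{.State.ExitCode}}'
docker volume ls | grep mariadb
# Remove the dead container record (the volume is not touched) ...
docker rm mariadb
# ... and let kolla-ansible recreate it; if the whole galera cluster were down
# I would use "kolla-ansible -i multinode mariadb_recovery" instead
kolla-ansible -i multinode deploy --limit control --tags mariadb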

I see a lot of this in mariadb.log on control0:

2021-03-25 17:29:16 13436 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-2.dev.chtrse.com' (using password: NO)
2021-03-25 17:29:16 13437 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-1.dev.chtrse.com' (using password: NO)
2021-03-25 17:29:17 13438 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-0-replace.dev.chtrse.com' (using password: NO)
2021-03-25 17:29:18 13441 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-2.dev.chtrse.com' (using password: NO)
2021-03-25 17:29:18 13442 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-1.dev.chtrse.com' (using password: NO)

Here's the entire mariadb.log starting when I created the cluster and ending when the container died:

https://paste.ubuntu.com/p/FCn9pB6zV4/
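
The haproxy warnings look like haproxy's MySQL health checks being rejected rather than the actual failure; to rule that out I can check from one of the working controllers that the passwordless check user is still there (a sketch; 'haproxy' is the check user kolla configures, and the root password comes from passwords.yml):

# Run on control1/control2 where mariadb is still up
docker exec -it mariadb mysql -u root -p -e "SELECT User, Host FROM mysql.user WHERE User = 'haproxy';"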

The CPU is no longer being consumed, but the cumulative TIME+ values in top might be a clue:

    60 root      20   0       0      0      0 S   0.0   0.0 104:07.36 kswapd0
      1 root      20   0  253576   9756   4580 S   0.0   0.1  42:22.65 systemd
  28670 42472     20   0  180416  14140   5812 S   0.0   0.2  20:00.75 memcached_expor
  28515 42472     20   0  332512  29968   3408 S   0.0   0.4  12:27.02 mysqld_exporter
  28436 42472     20   0  251896  21584   6956 S   0.0   0.3  12:21.69 node_exporter
  14857 root      20   0  980460  45684   6592 S   0.0   0.6   9:55.42 containerd
  34608 42425     20   0  732436 106800   9012 S   0.0   1.4   9:03.50 httpd
  34609 42425     20   0  732436 106812   8736 S   0.0   1.4   9:01.90 httpd
  29123 42472     20   0   49360   9840   1288 S   0.3   0.1   8:00.38 elasticsearch_e
  15034 root      20   0 2793200  68112      0 S   0.0   0.9   6:36.63 dockerd
  23161 root      20   0  113120   6056      0 S   0.0   0.1   6:25.05 containerd-shim
  28592 42472     20   0  112820  16988   6348 S   0.0   0.2   6:00.40 haproxy_exporte
  57248 42472     20   0  253568  74152      0 S   0.0   1.0   5:47.56 openstack-expor
  28950 42472     20   0  120256  24664   9456 S   0.0   0.3   5:22.03 alertmanager
  31847 root      20   0  235156   2760   2196 S   0.0   0.0   5:00.97 bash

I'll build another cluster and watch top during the upgrade to see what is consuming the CPU.
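
Rather than watching interactively I'll record it this time, so the spike can be attributed to a process afterwards; something like this on the new controller before starting the deploy (a sketch; interval, count and log path are arbitrary):

# 30-second batch-mode snapshots for ~2 hours, in the background (top sorts by %CPU by default)
top -b -d 30 -n 240 > /root/top-during-upgrade.log 2>&1 &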

-----Original Message-----
From: Radosław Piliszek <radoslaw.piliszek at gmail.com> 
Sent: Thursday, March 25, 2021 2:17 PM
To: Braden, Albert <C-Albert.Braden at charter.com>
Cc: openstack-discuss at lists.openstack.org
Subject: [EXTERNAL] Re: [kolla] Train Centos7 -> Centos8 upgrade fails on masakari


Hi Albert,

I can assure you this is unrelated to Masakari.
As you have observed, it's the RabbitMQ and Keystone (perhaps due to
MariaDB?) that failed.
Something is abusing the CPU there. What is that process?

-yoctozepto

On Thu, Mar 25, 2021 at 7:09 PM Braden, Albert
<C-Albert.Braden at charter.com> wrote:
>
> I’ve created a heat stack and installed OpenStack Train to test the CentOS 7 -> CentOS 8 upgrade following the document here:
>
> https://docs.openstack.org/kolla-ansible/train/user/centos8.html#migrating-from-centos-7-to-centos-8
>
> Everything seems to work fine until I try to deploy the first replacement controller into the cluster. I upgrade RMQ, ES and Kibana, then follow the “remove existing controller” process to remove control0, create a CentOS 8 VM, bootstrap the cluster, pull containers to the new control0, and everything is still working. Then I type the last command “kolla-ansible -i multinode deploy --limit control”.
>
> The RMQ install works and I see all 3 nodes up in the RMQ admin, but it takes a long time to complete “TASK [service-ks-register : masakari | Creating users]” and then hangs on “TASK [service-ks-register : masakari | Creating roles]”. At this time the new control0 becomes unreachable and drops out of the RMQ cluster. I can still ping it, but the console hangs along with new and existing ssh sessions. It appears that the CPU may be maxed out and not allowing interrupts. Eventually I see the error “fatal: [control0]: FAILED! => {"msg": "Timeout (12s) waiting for privilege escalation prompt: "}”.
>
> RMQ seems fine on the 2 old controllers; they just don’t see the new control0 active:
>
> (rabbitmq)[root@chrnc-void-testupgrade-control-2 /]# rabbitmqctl cluster_status
> Cluster status of node rabbit@chrnc-void-testupgrade-control-2 ...
> [{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
>                 'rabbit@chrnc-void-testupgrade-control-0-replace',
>                 'rabbit@chrnc-void-testupgrade-control-1',
>                 'rabbit@chrnc-void-testupgrade-control-2']}]},
>  {running_nodes,['rabbit@chrnc-void-testupgrade-control-1',
>                  'rabbit@chrnc-void-testupgrade-control-2']},
>  {cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
>  {partitions,[]},
>  {alarms,[{'rabbit@chrnc-void-testupgrade-control-1',[]},
>           {'rabbit@chrnc-void-testupgrade-control-2',[]}]}]
>
> After this the HAProxy IP is pingable but openstack commands are failing:
>
> (openstack) [root@chrnc-void-testupgrade-build openstack]# osl
> Failed to discover available identity versions when contacting http://172.16.0.100:35357/v3. Attempting to parse version from URL.
> Gateway Timeout (HTTP 504)
>
> After about an hour my open ssh session on the new control0 responded and confirmed that the CPU is maxed out:
>
> [root@chrnc-void-testupgrade-control-0-replace /]# uptime
> 17:41:55 up  1:55,  0 users,  load average: 157.87, 299.75, 388.69
>
> I built new heat stacks and tried it a few times, and it consistently fails on masakari. Do I need to change something in my masakari config before upgrading Train from CentOS 7 to CentOS 8?
>
> I apologize for the nonsense below. I have not been able to stop it from being attached to my external emails.
>
> The contents of this e-mail message and
> any attachments are intended solely for the
> addressee(s) and may contain confidential
> and/or legally privileged information. If you
> are not the intended recipient of this message
> or if this message has been addressed to you
> in error, please immediately alert the sender
> by reply e-mail and then delete this message
> and any attachments. If you are not the
> intended recipient, you are notified that
> any use, dissemination, distribution, copying,
> or storage of this message or any attachment
> is strictly prohibited.

