[EXTERNAL] Re: [kolla] Train Centos7 -> Centos8 upgrade fails on masakari
Mark Goddard
mark at stackhpc.com
Fri Mar 26 08:43:53 UTC 2021
On Thu, 25 Mar 2021 at 21:01, Braden, Albert
<C-Albert.Braden at charter.com> wrote:
>
> After about 2 hours the CPU settles down and then control0 joins the RMQ cluster and the admin display looks normal. The mysql container on control0 is stopped and the elasticsearch container is restarting every 60 seconds.
>
> acf52c003292 kolla/centos-source-mariadb:train-centos8 "dumb-init -- kolla_…" 5 hours ago Exited (128) 3 hours ago mariadb
> 9bc064cf9b2b kolla/centos-source-elasticsearch6:train-centos8 "dumb-init --single-…" 5 hours ago Restarting (1) 20 seconds ago elasticsearch
>
> The mariadb container refuses to start:
>
> [root@chrnc-void-testupgrade-control-0-replace keystone]# docker start mariadb
> Error response from daemon: OCI runtime create failed: container with id exists: acf52c003292e4841af15bb3c2894b983e37de5a65fc726ae2db2049f0e6774c: unknown
>
> I see a lot of this in mariadb.log on control0:
>
> 2021-03-25 17:29:16 13436 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-2.dev.chtrse.com' (using password: NO)
> 2021-03-25 17:29:16 13437 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-1.dev.chtrse.com' (using password: NO)
> 2021-03-25 17:29:17 13438 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-0-replace.dev.chtrse.com' (using password: NO)
> 2021-03-25 17:29:18 13441 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-2.dev.chtrse.com' (using password: NO)
> 2021-03-25 17:29:18 13442 [Warning] Access denied for user 'haproxy'@'chrnc-void-testupgrade-control-1.dev.chtrse.com' (using password: NO)
>
> Here's the entire mariadb.log starting when I created the cluster and ending when the container died:
>
> https://paste.ubuntu.com/p/FCn9pB6zV4/
>
> The CPU is no longer consumed, but the times might be a clue:
>
> 60 root 20 0 0 0 0 S 0.0 0.0 104:07.36 kswapd0
> 1 root 20 0 253576 9756 4580 S 0.0 0.1 42:22.65 systemd
> 28670 42472 20 0 180416 14140 5812 S 0.0 0.2 20:00.75 memcached_expor
> 28515 42472 20 0 332512 29968 3408 S 0.0 0.4 12:27.02 mysqld_exporter
> 28436 42472 20 0 251896 21584 6956 S 0.0 0.3 12:21.69 node_exporter
> 14857 root 20 0 980460 45684 6592 S 0.0 0.6 9:55.42 containerd
> 34608 42425 20 0 732436 106800 9012 S 0.0 1.4 9:03.50 httpd
> 34609 42425 20 0 732436 106812 8736 S 0.0 1.4 9:01.90 httpd
> 29123 42472 20 0 49360 9840 1288 S 0.3 0.1 8:00.38 elasticsearch_e
> 15034 root 20 0 2793200 68112 0 S 0.0 0.9 6:36.63 dockerd
> 23161 root 20 0 113120 6056 0 S 0.0 0.1 6:25.05 containerd-shim
> 28592 42472 20 0 112820 16988 6348 S 0.0 0.2 6:00.40 haproxy_exporte
> 57248 42472 20 0 253568 74152 0 S 0.0 1.0 5:47.56 openstack-expor
> 28950 42472 20 0 120256 24664 9456 S 0.0 0.3 5:22.03 alertmanager
> 31847 root 20 0 235156 2760 2196 S 0.0 0.0 5:00.97 bash
>
> I'll build another cluster and watch top during the upgrade to see what is consuming the CPU.
Hi Albert,
I would suggest using the --tags argument to go through the
kolla-ansible deploy on the new controller service by service. You can
check site.yml for the order of the plays.
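
As a rough sketch of what that could look like: the wrapper below runs the deploy one tag at a time and stops at the first failure. The inventory and limit come from the command in the original message. The tag list and its order are only assumptions and should be checked against site.yml for your release. DRY_RUN is a hypothetical safety switch in this sketch, not a kolla-ansible option; by default the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: run the deploy one service tag at a time so a failure
# can be pinned to a single play.
set -euo pipefail

INVENTORY="${INVENTORY:-multinode}"   # inventory file from the original command
LIMIT="${LIMIT:-control}"             # host limit from the original command
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print commands (sketch-only switch)

# Assumed tag list and order; verify against the plays in site.yml.
TAGS=(haproxy mariadb rabbitmq keystone)

deploy_cmd() {
    # Print the kolla-ansible command for a single tag.
    printf 'kolla-ansible -i %s deploy --limit %s --tags %s\n' \
        "$INVENTORY" "$LIMIT" "$1"
}

run_tags() {
    local tag cmd
    for tag in "${TAGS[@]}"; do
        cmd="$(deploy_cmd "$tag")"
        echo "$cmd"
        if [ "$DRY_RUN" != "1" ]; then
            # Stop at the first failing service so the host can be inspected.
            $cmd || { echo "deploy failed at tag: $tag" >&2; return 1; }
        fi
    done
}

run_tags
```

Running it with DRY_RUN=0 and checking top on the new controller between tags should narrow down which service triggers the CPU spike.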
Mark
>
> -----Original Message-----
> From: Radosław Piliszek <radoslaw.piliszek at gmail.com>
> Sent: Thursday, March 25, 2021 2:17 PM
> To: Braden, Albert <C-Albert.Braden at charter.com>
> Cc: openstack-discuss at lists.openstack.org
> Subject: [EXTERNAL] Re: [kolla] Train Centos7 -> Centos8 upgrade fails on masakari
>
> Hi Albert,
>
> I can assure you this is unrelated to Masakari.
> As you have observed, it's the RabbitMQ and Keystone (perhaps due to
> MariaDB?) that failed.
> Something is abusing the CPU there. What is that process?
>
> -yoctozepto
>
> On Thu, Mar 25, 2021 at 7:09 PM Braden, Albert
> <C-Albert.Braden at charter.com> wrote:
> >
> > I’ve created a Heat stack and installed OpenStack Train to test the Centos7->8 upgrade, following the document here:
> >
> > https://docs.openstack.org/kolla-ansible/train/user/centos8.html#migrating-from-centos-7-to-centos-8
> >
> > Everything seems to work fine until I try to deploy the first replacement controller into the cluster. I upgrade RMQ, ES and Kibana, then follow the “remove existing controller” process to remove control0, create a Centos 8 VM, bootstrap the cluster, pull containers to the new control0, and everything is still working. Then I run the last command, “kolla-ansible -i multinode deploy --limit control”.
> >
> > The RMQ install works and I see all 3 nodes up in the RMQ admin, but it takes a long time to complete “TASK [service-ks-register : masakari | Creating users]” and then hangs on “TASK [service-ks-register : masakari | Creating roles]”. At this time the new control0 becomes unreachable and drops out of the RMQ cluster. I can still ping it, but the console hangs along with new and existing ssh sessions. It appears that the CPU may be maxed out and not allowing interrupts. Eventually I see the error “fatal: [control0]: FAILED! => {"msg": "Timeout (12s) waiting for privilege escalation prompt: "}”
> >
> > RMQ seems fine on the 2 old controllers; they just don’t see the new control0 active:
> >
> > (rabbitmq)[root@chrnc-void-testupgrade-control-2 /]# rabbitmqctl cluster_status
> > Cluster status of node rabbit@chrnc-void-testupgrade-control-2 ...
> > [{nodes,[{disc,['rabbit@chrnc-void-testupgrade-control-0',
> >                 'rabbit@chrnc-void-testupgrade-control-0-replace',
> >                 'rabbit@chrnc-void-testupgrade-control-1',
> >                 'rabbit@chrnc-void-testupgrade-control-2']}]},
> >  {running_nodes,['rabbit@chrnc-void-testupgrade-control-1',
> >                  'rabbit@chrnc-void-testupgrade-control-2']},
> >  {cluster_name,<<"rabbit@chrnc-void-testupgrade-control-0.dev.chtrse.com">>},
> >  {partitions,[]},
> >  {alarms,[{'rabbit@chrnc-void-testupgrade-control-1',[]},
> >           {'rabbit@chrnc-void-testupgrade-control-2',[]}]}]
> >
> > After this the HAProxy IP is pingable, but openstack commands are failing:
> >
> > (openstack) [root@chrnc-void-testupgrade-build openstack]# osl
> > Failed to discover available identity versions when contacting http://172.16.0.100:35357/v3. Attempting to parse version from URL.
> > Gateway Timeout (HTTP 504)
> >
> > After about an hour my open ssh session on the new control0 responded and confirmed that the CPU is maxed out:
> >
> > [root@chrnc-void-testupgrade-control-0-replace /]# uptime
> > 17:41:55 up 1:55, 0 users, load average: 157.87, 299.75, 388.69
> >
> > I built new heat stacks and tried it a few times, and it consistently fails on masakari. Do I need to change something in my masakari config before upgrading Train from Centos 7 to Centos 8?
> >
> > I apologize for the nonsense below. I have not been able to stop it from being attached to my external emails.
> >
> > The contents of this e-mail message and
> > any attachments are intended solely for the
> > addressee(s) and may contain confidential
> > and/or legally privileged information. If you
> > are not the intended recipient of this message
> > or if this message has been addressed to you
> > in error, please immediately alert the sender
> > by reply e-mail and then delete this message
> > and any attachments. If you are not the
> > intended recipient, you are notified that
> > any use, dissemination, distribution, copying,
> > or storage of this message or any attachment
> > is strictly prohibited.