[magnum] [kolla-ansible] [kayobe] [Victoria] Magnum Kubernetes cluster failure recovery
feilong
feilong at catalystcloud.nz
Wed Aug 11 09:06:33 UTC 2021
Hi Tony,
If I understand correctly, your Magnum environment can now create k8s
clusters successfully, but the auto-scaling failure left the cluster in
UPDATE_FAILED status. Is that right? If so, a cluster resize should be
able to bring the cluster back: just resize the cluster to its current
node count. In that case, Magnum should be able to fix the Heat stack.
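
As a rough sketch of that resize (not a tested procedure): the commands
below assume the openstack client with the Magnum plugin is installed.
"mycluster" and the node count 3 are placeholders, and the run helper
with its DRY_RUN switch is a hypothetical addition so that the commands
are only printed unless DRY_RUN=0 is set.

```shell
# Sketch only: "mycluster" and the count 3 are placeholders.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Look up the node count Magnum currently records for the cluster.
current_node_count() {
    run openstack coe cluster show "$1" -f value -c node_count
}

# Resize to the given count. Passing the current count keeps the size
# unchanged while asking Magnum to update the Heat stack again.
resize_cluster() {
    run openstack coe cluster resize "$1" "$2"
}

current_node_count mycluster
resize_cluster mycluster 3
```

Run it once with the default DRY_RUN=1 to review the commands, then
substitute the real cluster name and node count before setting DRY_RUN=0.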
If the resize fails, it is best to check the Heat log to understand
why the Heat stack update failed.
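
Besides reading the heat-engine log itself, the Heat API can usually
report why a stack update failed. The sketch below assumes the openstack
client with the Heat and Magnum plugins is installed. "mycluster" and
"STACK_ID" are placeholders, and the run helper with its DRY_RUN switch
is a hypothetical addition that prints the commands unless DRY_RUN=0 is
set.

```shell
# Sketch only: "mycluster" and "STACK_ID" are placeholders.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Magnum records the ID of the Heat stack that backs the cluster.
cluster_stack_id() {
    run openstack coe cluster show "$1" -f value -c stack_id
}

# List the failed resources of the stack with their status reasons.
stack_failures() {
    run openstack stack failures list --long "$1"
}

# Show stack events, descending two levels into nested stacks.
stack_events() {
    run openstack stack event list --nested-depth 2 "$1"
}

cluster_stack_id mycluster
stack_failures STACK_ID
stack_events STACK_ID
```

Replace STACK_ID with the value printed by the first command when
running against a real deployment.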
On 11/08/21 7:04 pm, Tony Pearce wrote:
> I sent this mail last week looking for some insight regarding a
> Magnum issue we had. I hadn't seen any reply, so I searched for my
> sent mail and found that I had not completed the subject line. Sorry
> about that.
>
> Resending here with a subject. If anyone has any insight into this,
> I'd be grateful to hear from you.
>
> Kind regards,
>
> Tony Pearce
>
> ---------- Forwarded message ---------
> From: Tony Pearce <tonyppe at gmail.com>
> Date: Thu, 5 Aug 2021 at 14:22
> Subject: [magnum] [kolla-ansible] [kayobe] [Victoria]
> To: OpenStack Discuss <openstack-discuss at lists.openstack.org>
>
>
> While testing Kubernetes with the Magnum project, deployed via kayobe
> on Victoria, we deployed an auto-scaling cluster and ran into a
> problem that I'm not sure how to resolve. I understand that the
> cluster tried to scale up, but the OpenStack project did not have
> enough CPU resources to accommodate it (error= Quota exceeded for
> cores: Requested 4, but already used 20 of 20 cores).
>
> So the situation is that the cluster shows "healthy" and
> "UPDATE_FAILED" but also kubectl commands are failing [1].
>
> What is required to return the cluster back to a working status at
> this point? I have tried:
> - cluster resize to reduce number of workers
> - cluster resize to increase number of workers after increasing
> project quota
> - cluster resize and maintaining the same number of workers
>
> When trying any of the above, horizon shows an immediate error "Unable
> to resize given cluster" but magnum logs and heat logs do not show any
> log update at all at that time.
>
> - using "check stack" and resume stack in the stack horizon menu gives
> this error [2]
>
> Investigating the kubectl issue, it was noted that some services had
> failed on the master node [3]. Manually starting them, as well as
> rebooting the node, did not bring up the services. Unfortunately I
> don't have ssh access to the master, and no further information has
> been forthcoming regarding logs for those service failures, so I am
> unable to provide anything around that here.
>
> I found this link [4] so I decided to delete the master node then run
> "check" cluster again but the check cluster just fails in the same way
> except this time it fails saying that it cannot find the master [5]
> while the previous error was that it could not find a node.
>
> Ideally I would prefer to recover the cluster - whether this is still
> possible I am unsure. I can probably recreate this scenario again.
> What steps should be performed in this case to restore the cluster?
>
>
>
> [1]
> kubectl get no
> Error from server (Timeout): the server was unable to return a
> response in the time allotted, but may still be processing the request
> (get nodes)
>
> [2]
> Resource CHECK failed: ["['NotFound:
> resources[4].resources.kube-minion: Instance None could not be found.
> (HTTP 404) (Request-ID: req-6069ff6a-9eb6-4bce-bb25-4ef001ebc428)'].
> 'CHECK' not fully supported (see resources)"]
>
> [3]
>
> [systemd]
> Failed Units: 3
> etcd.service
> heat-container-agent.service
> logrotate.service
>
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1459854
>
> [5]
>
> ["['NotFound:
> resources.kube_masters.resources[0].resources.kube-master: Instance
> c6185e8e-1a98-4925-959b-0a56210b8c9e could not be found. (HTTP 404)
> (Request-ID: req-bdfcc853-7dbb-4022-9208-68b1ab31008a)']. 'CHECK' not
> fully supported (see resources)"].
>
> Kind regards,
>
> Tony Pearce
--
Cheers & Best regards,
------------------------------------------------------------------------------
Feilong Wang (王飞龙) (he/him)
Head of Research & Development
Catalyst Cloud
Aotearoa's own
Mob: +64 21 0832 6348 | www.catalystcloud.nz
Level 6, 150 Willis Street, Wellington 6011, New Zealand
------------------------------------------------------------------------------