[magnum] [kolla-ansible] [kayobe] [Victoria] Magnum Kubernetes cluster failure recovery

Sven Kieske S.Kieske at mittwald.de
Wed Aug 11 10:16:33 UTC 2021


Hi Tony,

We don't run the Victoria release, but maybe I can give you some pointers
on where to look:

As far as I understand Magnum and the Kubernetes autoscaler, Magnum uses
Heat to create stacks for the initial Kubernetes deployment.
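
To find the Heat stack that belongs to a given cluster, something like this
should work (untested sketch, the cluster name is just a placeholder for your
environment):

  openstack coe cluster show <your-cluster> -c stack_id
  openstack stack show <stack-id-from-above>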

The problem is that the Kubernetes autoscaler talks directly to the OpenStack APIs,
e.g. Nova, for creating and destroying instances.

This can result in some weird situations: e.g. the autoscaler deletes volumes,
but Heat is never involved, so Heat still thinks a volume exists which actually doesn't.

So you might want to check all resources in your Magnum Heat stacks to see whether
they really exist, or whether the autoscaler did things to them.
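
For example, to compare what Heat thinks exists with what Nova actually has
(untested sketch, stack name and nesting depth are placeholders):

  # list all resources of the cluster stack, including nested stacks
  openstack stack resource list -n 4 <cluster-stack-id>

  # cross-check each OS::Nova::Server physical_resource_id against Nova
  openstack server show <physical-resource-id>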

If you e.g. find resources that were deleted by the autoscaler, mark the appropriate
Heat stack or resource as unhealthy and trigger a stack update, so Heat can do its thing.
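
Roughly like this (again untested, all names are placeholders):

  # tell Heat that a specific resource is broken
  openstack stack resource mark unhealthy <nested-stack-id> <resource-name> \
    "instance was deleted by the autoscaler"

  # let Heat reconcile, re-using the existing template and parameters
  openstack stack update --existing <cluster-stack-id>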

The stack should then return to a healthy status.

If someone has a solution to this general problem, I would be very interested
(besides the obvious "solution" of just disabling the autoscaler)!

HTH

Sven

On Wed, 2021-08-11 at 15:04 +0800, Tony Pearce wrote:
> I sent this mail last week looking for some insight regarding a
> Magnum issue we had. I hadn't seen any reply, and when I searched for my sent
> mail I found I had not completed the subject line. Sorry about that.
> 
> Resending here with a subject. If anyone has any insight into this I'd
> be grateful to hear from you.
> 
> Kind regards,
> 
> Tony Pearce
> 
> 
> 
> 
> ---------- Forwarded message ---------
> From: Tony Pearce <tonyppe at gmail.com>
> Date: Thu, 5 Aug 2021 at 14:22
> Subject: [magnum] [kolla-ansible] [kayobe] [Victoria]
> To: OpenStack Discuss <openstack-discuss at lists.openstack.org>
> 
> 
> Testing out Kubernetes with the Magnum project, deployed via Kayobe on Victoria,
> we have deployed an auto-scaling cluster and have run into a problem, and
> I'm not sure how to proceed. I understand that the cluster tried to scale
> up, but the OpenStack project did not have enough CPU resources to
> accommodate it (error = Quota exceeded for cores: Requested 4, but already
> used 20 of 20 cores).
> 
> So the situation is that the cluster shows "healthy" and "UPDATE_FAILED",
> but kubectl commands are also failing [1].
> 
> What is required to return the cluster to a working status at this
> point? I have tried:
> - cluster resize to reduce number of workers
> - cluster resize to increase number of workers after increasing project
> quota
> - cluster resize and maintaining the same number of workers
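> 
> (For reference, the CLI equivalent of the resize attempts above should be
> something like "openstack coe cluster resize <cluster-name> <node-count>",
> with placeholder names.)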
> 
> When trying any of the above, Horizon shows an immediate error, "Unable to
> resize given cluster", but the Magnum and Heat logs do not show any
> update at all at that time.
> 
> - using "check stack" and resume stack in the stack horizon menu gives this
> error [2]
> 
> Investigating the kubectl issue, it was noted that some services had failed
> on the master node [3]. Neither a manual start nor a reboot of the node brought
> the services back up. Unfortunately I don't have SSH access to the master,
> and no further information has been forthcoming regarding logs for those
> service failures, so I am unable to provide anything about that here.
> 
> I found this link [4], so I decided to delete the master node and then run
> "check" on the cluster again, but the check just fails in the same way,
> except this time it says that it cannot find the master [5], while
> the previous error was that it could not find a minion node.
> 
> Ideally I would prefer to recover the cluster, though I am unsure whether this
> is still possible. I can probably recreate this scenario again. What
> steps should be performed in this case to restore the cluster?
> 
> 
> 
> [1]
> kubectl get no
> Error from server (Timeout): the server was unable to return a response in
> the time allotted, but may still be processing the request (get nodes)
> 
> [2]
> Resource CHECK failed: ["['NotFound: resources[4].resources.kube-minion:
> Instance None could not be found. (HTTP 404) (Request-ID:
> req-6069ff6a-9eb6-4bce-bb25-4ef001ebc428)']. 'CHECK' not fully supported
> (see resources)"]
> 
> [3]
> 
> [systemd]
> Failed Units: 3
>   etcd.service
>   heat-container-agent.service
>   logrotate.service
> 
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1459854
> 
> [5]
> 
> ["['NotFound: resources.kube_masters.resources[0].resources.kube-master:
> Instance c6185e8e-1a98-4925-959b-0a56210b8c9e could not be found. (HTTP
> 404) (Request-ID: req-bdfcc853-7dbb-4022-9208-68b1ab31008a)']. 'CHECK' not
> fully supported (see resources)"].
> 
> Kind regards,
> 
> Tony Pearce

-- 
Kind regards

Sven Kieske
Systems Developer

Mittwald CM Service GmbH & Co. KG
Königsberger Straße 4-6
32339 Espelkamp

Tel.: 05772 / 293-900
Fax: 05772 / 293-333

https://www.mittwald.de

Managing directors: Robert Meyer, Florian Jürgens

Tax no.: 331/5721/1033, VAT ID: DE814773217, HRA 6640, Bad Oeynhausen local court
General partner: Robert Meyer Verwaltungs GmbH, HRB 13260, Bad Oeynhausen local court

Information on data processing in the course of our business activities
pursuant to Art. 13-14 GDPR is available at www.mittwald.de/ds.
