[magnum] [kolla-ansible] [kayobe] [Victoria]
Testing out Kubernetes with the Magnum project, deployed via Kayobe on Victoria, we have deployed an auto scaling cluster and have run into a problem, and I'm not sure how to proceed. I understand that the cluster tried to scale up but the OpenStack project did not have enough CPU resources to accommodate it (error: "Quota exceeded for cores: Requested 4, but already used 20 of 20 cores").

So the situation is that the cluster shows "healthy" and "UPDATE_FAILED", but kubectl commands are also failing [1]. What is required to return the cluster to a working status at this point? I have tried:

- cluster resize to reduce the number of workers
- cluster resize to increase the number of workers, after increasing the project quota
- cluster resize keeping the same number of workers

When trying any of the above, Horizon shows an immediate error, "Unable to resize given cluster", but the Magnum and Heat logs show no log update at all at that time. I also tried using "check stack" and "resume stack" in the Horizon stack menu, which gives this error [2].

Investigating the kubectl issue, it was noted that some services had failed on the master node [3]. A manual start, as well as rebooting the node, did not bring up the services. Unfortunately I don't have SSH access to the master, and no further information has been forthcoming with regard to logs for those service failures, so I am unable to provide anything around that here.

I found this link [4], so I decided to delete the master node and then run "check" cluster again, but the cluster check just fails in the same way, except this time it says it cannot find the master [5], whereas the previous error was that it could not find a node.

Ideally I would prefer to recover the cluster; whether this is still possible I am unsure. I can probably recreate this scenario again. What steps should be performed in this case to restore the cluster?

[1] kubectl get no
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)

[2] Resource CHECK failed: ["['NotFound: resources[4].resources.kube-minion: Instance None could not be found. (HTTP 404) (Request-ID: req-6069ff6a-9eb6-4bce-bb25-4ef001ebc428)']. 'CHECK' not fully supported (see resources)"]

[3] [systemd] Failed Units: 3
etcd.service
heat-container-agent.service
logrotate.service

[4] https://bugzilla.redhat.com/show_bug.cgi?id=1459854

[5] ["['NotFound: resources.kube_masters.resources[0].resources.kube-master: Instance c6185e8e-1a98-4925-959b-0a56210b8c9e could not be found. (HTTP 404) (Request-ID: req-bdfcc853-7dbb-4022-9208-68b1ab31008a)']. 'CHECK' not fully supported (see resources)"]

Kind regards,
Tony Pearce
I sent this mail last week looking for some insight into a Magnum issue we had. I hadn't seen any reply, and when I searched for my sent mail I found I had not completed the subject line. Sorry about that. Resending here with a subject; if anyone has any insight into this I'd be grateful to hear from you.

Kind regards,
Tony Pearce

---------- Forwarded message ---------
From: Tony Pearce <tonyppe@gmail.com>
Date: Thu, 5 Aug 2021 at 14:22
Subject: [magnum] [kolla-ansible] [kayobe] [Victoria]
To: OpenStack Discuss <openstack-discuss@lists.openstack.org>
Hi Tony,

If I understand correctly, your Magnum env can create k8s clusters successfully, but the auto scaling failure caused the UPDATE_FAILED status, is that right? If so, a cluster resize should be able to bring the cluster back, and you can simply resize the cluster to its current node count. In that case, Magnum should be able to fix the Heat stack. If the resize fails, then it's better to check the Heat log to understand why the stack update failed.
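From the CLI, that recovery resize is a one-liner; a minimal sketch, assuming python-magnumclient is installed (angle-bracket names are placeholders, not values from this thread):

    # Confirm the cluster's current status and node count first
    openstack coe cluster show <cluster-name-or-id> \
        -c status -c health_status -c node_count

    # Resize to the node count the cluster already has; Magnum then
    # triggers a Heat stack update that should reconcile the nodes
    openstack coe cluster resize <cluster-name-or-id> <current-node-count>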
-- Cheers & Best regards, ------------------------------------------------------------------------------ Feilong Wang (王飞龙) (he/him) Head of Research & Development Catalyst Cloud Aotearoa's own Mob: +64 21 0832 6348 | www.catalystcloud.nz Level 6, 150 Willis Street, Wellington 6011, New Zealand CONFIDENTIALITY NOTICE: This email is intended for the named recipients only. It may contain privileged, confidential or copyright information. If you are not the named recipient, any use, reliance upon, disclosure or copying of this email or its attachments is unauthorised. If you have received this email in error, please reply via email or call +64 21 0832 6348. ------------------------------------------------------------------------------
Hi Tony,

we don't run the Victoria release, but maybe I can give you some pointers on where to look.

As far as I understand Magnum and the Kubernetes autoscaler, Magnum uses Heat to create the stacks for the initial Kubernetes deployment. The problem is that the Kubernetes autoscaler talks directly to the OpenStack APIs, e.g. Nova, to create and destroy instances. This can result in some weird situations, e.g. the autoscaler deletes volumes but Heat is never involved, so Heat still thinks a volume is there which isn't.

So you might want to check all resources in your Magnum Heat stacks and verify whether they are really there, or whether the autoscaler did things to them. If you find resources deleted by the autoscaler, mark them as unhealthy in the appropriate Heat stack and trigger a stack update so Heat can do its thing; the stack should then return to a healthy status, as sketched below.

If someone has a solution to this general problem I would be very interested (besides the obvious "solution" of just disabling the autoscaler)!

HTH

Sven
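A rough sketch of those steps with the Heat CLI (stack and resource names are placeholders; on a Magnum cluster the top-level stack name normally starts with the cluster name):

    # List all resources, recursing into nested stacks, and look for
    # entries whose backing server or volume no longer exists
    openstack stack resource list <stack-name-or-id> -n 5

    # Mark an orphaned resource as unhealthy so Heat will replace it
    openstack stack resource mark unhealthy <stack-name-or-id> <resource-name>

    # Update the stack with its existing template and parameters;
    # Heat rebuilds whatever it now considers unhealthy
    openstack stack update --existing <stack-name-or-id>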
--
Mit freundlichen Grüßen / Regards
Sven Kieske
Systementwickler
Mittwald CM Service GmbH & Co. KG
https://www.mittwald.de
On Mi, 2021-08-11 at 10:16 +0000, Sven Kieske wrote:
The problem is that the Kubernetes autoscaler talks directly to the OpenStack APIs, e.g. Nova, to create and destroy instances.
Never mind, I got that wrong. The autoscaler talks to Heat, so there should be no problem (but Heat trips itself up on some error conditions). I was in fact talking about the magnum auto healer (https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/magn... ), which seems to circumvent Heat and talk directly to Nova.

Are you using the magnum auto healing feature by chance? A sketch of how it is typically enabled follows below.

HTH

--
Mit freundlichen Grüßen / Regards
Sven Kieske
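For reference, auto healing is normally switched on via Magnum labels when the cluster template is created; a sketch using the documented label names (the image, network, and flavor values here are placeholders):

    openstack coe cluster template create <template-name> \
        --coe kubernetes \
        --image <fedora-coreos-image> \
        --external-network <external-net> \
        --flavor <worker-flavor> --master-flavor <master-flavor> \
        --labels auto_healing_enabled=true,auto_healing_controller=magnum-auto-healer,auto_scaling_enabled=true,min_node_count=1,max_node_count=5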
Let me try to explain it from a design perspective:

1. Auto scaler: the cluster auto scaler now talks to the Magnum resize API directly to scale, see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/clou...

2. Auto healer: as you know, the auto scaler only cares about the worker nodes; it won't scale the master nodes. The auto healer, however, can repair both master nodes and worker nodes. For worker node repair, the Magnum auto healer uses the Magnum resize API. Because the Magnum resize API doesn't support resizing master nodes, master node repair is done via a Heat stack update instead: the Magnum auto healer marks some resources of the master node as unhealthy, then calls a Heat stack update to rebuild those resources.
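That worker-node path boils down to a POST against the Magnum cluster actions API; a hand-rolled equivalent of the call the auto scaler makes might look like this (endpoint, token, node count, and UUIDs are placeholders, not values from this thread):

    # Resize the cluster to 2 workers, removing a specific instance;
    # this is the same API the CLI "coe cluster resize" wraps
    curl -X POST "https://<magnum-endpoint>/v1/clusters/<cluster-uuid>/actions/resize" \
        -H "X-Auth-Token: $OS_TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"node_count": 2, "nodes_to_remove": ["<worker-server-uuid>"]}'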
-- Cheers & Best regards, ------------------------------------------------------------------------------ Feilong Wang (王飞龙) (he/him) Head of Research & Development Catalyst Cloud Aotearoa's own Mob: +64 21 0832 6348 | www.catalystcloud.nz Level 6, 150 Willis Street, Wellington 6011, New Zealand CONFIDENTIALITY NOTICE: This email is intended for the named recipients only. It may contain privileged, confidential or copyright information. If you are not the named recipient, any use, reliance upon, disclosure or copying of this email or its attachments is unauthorised. If you have received this email in error, please reply via email or call +64 21 0832 6348. ------------------------------------------------------------------------------
Thanks Feilong and Sven.
If so, a cluster resize should be able to bring the cluster back, and you can simply resize the cluster to its current node count. In that case, Magnum should be able to fix the Heat stack.
I thought this too. But when I try to run "check stack" under Heat, it fails; the log for this failure says a resource is missing, i.e. one of the nodes is not there (which I know about). I tried the cluster resize from Horizon, resizing the cluster to its valid/current size (without the additional node that is not there), and Horizon immediately fails this with a red error in the corner of the web page. There's no log printed in the Magnum or Heat logs at all, and the Horizon error is not really helpful: "Error: Unable to resize given cluster id: 1a8e1ed9-64b3-41b1-ab11-0f01e66da1d7.".
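When a stack check or update fails like this, the underlying reason is usually easier to recover from the Heat CLI than from Horizon; a quick sketch (stack and resource names are placeholders):

    # List which resources failed across the nested stacks, with reasons
    openstack stack failures list <stack-name-or-id>

    # Then inspect a specific failed resource in more detail
    openstack stack resource show <stack-name-or-id> <resource-name>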
Are you using the magnum auto healing feature by chance?
The "repair unhealthy nodes" option was chosen for this I believe. But I didnt set up the cluster so I am not sure. Based on your replies, I discovered how to initiate the cluster resize using the CLI. After issuing the command, the missing node was rebuilt immediately. This then appears like some sort of issue with horizon only. I wanted to get the resized cluster operating successfully before I replied, but though it re-deployed the missing node, the cluster resize went timed out and failed. Aside from a quick 30 min investigation on this I've not been able to do much more with that and it's been abandoned. Thanks all the same for your help. Tony Pearce On Thu, 12 Aug 2021 at 05:06, feilong <feilong@catalystcloud.nz> wrote: