Antelope Magnum: creating a new cluster makes some of the other clusters unhealthy
Hi,

I have been playing with Magnum (Antelope), creating a lot of clusters from the same template while varying the number of masters and/or nodes, and I have ~10 clusters running in the same project. I observed recently that cluster configurations that used to work well were no longer working (CREATE_FAILED, generally during the master deployment). I have not found any evidence of the cause while digging through various logs, but I observed that, with the current list of active clusters, if I add a new one (even with a configuration as minimal as 1 master and 1 node), not only does its creation fail but the other (or at least some) running clusters become unhealthy. If I delete the failed cluster, the other clusters return to the healthy state. It seems very reproducible (I did it more than 10 times) but I still don't see any message in the logs that could help identify the cause.

I have the feeling (but I may be wrong) that it is related to either an insufficient project quota or an insufficient limit on one of the OpenStack services. Any idea on a possible cause, or any advice on where to look for more information?

Thanks in advance.

Michel
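For the quota/limit suspicion, a few generic starting points might be the standard OpenStack CLI commands below; nothing here is specific to this deployment, and the Octavia line only applies if the clusters use load balancers:

-----
openstack quota show                           # Nova / Neutron / Cinder quotas for the current project
openstack loadbalancer quota show <project>    # Octavia quotas, if LBaaS is in use
openstack stack list                           # Heat stacks behind each Magnum cluster
-----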
An aside question: what is running the health status check, and is there a way to force it to run again?

Michel
Contrary to what I said initially, even though creating or deleting a cluster seems to trigger an update of the health state of other clusters, it doesn't seem to be the cause. I have seen the health state changing quite regularly on a test cloud with no activity, and I'm really wondering what could cause this. I don't see anything in the OpenStack config/logs to explain it. A network issue?

Michel
Hi Jouvin,

You have not indicated which driver you are using - is it the k8s_fedora_coreos_v1 driver?

If so, health checks are done in a periodic loop by the conductors. They need to be able to poll the /healthz endpoint of your Kubernetes API server. You can check with curl -k https://<API_IP>:6443/healthz, where https://<API_IP>:6443 is the server in your kubeconfig.

- Jake
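As a rough sketch of that check (the kubeconfig extraction and the ?verbose flag are additions here, not part of the driver):

-----
# Find the API server URL from the kubeconfig generated for the cluster
# (the jsonpath expression assumes a single-cluster kubeconfig).
API_URL=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

# Same endpoint the conductor polls; ?verbose lists each sub-check.
curl -k "${API_URL}/healthz?verbose"
-----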
Hi Jake,

Thanks! Yes, I am using the k8s_fedora_coreos_v1 driver. I'll check with curl as you suggested. Does it make sense that a check (or check status) could be updated when another cluster is created or deleted? I hardly believe it, but there seems to be some "correlation"...

Best regards,

Michel

Sent from my mobile
Hi,

I progressed a little bit on the flipping health status. When a cluster becomes unhealthy, the curl command returns:

-----
$ curl --insecure https://157.136.248.202:6443/healthz
[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
... (all others ok)
-----

It lasts a few minutes and then the affected clusters become healthy again. It seems to happen to several clusters at the same time. Sometimes a cluster remains healthy only for a few seconds/minutes, becomes unhealthy again, and so on... The cluster (kubectl) tends to be unresponsive when transitioning from one state to the other, but curl always responds... It looks like something becomes unresponsive for some period of time...

Any suggestion welcome!

Michel
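A crude way to timestamp the flapping, so it can later be correlated with etcd/apiserver logs, might be a loop like this (the IP is the one from the output above; the 10 s interval is arbitrary):

-----
# Poll /healthz every 10 s and log a timestamped one-line summary.
while true; do
  echo "$(date -Is) $(curl -ks --max-time 5 https://157.136.248.202:6443/healthz | tr '\n' ' ')"
  sleep 10
done
-----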
Is this a cluster with multiple control plane nodes? If so, you may want to check the etcd logs - etcd has quite low latency requirements; the etcd website will tell you more. You can also take a look at the kube-apiserver logs while things are transitioning. You can check both kube-apiserver and etcd by SSH-ing to the control plane nodes, using core@<fip>.

Regards,
Jake
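A possible starting point once logged in; the unit names below are assumptions about how the Fedora CoreOS driver runs etcd and kube-apiserver, so adjust them to whatever the first command actually shows:

-----
ssh core@<fip>

# Confirm how etcd and kube-apiserver are run on this image (systemd unit names are assumed).
systemctl list-units | grep -Ei 'etcd|kube'

# Typical etcd symptoms of disk/network latency trouble.
sudo journalctl -u etcd --since "1 hour ago" | grep -Ei 'took too long|leader|timeout|slow'

# API server view of the same periods.
sudo journalctl -u kube-apiserver --since "1 hour ago" | grep -Ei 'etcd|timeout'
-----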