[kolla-ansible][wallaby][magnum][Kubernetes] Cannot auto-scale workers
Hi,

I have a new kolla-ansible deployment with Wallaby. I created a Kubernetes cluster using Calico (Flannel didn't work for me) and configured an autoscaling test to see if it works:

- pod autoscaling is working;
- worker node autoscaling is not working.

This is my deployment file (cat php-apache.yaml):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: php-apache-deployment
    spec:
      selector:
        matchLabels:
          app: php-apache
      replicas: 2
      template:
        metadata:
          labels:
            app: php-apache
        spec:
          containers:
          - name: php-apache
            image: k8s.gcr.io/hpa-example
            ports:
            - containerPort: 80
            resources:
              limits:
                cpu: 500m
              requests:
                cpu: 200m
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: php-apache-service
      labels:
        app: php-apache
    spec:
      ports:
      - port: 80
        targetPort: 80
        protocol: TCP
      selector:
        app: php-apache
      type: LoadBalancer

This is my HPA file (cat php-apache-hpa.yaml):

    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: php-apache-hpa
      namespace: default
      labels:
        service: php-apache-service
    spec:
      minReplicas: 2
      maxReplicas: 30
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: php-apache-deployment
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 30 # as a percentage

This is my load program:

    kubectl run -i --tty load-generator-1 --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://ip_load_balancer; done"

Here is the output from my cluster before the test:

    [kube8@cdndeployer ~]$ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          24m
    php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          24m

    [kube8@cdndeployer ~]$ kubectl get hpa
    NAME             REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    php-apache-hpa   Deployment/php-apache-deployment   0%/30%    2         15        2          24m

    [kube8@cdndeployer ~]$ kubectl get svc
    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    kubernetes ClusterIP <none> 443/TCP 13h
    php-apache-service LoadBalancer xx.xx.xx.213 80:31763/TCP 25m

When I apply the load, pod autoscaling creates new pods, but some of them stay in the Pending state:

    [kube8@cdndeployer ~]$ kubectl get hpa
    NAME             REFERENCE                          TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
    php-apache-hpa   Deployment/php-apache-deployment   155%/30%   2         15        4          27m

    [kube8@cdndeployer ~]$ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    load-generator-1                         1/1     Running   0          97s
    load-generator-2                         1/1     Running   0          94s
    php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          28m
    php-apache-deployment-5b65bbc75c-cjkwk   0/1     Pending   0          33s
    php-apache-deployment-5b65bbc75c-cn5rt   0/1     Pending   0          33s
    php-apache-deployment-5b65bbc75c-cxctx   0/1     Pending   0          48s
    php-apache-deployment-5b65bbc75c-fffnc   1/1     Running   0          64s
    php-apache-deployment-5b65bbc75c-hbfw8   0/1     Pending   0          33s
    php-apache-deployment-5b65bbc75c-l8496   1/1     Running   0          48s
    php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          28m
    php-apache-deployment-5b65bbc75c-qddrb   1/1     Running   0          48s
    php-apache-deployment-5b65bbc75c-dd5r5   0/1     Pending   0          48s
    php-apache-deployment-5b65bbc75c-tr65j   1/1     Running   0          64s

The cluster is unable to schedule more pods or create more workers, and the pending pods report this error:

    kubectl describe pod php-apache-deployment-5b65bbc75c-dd5r5
    Name:           php-apache-deployment-5b65bbc75c-dd5r5
    Namespace:      default
    Priority:       0
    Node:           <none>
    Labels:         app=php-apache
                    pod-template-hash=5b65bbc75c
    Annotations:    kubernetes.io/psp: magnum.privileged
    Status:         Pending
    IP:
    IPs:            <none>
    Controlled By:  ReplicaSet/php-apache-deployment-5b65bbc75c
    Containers:
      php-apache:
        Image:      k8s.gcr.io/hpa-example
        Port:       80/TCP
        Host Port:  0/TCP
        Limits:
          cpu:  500m
        Requests:
          cpu:  200m
        Environment:  <none>
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-4fsgh (ro)
    Conditions:
      Type          Status
      PodScheduled  False
    Volumes:
      default-token-4fsgh:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  default-token-4fsgh
        Optional:    false
    QoS Class:       Burstable
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type     Reason             Age    From                Message
      ----     ------             ----   ----                -------
      Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
      Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
      Normal   NotTriggerScaleUp  2m42s  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 in backoff after failed scale-up

I have this error message from the autoscaler pod cluster-autoscaler-f4bd5f674-b9692:

    I1123 00:50:27.714801 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 12.709µs
    I1123 00:51:34.181145 1 scale_up.go:658] Scale-up: setting group default-worker size to 3
    W1123 00:51:34.381953 1 clusterstate.go:281] Disabling scale-up for node group default-worker until 2021-11-23 00:56:34.180840351 +0000 UTC m=+47174.376164120; errorClass=Other; errorCode=cloudProviderError
    E1123 00:51:34.382081 1 static_autoscaler.go:415] Failed to scale up: failed to increase node group size: could not check current nodegroup size: could not get cluster: Get https://dash.cdn.domaine.tld:9511/v1/clusters/b4a6b3eb-fcf3-416f-b740-11a083...: dial tcp: lookup dash.cdn.domaine.tld on <>: no such host
    W1123 00:51:44.392523 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
    W1123 00:51:54.410273 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
    W1123 00:52:04.422128 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
    W1123 00:52:14.434278 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
    W1123 00:52:24.442480 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
    I1123 00:52:27.715019 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache

I did some tests on the DNS pod:

    kubectl get svc -A
    NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    default kubernetes ClusterIP <none> 443/TCP 13h
    default php-apache-service LoadBalancer xx.xx.xx.213 80:31763/TCP 19m
    kube-system dashboard-metrics-scraper ClusterIP <none> 8000/TCP 13h
    kube-system kube-dns ClusterIP <none> 53/UDP,53/TCP,9153/TCP 13h
    kube-system kubernetes-dashboard ClusterIP <none> 443/TCP 13h
    kube-system magnum-metrics-server ClusterIP <none> 443/TCP 13h

I noticed this behaviour with the Horizon URL: sometimes the DNS pod resolves it and sometimes it does not!

    [root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @ <> dash.cdn.domaine.tld

    ; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @ dash.cdn.domaine.tld
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5646
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;dash.cdn.domaine.tld. IN A

    ;; AUTHORITY SECTION:
    cdn.domaine.tld. 30 IN SOA cdn.domaine.tld. root.cdn.domaine.tld. 2021100900 604800 86400 2419200 604800

    ;; Query time: 84 msec
    ;; SERVER:
    ;; WHEN: Tue Nov 23 01:08:03 UTC 2021
    ;; MSG SIZE rcvd: 12

Two seconds later:

    [root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @ <> dash.cdn.domaine.tld

    ; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @ dash.cdn.domaine.tld
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7653
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;dash.cdn.domaine.tld. IN A

    ;; ANSWER SECTION:
    dash.cdn.domaine.tld. 30 IN A xx.xx.xx.129

    ;; Query time: 2 msec
    ;; SERVER:
    ;; WHEN: Tue Nov 23 01:08:21 UTC 2021
    ;; MSG SIZE rcvd: 81

In the log of the DNS autoscaler pod I have this:

    kubectl logs kube-dns-autoscaler-75859754fd-q8z4w -n kube-system
    E1122 20:56:09.944449 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
    E1122 20:56:19.945294 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
    E1122 20:56:29.944245 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
    E1122 20:56:39.946346 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
    E1122 20:56:49.944693 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource

I don't have experience with Kubernetes yet. Could someone help me debug this?

Regards.
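
The intermittent resolution described above (NXDOMAIN on one query, an A record two seconds later) can be put into numbers by repeating the lookup and counting the answers. The following is only a minimal sketch, assuming `dig` is available on the master node. `count_lookups` and `resolves` are hypothetical helper names introduced here, and `DNS_IP` is a placeholder for the kube-dns service address, which is not shown in this post.

```shell
#!/bin/sh
# resolves SERVER NAME: succeeds only if SERVER returns at least one
# IPv4 address for NAME. dig exits 0 even on NXDOMAIN, so the answer
# text is inspected instead of the exit status.
resolves() {
  dig +short @"$1" "$2" A 2>/dev/null | grep -Eq '^[0-9]+(\.[0-9]+){3}$'
}

# count_lookups N CMD [ARGS...]: run CMD N times and print how many
# runs succeeded and how many failed, as "ok=<n> fail=<m>".
count_lookups() {
  n=$1
  shift
  ok=0
  fail=0
  i=0
  while [ "$i" -lt "$n" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  echo "ok=$ok fail=$fail"
}
```

It could be called as `count_lookups 50 resolves "$DNS_IP" dash.cdn.domaine.tld`. A non-zero `fail` count against the cluster DNS address, compared with the same run against the upstream resolver, would suggest where the failing answers come from.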
participants (1)
wodel youchi