[kolla-ansible][wallaby][magnum][Kubernetes] Cannot auto-scale workers
Hi,

I have a new kolla-ansible deployment with Wallaby. I have created a Kubernetes cluster using Calico (Flannel didn't work for me), and I configured an autoscaling test to see whether it works:

- pod autoscaling is working.
- worker node autoscaling is not working.

This is my deployment file:

cat php-apache.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache-deployment
spec:
  selector:
    matchLabels:
      app: php-apache
  replicas: 2
  template:
    metadata:
      labels:
        app: php-apache
    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache-service
  labels:
    app: php-apache
spec:
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  selector:
    app: php-apache
  type: LoadBalancer

This is my HPA file:

cat php-apache-hpa.yaml

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache-hpa
  namespace: default
  labels:
    service: php-apache-service
spec:
  minReplicas: 2
  maxReplicas: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache-deployment
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30   # percentage

This is my load program:

kubectl run -i --tty load-generator-1 --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://ip_load_balancer; done"

Here is the output of my Kubernetes cluster before the test:

[kube8@cdndeployer ~]$ kubectl get pod
NAME                                     READY   STATUS    RESTARTS   AGE
php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          24m
php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          24m

[kube8@cdndeployer ~]$ kubectl get hpa
NAME             REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
php-apache-hpa   Deployment/php-apache-deployment   0%/30%    2         15        2          24m

[kube8@cdndeployer ~]$ kubectl get svc
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
kubernetes           ClusterIP      10.254.0.1    <none>         443/TCP        13h
php-apache-service   LoadBalancer   10.254.3.54   xx.xx.xx.213   80:31763/TCP   25m

When I apply the load:

1 - The pod autoscaler creates new pods, then some of them get stuck in the Pending state:

[kube8@cdndeployer ~]$ kubectl get hpa
NAME             REFERENCE                          TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
php-apache-hpa   Deployment/php-apache-deployment   155%/30%   2         15        4          27m

[kube8@cdndeployer ~]$ kubectl get pod
NAME                                     READY   STATUS    RESTARTS   AGE
load-generator-1                         1/1     Running   0          97s
load-generator-2                         1/1     Running   0          94s
php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          28m
php-apache-deployment-5b65bbc75c-cjkwk   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-cn5rt   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-cxctx   0/1     Pending   0          48s
php-apache-deployment-5b65bbc75c-fffnc   1/1     Running   0          64s
php-apache-deployment-5b65bbc75c-hbfw8   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-l8496   1/1     Running   0          48s
php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          28m
php-apache-deployment-5b65bbc75c-qddrb   1/1     Running   0          48s
php-apache-deployment-5b65bbc75c-dd5r5   0/1     Pending   0          48s
php-apache-deployment-5b65bbc75c-tr65j   1/1     Running   0          64s
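(For context: each php-apache pod requests 200m of CPU, so once the two worker nodes run out of allocatable CPU the scheduler can no longer place new replicas; that is exactly the condition that should make the cluster autoscaler add workers. If you want to confirm the CPU pressure yourself, something like the following should show it; this is just a diagnostic sketch, and <worker-node-name> is a placeholder for one of your real worker node names:

kubectl get nodes
# Show CPU requests vs. allocatable on one worker; with several 200m
# php-apache pods scheduled, CPU requests should be near 100% here.
kubectl describe node <worker-node-name> | grep -A 8 "Allocated resources"
)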
2 - The cluster is unable to create more pods/workers, and I get this error message from the pending pods:

kubectl describe pod php-apache-deployment-5b65bbc75c-dd5r5
Name:           php-apache-deployment-5b65bbc75c-dd5r5
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=php-apache
                pod-template-hash=5b65bbc75c
Annotations:    kubernetes.io/psp: magnum.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/php-apache-deployment-5b65bbc75c
Containers:
  php-apache:
    Image:      k8s.gcr.io/hpa-example
    Port:       80/TCP
    Host Port:  0/TCP
    Limits:
      cpu:  500m
    Requests:
      cpu:  200m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4fsgh (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-4fsgh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4fsgh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  2m42s  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 in backoff after failed scale-up

I have this error message from the autoscaler pod cluster-autoscaler-f4bd5f674-b9692:

I1123 00:50:27.714801  1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 12.709µs
I1123 00:51:34.181145  1 scale_up.go:658] Scale-up: setting group default-worker size to 3
W1123 00:51:34.381953  1 clusterstate.go:281] Disabling scale-up for node group default-worker until 2021-11-23 00:56:34.180840351 +0000 UTC m=+47174.376164120; errorClass=Other; errorCode=cloudProviderError
E1123 00:51:34.382081  1 static_autoscaler.go:415] Failed to scale up: failed to increase node group size: could not check current nodegroup size: could not get cluster: Get https://dash.cdn.domaine.tld:9511/v1/clusters/b4a6b3eb-fcf3-416f-b740-11a083d4b896: dial tcp: lookup dash.cdn.domaine.tld on 10.254.0.10:53: no such host
W1123 00:51:44.392523  1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:51:54.410273  1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:04.422128  1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:14.434278  1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:24.442480  1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
I1123 00:52:27.715019  1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache

I did some tests on the DNS pod:

kubectl get svc -A
NAMESPACE     NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                  AGE
default       kubernetes                  ClusterIP      10.254.0.1       <none>         443/TCP                  13h
default       php-apache-service          LoadBalancer   10.254.3.54     xx.xx.xx.213   80:31763/TCP             19m
kube-system   dashboard-metrics-scraper   ClusterIP      10.254.19.191    <none>         8000/TCP                 13h
kube-system   kube-dns                    ClusterIP      10.254.0.10      <none>         53/UDP,53/TCP,9153/TCP   13h
kube-system   kubernetes-dashboard        ClusterIP      10.254.132.17    <none>         443/TCP                  13h
kube-system   magnum-metrics-server       ClusterIP      10.254.235.147   <none>         443/TCP                  13h
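(So the node scale-up itself fails because the cluster autoscaler cannot resolve the Magnum API endpoint, dash.cdn.domaine.tld, through the cluster DNS service at 10.254.0.10 — "dial tcp: lookup ... no such host" is a DNS failure, not a Magnum one. To reproduce the lookup from inside the cluster, independently of the autoscaler, a throwaway pod can be used — same pattern as the load generator above; just a diagnostic sketch:

# Run nslookup from a disposable busybox pod; it uses the cluster DNS
# from its own /etc/resolv.conf, which is also printed for reference.
kubectl run -i --tty dns-test --rm --image=busybox --restart=Never -- \
  /bin/sh -c "nslookup dash.cdn.domaine.tld; cat /etc/resolv.conf"

If this fails intermittently too, the problem is in the cluster DNS path rather than in the autoscaler.)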
I have noticed this behaviour with the Horizon URL: sometimes the DNS pod resolves it, sometimes it does not!

[root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @10.254.0.10 dash.cdn.domaine.tld

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @10.254.0.10 dash.cdn.domaine.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5646
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dash.cdn.domaine.tld.  IN  A

;; AUTHORITY SECTION:
cdn.domaine.tld.  30  IN  SOA  cdn.domaine.tld. root.cdn.domaine.tld. 2021100900 604800 86400 2419200 604800

;; Query time: 84 msec
;; SERVER: 10.254.0.10#53(10.254.0.10)
;; WHEN: Tue Nov 23 01:08:03 UTC 2021
;; MSG SIZE  rcvd: 122

Two seconds later:

[root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @10.254.0.10 dash.cdn.domaine.tld

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @10.254.0.10 dash.cdn.domaine.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7653
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dash.cdn.domaine.tld.  IN  A

;; ANSWER SECTION:
dash.cdn.domaine.tld.  30  IN  A  xx.xx.xx.129

;; Query time: 2 msec
;; SERVER: 10.254.0.10#53(10.254.0.10)
;; WHEN: Tue Nov 23 01:08:21 UTC 2021
;; MSG SIZE  rcvd: 81

In the logs of the DNS autoscaler pod I have this:

kubectl logs kube-dns-autoscaler-75859754fd-q8z4w -n kube-system
E1122 20:56:09.944449  1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:19.945294  1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:29.944245  1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:39.946346  1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:49.944693  1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
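(The alternating NXDOMAIN/NOERROR for the same name suggests that only some of the DNS replicas behind the kube-dns service — or one of the upstream resolvers they forward to — know this record; the service VIP 10.254.0.10 load-balances queries between replicas, so answers alternate. A way to narrow it down is to query each replica directly instead of the service IP. This is a diagnostic sketch, run from the same master node as the dig tests above; it assumes the pod IPs are routable from the node, which they are with Calico, and that the Corefile lives in a ConfigMap named coredns as in a stock CoreDNS deployment:

# List the DNS pod IPs behind the kube-dns service, then query each one
# directly; a replica that consistently returns nothing here is the one
# answering NXDOMAIN through the VIP.
for ip in $(kubectl -n kube-system get endpoints kube-dns \
    -o jsonpath='{.subsets[*].addresses[*].ip}'); do
  echo "== $ip =="
  dig +short @"$ip" dash.cdn.domaine.tld
done

# Also check which upstream resolvers the DNS pods forward to:
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'

If one upstream in the forward plugin cannot resolve the cdn.domaine.tld zone, fixing that resolver — or, as a stopgap, pinning the name ahead of the forward block with CoreDNS's hosts plugin, e.g. "hosts { xx.xx.xx.129 dash.cdn.domaine.tld fallthrough }" — should make the autoscaler's lookups deterministic. The kube-dns-autoscaler "could not find the requested resource" errors above look like a separate issue with its scale target, and should not by themselves break resolution.)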
I don't have much experience with Kubernetes yet; could someone help me debug this?

Regards,
wodel youchi