Hi,

I have a new kolla-ansible deployment on Wallaby.
I have created a Kubernetes cluster with Magnum, using Calico (Flannel didn't work for me).

I set up an autoscaling test to see if it works:
- pod autoscaling is working;
- worker node autoscaling is not working.

This is my deployment file (cat php-apache.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache-deployment
spec:
  selector:
    matchLabels:
      app: php-apache
  replicas: 2
  template:
    metadata:
      labels:
        app: php-apache
    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache-service
  labels:
    app: php-apache
spec:
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  selector:
    app: php-apache
  type: LoadBalancer


This is my HPA file (cat php-apache-hpa.yaml):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache-hpa
  namespace: default
  labels:
    service: php-apache-service
spec:
  minReplicas: 2
  maxReplicas: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache-deployment
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30 # in percent
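
Note: averageUtilization is measured against the container's CPU request, so with a request of 200m each pod hits the 30% target at about 60m of CPU; the 155%/30% reading further down corresponds to roughly 310m of average usage per pod.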


This is my load generator:

kubectl run -i --tty load-generator-1 --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://ip_load_balancer; done"


Here is the state of my Kubernetes cluster before the test:

[kube8@cdndeployer ~]$ kubectl get pod
NAME                                     READY   STATUS    RESTARTS   AGE
php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          24m
php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          24m

[kube8@cdndeployer ~]$ kubectl get hpa
NAME             REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
php-apache-hpa   Deployment/php-apache-deployment   0%/30%    2         15        2          24m

[kube8@cdndeployer ~]$ kubectl get svc
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)        AGE
kubernetes           ClusterIP      10.254.0.1    <none>           443/TCP        13h
php-apache-service   LoadBalancer   10.254.3.54   xx.xx.xx.213   80:31763/TCP   25m


When I apply the load:

1 - The pod autoscaler creates new pods, then some of them get stuck in the Pending state:

[kube8@cdndeployer ~]$ kubectl get hpa
NAME             REFERENCE                          TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
php-apache-hpa   Deployment/php-apache-deployment   155%/30%   2         15        4          27m

[kube8@cdndeployer ~]$ kubectl get pod
NAME                                     READY   STATUS    RESTARTS   AGE
load-generator-1                         1/1     Running   0          97s
load-generator-2                         1/1     Running   0          94s
php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          28m
php-apache-deployment-5b65bbc75c-cjkwk   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-cn5rt   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-cxctx   0/1     Pending   0          48s
php-apache-deployment-5b65bbc75c-fffnc   1/1     Running   0          64s
php-apache-deployment-5b65bbc75c-hbfw8   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-l8496   1/1     Running   0          48s
php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          28m
php-apache-deployment-5b65bbc75c-qddrb   1/1     Running   0          48s
php-apache-deployment-5b65bbc75c-dd5r5   0/1     Pending   0          48s
php-apache-deployment-5b65bbc75c-tr65j   1/1     Running   0          64s

2 - The cluster is unable to create more pods/workers, and I get this error message from the pending pods:

kubectl describe pod php-apache-deployment-5b65bbc75c-dd5r5
Name:           php-apache-deployment-5b65bbc75c-dd5r5
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=php-apache
                pod-template-hash=5b65bbc75c
Annotations:    kubernetes.io/psp: magnum.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/php-apache-deployment-5b65bbc75c
Containers:
  php-apache:
    Image:      k8s.gcr.io/hpa-example
    Port:       80/TCP
    Host Port:  0/TCP
    Limits:
      cpu:  500m
    Requests:
      cpu:        200m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4fsgh (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-4fsgh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4fsgh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  2m42s  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 in backoff after failed scale-up
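
To see how much CPU the two worker nodes actually have left for these requests, I am checking the allocated resources per node like this (a rough sketch; the grep just narrows the kubectl describe output, and kubectl top needs the metrics server to respond):

kubectl describe nodes | grep -A 8 "Allocated resources"
kubectl top nodes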



I have this error message from the autoscaler pod cluster-autoscaler-f4bd5f674-b9692:

I1123 00:50:27.714801       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 12.709µs
I1123 00:51:34.181145       1 scale_up.go:658] Scale-up: setting group default-worker size to 3
W1123 00:51:34.381953       1 clusterstate.go:281] Disabling scale-up for node group default-worker until 2021-11-23 00:56:34.180840351 +0000 UTC m=+47174.376164120; errorClass=Other; errorCode=cloudProviderError
E1123 00:51:34.382081       1 static_autoscaler.go:415] Failed to scale up: failed to increase node group size: could not check current nodegroup size: could not get cluster: Get https://dash.cdn.domaine.tld:9511/v1/clusters/b4a6b3eb-fcf3-416f-b740-11a083d4b896: dial tcp: lookup dash.cdn.domaine.tld on 10.254.0.10:53: no such host
W1123 00:51:44.392523       1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:51:54.410273       1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:04.422128       1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:14.434278       1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:24.442480       1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
I1123 00:52:27.715019       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
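
The lookup that fails above is for the Magnum API endpoint (dash.cdn.domaine.tld), resolved through the cluster DNS at 10.254.0.10. As a temporary workaround I was thinking of pinning that name inside the autoscaler pod with hostAliases, something like this (just a sketch; I'm assuming the Deployment is named cluster-autoscaler in kube-system, and xx.xx.xx.129 is the address returned by the dig further down):

kubectl -n kube-system patch deployment cluster-autoscaler --type merge \
  -p '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"xx.xx.xx.129","hostnames":["dash.cdn.domaine.tld"]}]}}}}'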

I did some tests on the DNS service:

kubectl get svc -A
NAMESPACE     NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                  AGE
default       kubernetes                  ClusterIP      10.254.0.1       <none>           443/TCP                  13h
default       php-apache-service          LoadBalancer   10.254.3.54      xx.xx.xx.213   80:31763/TCP             19m
kube-system   dashboard-metrics-scraper   ClusterIP      10.254.19.191    <none>           8000/TCP                 13h
kube-system   kube-dns                    ClusterIP      10.254.0.10      <none>           53/UDP,53/TCP,9153/TCP   13h
kube-system   kubernetes-dashboard        ClusterIP      10.254.132.17    <none>           443/TCP                  13h
kube-system   magnum-metrics-server       ClusterIP      10.254.235.147   <none>           443/TCP                  13h


I have noticed this behaviour with the Horizon URL: sometimes the DNS pod resolves it, sometimes it does not!

[root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @10.254.0.10 dash.cdn.domaine.tld

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @10.254.0.10 dash.cdn.domaine.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5646
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dash.cdn.domaine.tld.            IN      A

;; AUTHORITY SECTION:
cdn.domaine.tld.          30      IN      SOA     cdn.domaine.tld. root.cdn.domaine.tld. 2021100900 604800 86400 2419200 604800

;; Query time: 84 msec
;; SERVER: 10.254.0.10#53(10.254.0.10)
;; WHEN: Tue Nov 23 01:08:03 UTC 2021
;; MSG SIZE  rcvd: 12


Two seconds later:

[root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @10.254.0.10 dash.cdn.domaine.tld

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @10.254.0.10 dash.cdn.domaine.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7653
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dash.cdn.domaine.tld.            IN      A

;; ANSWER SECTION:
dash.cdn.domaine.tld.     30      IN      A       xx.xx.xx.129

;; Query time: 2 msec
;; SERVER: 10.254.0.10#53(10.254.0.10)
;; WHEN: Tue Nov 23 01:08:21 UTC 2021
;; MSG SIZE  rcvd: 81
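
To see whether only one of the DNS replicas is misbehaving, I also want to query each DNS pod directly instead of the kube-dns service VIP (sketch; assuming the usual k8s-app=kube-dns label on the DNS pods, and replacing <pod-ip> with each address from the first command):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
dig @<pod-ip> dash.cdn.domaine.tld +short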


In the logs of the kube-dns-autoscaler pod I have this:

 kubectl logs kube-dns-autoscaler-75859754fd-q8z4w -n kube-system

E1122 20:56:09.944449       1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:19.945294       1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:29.944245       1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:39.946346       1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:49.944693       1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
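
I also want to double-check which resource the kube-dns-autoscaler is trying to update, since "the server could not find the requested resource" sounds like its scale target no longer matches anything in the cluster (sketch; just dumping the container args to see the --target flag):

kubectl -n kube-system get deployment kube-dns-autoscaler -o jsonpath='{.spec.template.spec.containers[0].args}'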


I don't have experience with Kubernetes yet; could someone help me debug this?

Regards.