[kolla-ansible][wallaby][magnum][Kubernetes] Cannot auto-scale workers
wodel youchi
wodel.youchi at gmail.com
Wed Nov 24 08:25:01 UTC 2021
Hi,
I have a new kolla-ansible deployment with Wallaby.
I created a Kubernetes cluster using Calico (Flannel didn't work for me).
I set up an autoscaling test to see if it works:
- pod autoscaling is working.
- worker node autoscaling is not working.
This is my deployment file: cat php-apache.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache-deployment
spec:
  selector:
    matchLabels:
      app: php-apache
  replicas: 2
  template:
    metadata:
      labels:
        app: php-apache
    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache-service
  labels:
    app: php-apache
spec:
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  selector:
    app: php-apache
  type: LoadBalancer
This is my HPA file: cat php-apache-hpa.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache-hpa
  namespace: default
  labels:
    service: php-apache-service
spec:
  minReplicas: 2
  maxReplicas: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache-deployment
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30 # percentage
This is my load program:

kubectl run -i --tty load-generator-1 --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://ip_load_balancer; done"
Here is the output from my Kubernetes cluster before the test:
[kube8@cdndeployer ~]$ kubectl get pod
NAME                                     READY   STATUS    RESTARTS   AGE
php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          24m
php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          24m

[kube8@cdndeployer ~]$ kubectl get hpa
NAME             REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
php-apache-hpa   Deployment/php-apache-deployment   0%/30%    2         15        2          24m

[kube8@cdndeployer ~]$ kubectl get svc
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
kubernetes           ClusterIP      10.254.0.1    <none>         443/TCP        13h
php-apache-service   LoadBalancer   10.254.3.54   xx.xx.xx.213   80:31763/TCP   25m
When I apply the load:
1 - Pod autoscaling creates new pods, but some of them end up stuck in the Pending state:
[kube8@cdndeployer ~]$ kubectl get hpa
NAME             REFERENCE                          TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
php-apache-hpa   Deployment/php-apache-deployment   155%/30%   2         15        4          27m

[kube8@cdndeployer ~]$ kubectl get pod
NAME                                     READY   STATUS    RESTARTS   AGE
load-generator-1                         1/1     Running   0          97s
load-generator-2                         1/1     Running   0          94s
php-apache-deployment-5b65bbc75c-95k6k   1/1     Running   0          28m
php-apache-deployment-5b65bbc75c-cjkwk   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-cn5rt   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-cxctx   0/1     Pending   0          48s
php-apache-deployment-5b65bbc75c-fffnc   1/1     Running   0          64s
php-apache-deployment-5b65bbc75c-hbfw8   0/1     Pending   0          33s
php-apache-deployment-5b65bbc75c-l8496   1/1     Running   0          48s
php-apache-deployment-5b65bbc75c-mv5h6   1/1     Running   0          28m
php-apache-deployment-5b65bbc75c-qddrb   1/1     Running   0          48s
php-apache-deployment-5b65bbc75c-dd5r5   0/1     Pending   0          48s
php-apache-deployment-5b65bbc75c-tr65j   1/1     Running   0          64s
2 - The cluster is unable to create more pods/workers, and I get this error message from the pending pods:
kubectl describe pod php-apache-deployment-5b65bbc75c-dd5r5
Name:           php-apache-deployment-5b65bbc75c-dd5r5
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=php-apache
                pod-template-hash=5b65bbc75c
Annotations:    kubernetes.io/psp: magnum.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/php-apache-deployment-5b65bbc75c
Containers:
  php-apache:
    Image:      k8s.gcr.io/hpa-example
    Port:       80/TCP
    Host Port:  0/TCP
    Limits:
      cpu:  500m
    Requests:
      cpu:  200m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4fsgh (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-4fsgh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4fsgh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling   2m48s  default-scheduler   0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  2m42s  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 in backoff after failed scale-up
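Since the scheduler reports "Insufficient cpu" on the two worker nodes, my assumption is that they are simply full and the autoscaler should be adding a node. A quick sanity check I run with plain kubectl (nothing Magnum-specific):

# Show how much CPU is already requested/limited on each node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Confirm that only the masters carry the master taint
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints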
I have this error message from the autoscaler pod cluster-autoscaler-f4bd5f674-b9692:
I1123 00:50:27.714801 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 12.709µs
I1123 00:51:34.181145 1 scale_up.go:658] Scale-up: setting group default-worker size to 3
W1123 00:51:34.381953 1 clusterstate.go:281] Disabling scale-up for node group default-worker until 2021-11-23 00:56:34.180840351 +0000 UTC m=+47174.376164120; errorClass=Other; errorCode=cloudProviderError
E1123 00:51:34.382081 1 static_autoscaler.go:415] Failed to scale up: failed to increase node group size: could not check current nodegroup size: could not get cluster: Get https://dash.cdn.domaine.tld:9511/v1/clusters/b4a6b3eb-fcf3-416f-b740-11a083d4b896: dial tcp: lookup dash.cdn.domaine.tld on 10.254.0.10:53: no such host
W1123 00:51:44.392523 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:51:54.410273 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:04.422128 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:14.434278 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
W1123 00:52:24.442480 1 scale_up.go:383] Node group default-worker is not ready for scaleup - backoff
I1123 00:52:27.715019 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
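To me this looks like the real problem: the autoscaler cannot resolve the Magnum API endpoint dash.cdn.domaine.tld through the cluster DNS (10.254.0.10). To confirm, I compare resolution from inside the cluster with resolution from the master node itself (a rough check, assuming busybox's nslookup is good enough here):

# Resolve through the in-cluster DNS service
kubectl run -i --tty dns-test --rm --image=busybox --restart=Never -- nslookup dash.cdn.domaine.tld

# Resolve through the node's own resolver, for comparison
nslookup dash.cdn.domaine.tld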
I did some tests on the DNS pod:
kubectl get svc -A
NAMESPACE     NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                  AGE
default       kubernetes                  ClusterIP      10.254.0.1       <none>         443/TCP                  13h
default       php-apache-service          LoadBalancer   10.254.3.54      xx.xx.xx.213   80:31763/TCP             19m
kube-system   dashboard-metrics-scraper   ClusterIP      10.254.19.191    <none>         8000/TCP                 13h
kube-system   kube-dns                    ClusterIP      10.254.0.10      <none>         53/UDP,53/TCP,9153/TCP   13h
kube-system   kubernetes-dashboard        ClusterIP      10.254.132.17    <none>         443/TCP                  13h
kube-system   magnum-metrics-server       ClusterIP      10.254.235.147   <none>         443/TCP                  13h
I have noticed this behaviour with the Horizon URL: sometimes the DNS pod resolves it and sometimes it does not:
[root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @10.254.0.10 dash.cdn.domaine.tld

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @10.254.0.10 dash.cdn.domaine.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5646
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dash.cdn.domaine.tld.          IN      A

;; AUTHORITY SECTION:
cdn.domaine.tld.        30      IN      SOA     cdn.domaine.tld. root.cdn.domaine.tld. 2021100900 604800 86400 2419200 604800

;; Query time: 84 msec
;; SERVER: 10.254.0.10#53(10.254.0.10)
;; WHEN: Tue Nov 23 01:08:03 UTC 2021
;; MSG SIZE  rcvd: 12
Two seconds later:
[root@k8multiclustercalico-ve5t6uuoo245-master-0 ~]# dig @10.254.0.10 dash.cdn.domaine.tld

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @10.254.0.10 dash.cdn.domaine.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7653
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dash.cdn.domaine.tld.          IN      A

;; ANSWER SECTION:
dash.cdn.domaine.tld.   30      IN      A       xx.xx.xx.129

;; Query time: 2 msec
;; SERVER: 10.254.0.10#53(10.254.0.10)
;; WHEN: Tue Nov 23 01:08:21 UTC 2021
;; MSG SIZE  rcvd: 81
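Since the same query alternates between NXDOMAIN and NOERROR, my guess is that only one of the CoreDNS replicas behind the kube-dns service answers the external zone correctly. A sketch of how I plan to narrow it down (the label selector and ConfigMap name are the usual CoreDNS defaults, which I'm assuming Magnum also uses):

# List the individual CoreDNS pods and their IPs
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Query each pod IP directly instead of the service VIP
dig @<coredns_pod_ip> dash.cdn.domaine.tld

# Check how external names are forwarded upstream
kubectl -n kube-system get configmap coredns -o yaml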
In the log of the kube-dns-autoscaler pod I have this:
kubectl logs kube-dns-autoscaler-75859754fd-q8z4w -n kube-system
E1122 20:56:09.944449 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:19.945294 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:29.944245 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:39.946346 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
E1122 20:56:49.944693 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
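I'm not sure whether this kube-dns-autoscaler error is related to the resolution problem; to rule it out, I would look at what resource the autoscaler is configured to watch (my assumption is that a wrong or outdated --target would explain "could not find the requested resource"; the deployment name is inferred from the pod name):

# Show the autoscaler's flags, in particular --target=...
kubectl -n kube-system get deployment kube-dns-autoscaler -o yaml | grep -B 2 -A 10 "command:"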
I don't have much experience with Kubernetes yet; could someone help me debug this?
Regards.