[magnum][kolla] etcd wal sync duration issue

Feilong Wang feilong at catalyst.net.nz
Mon Jan 13 21:38:56 UTC 2020


Hi Eric,

That issue looks familiar to me. There are a few questions I'd like to
check before answering whether you should upgrade to Train.

1. Are you using the default etcd version, v3.2.7?

2. Did you try to reproduce this with devstack, using the Fedora CoreOS
driver? With that driver, etcd can be 3.2.26.

I ask because I saw the same error when I used Fedora Atomic with etcd
v3.2.7, and I can't reproduce it with Fedora CoreOS + etcd 3.2.26.
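
For question 1, etcd prints its version when it starts, so a rough check from the journal on master-0 could look like this. It's only a sketch: the helper names are made up, it assumes the startup line contains "etcd Version: X.Y.Z" and that the journal still holds it, and the comparison relies on GNU "sort -V".

```shell
#!/bin/sh
# Hypothetical helper: print the etcd version from its startup banner.
# Assumes etcd logs a line containing "etcd Version: X.Y.Z" when it starts
# and that the log text is piped in on stdin.
etcd_logged_version() {
  grep -o 'etcd Version: [0-9][0-9.]*' | tail -n 1 | sed 's/^etcd Version: //'
}

# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on "sort -V" (GNU coreutils version sort).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage on master-0 (not executed here):
#   v=$(journalctl -u etcd --no-pager | etcd_logged_version)
#   version_at_least "$v" 3.2.26 || echo "etcd $v is older than 3.2.26"
```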



On 12/01/20 6:44 AM, Eric K. Miller wrote:
>
> Hi,
>
> We are using the following coe cluster template and cluster create
> commands on an OpenStack Stein installation running the Magnum 8.2.0
> Kolla containers, deployed by Kolla-Ansible 8.0.1:
>
> openstack coe cluster template create \
>   --image Fedora-AtomicHost-29-20191126.0.x86_64_raw \
>   --keypair userkey \
>   --external-network ext-net \
>   --dns-nameserver 1.1.1.1 \
>   --master-flavor c5sd.4xlarge \
>   --flavor m5sd.4xlarge \
>   --coe kubernetes \
>   --network-driver flannel \
>   --volume-driver cinder \
>   --docker-storage-driver overlay2 \
>   --docker-volume-size 100 \
>   --registry-enabled \
>   --master-lb-enabled \
>   --floating-ip-disabled \
>   --fixed-network KubernetesProjectNetwork001 \
>   --fixed-subnet KubernetesProjectSubnet001 \
>   --labels kube_tag=v1.15.7,cloud_provider_tag=v1.15.0,heat_container_agent_tag=stein-dev,master_lb_floating_ip_enabled=true \
>   k8s-cluster-template-1.15.7-production-private
>
> openstack coe cluster create \
>   --cluster-template k8s-cluster-template-1.15.7-production-private \
>   --keypair userkey \
>   --master-count 3 \
>   --node-count 3 \
>   k8s-cluster001
>
> The deployment works perfectly; however, the cluster health status
> flips between healthy and unhealthy. The unhealthy status indicates
> that etcd has an issue.
>
> When logged into master-0 (one of the 3 masters configured above),
> "systemctl status etcd" shows etcd's stdout, which includes:
>
> Jan 11 17:27:36 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:27:36.548453 W | etcdserver: timed out waiting for read index response
> Jan 11 17:28:02 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:28:02.960977 W | wal: sync duration of 1.696804699s, expected less than 1s
> Jan 11 17:28:31 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:28:31.292753 W | wal: sync duration of 2.249722223s, expected less than 1s
>
> We also see:
>
> Jan 11 17:40:39 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:40:39.132459 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = DeadlineExceeded desc = "context deadline exceeded"
>
> We initially used relatively small flavors, but increased them to
> very large ones to be sure resources were not constrained in any
> way. In both cases, "top" reported no CPU or memory contention on any
> node.
>
> Multiple clusters have been deployed, and they all have this issue,
> including empty clusters that were just deployed.
>
> I see a very large number of reports of similar etcd issues, but the
> discussions point to disk performance, which can't be the cause here:
> not only is persistent storage for etcd not configured in Magnum, but
> the disks in this environment are also "very" fast. "vmstat -D" run
> inside master-0 shows minimal writes, and Ceilometer records only
> about 15 to 20 write IOPS for this VM in Gnocchi.
>
> Any ideas?
>
> We are finalizing procedures to upgrade to Train, so we wanted to be
> sure that we weren't running into a common issue with Stein that
> would immediately be solved by Train. If so, we will simply proceed
> with the upgrade and avoid diagnosing this issue further.
>
> Thanks!
>
> Eric
>
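
Since the question is whether these warnings go away with a newer etcd, it may help to compare how often they show up on each setup. A rough sketch that reads journal text on stdin and relies only on the "wal: sync duration of ...s" wording in your log excerpt; the function name is made up:

```shell
#!/bin/sh
# Hypothetical helper: summarise etcd "wal: sync duration" warnings read on
# stdin, printing how many were seen and the longest duration (in seconds).
summarise_wal_syncs() {
  grep -o 'wal: sync duration of [0-9][0-9.]*s' \
    | sed 's/^wal: sync duration of \([0-9.]*\)s$/\1/' \
    | awk '{ n++; if (n == 1 || $1 + 0 > max + 0) max = $1 }
           END { printf "slow_syncs=%d max=%ss\n", n, (n ? max : 0) }'
}

# Example usage on each master (not executed here):
#   journalctl -u etcd --no-pager --since "1 hour ago" | summarise_wal_syncs
```

Running it over the same time window on the Fedora Atomic + etcd v3.2.7 masters and on a Fedora CoreOS + etcd 3.2.26 cluster gives a like-for-like count.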
-- 
Cheers & Best regards,
Feilong Wang (王飞龙)
Head of R&D
Catalyst Cloud - Cloud Native New Zealand
--------------------------------------------------------------------------
Tel: +64-48032246
Email: flwang at catalyst.net.nz
Level 6, Catalyst House, 150 Willis Street, Wellington
-------------------------------------------------------------------------- 



More information about the openstack-discuss mailing list