[magnum][kolla] etcd wal sync duration issue
Hi,

We are using the following coe cluster template and cluster create commands on an OpenStack Stein installation that installs Magnum 8.2.0 Kolla containers installed by Kolla-Ansible 8.0.1:

openstack coe cluster template create \
--image Fedora-AtomicHost-29-20191126.0.x86_64_raw \
--keypair userkey \
--external-network ext-net \
--dns-nameserver 1.1.1.1 \
--master-flavor c5sd.4xlarge \
--flavor m5sd.4xlarge \
--coe kubernetes \
--network-driver flannel \
--volume-driver cinder \
--docker-storage-driver overlay2 \
--docker-volume-size 100 \
--registry-enabled \
--master-lb-enabled \
--floating-ip-disabled \
--fixed-network KubernetesProjectNetwork001 \
--fixed-subnet KubernetesProjectSubnet001 \
--labels kube_tag=v1.15.7,cloud_provider_tag=v1.15.0,heat_container_agent_tag=stein-dev,master_lb_floating_ip_enabled=true \
k8s-cluster-template-1.15.7-production-private

openstack coe cluster create \
--cluster-template k8s-cluster-template-1.15.7-production-private \
--keypair userkey \
--master-count 3 \
--node-count 3 \
k8s-cluster001

The deploy process works perfectly; however, the cluster health status flips between healthy and unhealthy. The unhealthy status indicates that etcd has an issue.

When logged into master-0 (out of 3, as configured above), "systemctl status etcd" shows the stdout from etcd, which includes:

Jan 11 17:27:36 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:27:36.548453 W | etcdserver: timed out waiting for read index response
Jan 11 17:28:02 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:28:02.960977 W | wal: sync duration of 1.696804699s, expected less than 1s
Jan 11 17:28:31 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:28:31.292753 W | wal: sync duration of 2.249722223s, expected less than 1s

We also see:

Jan 11 17:40:39 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:40:39.132459 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = DeadlineExceeded desc = "context deadline exceeded"

We initially used relatively small flavors, but increased these to something very large to be sure resources were not constrained in any way. "top" reported no CPU or memory contention on any nodes in either case.

Multiple clusters have been deployed, and they all have this issue, including empty clusters that were just deployed.

I see a very large number of reports of similar issues with etcd, but the discussions point to disk performance, which can't be the cause here: not only is persistent storage for etcd not configured in Magnum, but the disks are also "very" fast in this environment. Looking at "vmstat -D" from within master-0, the number of writes is minimal. Ceilometer logs about 15 to 20 write IOPS for this VM in Gnocchi.

Any ideas?

We are finalizing procedures to upgrade to Train, so we wanted to be sure that we weren't running into some common issue with Stein that would immediately be solved with Train. If so, we will simply proceed with the upgrade and avoid diagnosing this issue further.

Thanks!

Eric
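For anyone comparing notes on the disk-latency angle above: a common way to check whether fsync latency on the etcd data disk is the problem is a small fio run that issues fdatasync after every write. This is a generic sketch rather than anything Magnum-specific, and the directory, size, and block size below are only placeholders:

fio --name=etcd-fsync-test --rw=write --ioengine=sync --fdatasync=1 \
--directory=/var/lib/etcd --size=22m --bs=2300

The fsync/fdatasync latency percentiles in the output are the relevant numbers; etcd generally wants the 99th percentile well below the ~1s WAL sync threshold it warns about, ideally in the single-digit milliseconds.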
Hi Eric,

That issue looks familiar to me. There are a few questions I'd like to ask before answering whether you should upgrade to Train:

1. Are you using the default v3.2.7 version of etcd?
2. Did you try to reproduce this with DevStack, using the Fedora CoreOS driver? There the etcd version would be 3.2.26.

I asked the above questions because I saw the same error when I used Fedora Atomic with etcd v3.2.7, and I cannot reproduce it with Fedora CoreOS + etcd 3.2.26.

On 12/01/20 6:44 AM, Eric K. Miller wrote:
Hi,
We are using the following coe cluster template and cluster create commands on an OpenStack Stein installation that installs Magnum 8.2.0 Kolla containers installed by Kolla-Ansible 8.0.1:
openstack coe cluster template create \
--image Fedora-AtomicHost-29-20191126.0.x86_64_raw \
--keypair userkey \
--external-network ext-net \
--dns-nameserver 1.1.1.1 \
--master-flavor c5sd.4xlarge \
--flavor m5sd.4xlarge \
--coe kubernetes \
--network-driver flannel \
--volume-driver cinder \
--docker-storage-driver overlay2 \
--docker-volume-size 100 \
--registry-enabled \
--master-lb-enabled \
--floating-ip-disabled \
--fixed-network KubernetesProjectNetwork001 \
--fixed-subnet KubernetesProjectSubnet001 \
--labels kube_tag=v1.15.7,cloud_provider_tag=v1.15.0,heat_container_agent_tag=stein-dev,master_lb_floating_ip_enabled=true \
k8s-cluster-template-1.15.7-production-private
openstack coe cluster create \
--cluster-template k8s-cluster-template-1.15.7-production-private \
--keypair userkey \
--master-count 3 \
--node-count 3 \
k8s-cluster001
The deploy process works perfectly; however, the cluster health status flips between healthy and unhealthy. The unhealthy status indicates that etcd has an issue.
When logged into master-0 (out of 3, as configured above), "systemctl status etcd" shows the stdout from etcd, which includes:
Jan 11 17:27:36 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:27:36.548453 W | etcdserver: timed out waiting for read index response
Jan 11 17:28:02 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:28:02.960977 W | wal: sync duration of 1.696804699s, expected less than 1s
Jan 11 17:28:31 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:28:31.292753 W | wal: sync duration of 2.249722223s, expected less than 1s
We also see:
Jan 11 17:40:39 k8s-cluster001-4effrc2irvjq-master-0.novalocal runc[2725]: 2020-01-11 17:40:39.132459 I | etcdserver/api/v3rpc: grpc: Server.processUnaryRPC failed to write status: stream error: code = DeadlineExceeded desc = "context deadline exceeded"
We initially used relatively small flavors, but increased these to something very large to be sure resources were not constrained in any way. "top" reported no CPU or memory contention on any nodes in either case.
Multiple clusters have been deployed, and they all have this issue, including empty clusters that were just deployed.
I see a very large number of reports of similar issues with etcd, but the discussions point to disk performance, which can't be the cause here: not only is persistent storage for etcd not configured in Magnum, but the disks are also "very" fast in this environment. Looking at "vmstat -D" from within master-0, the number of writes is minimal. Ceilometer logs about 15 to 20 write IOPS for this VM in Gnocchi.
Any ideas?
We are finalizing procedures to upgrade to Train, so we wanted to be sure that we weren't running into some common issue with Stein that would immediately be solved with Train. If so, we will simply proceed with the upgrade and avoid diagnosing this issue further.
Thanks!
Eric
-- Cheers & Best regards, Feilong Wang (王飞龙) Head of R&D Catalyst Cloud - Cloud Native New Zealand -------------------------------------------------------------------------- Tel: +64-48032246 Email: flwang@catalyst.net.nz Level 6, Catalyst House, 150 Willis Street, Wellington --------------------------------------------------------------------------
Hi Feilong,

Thanks for responding! I am, indeed, using the default v3.2.7 version for etcd, which is the only available image.

I did not try to reproduce with any other driver (we have never used DevStack, honestly, only Kolla-Ansible deployments). I did see a number of people indicating similar issues with etcd versions in the 3.3.x range, so I didn't think of it being an etcd issue, but then again most issues seem to be a result of people using HDDs and not SSDs, which makes sense.

Interesting that you saw the same issue, though. We haven't tried Fedora CoreOS, but I think we would need Train for this.

Everything I read about etcd indicates that it is extremely latency sensitive, due to the fact that it replicates all changes to all nodes and sends an fsync to Linux each time, so data is always guaranteed to be stored. I can see this becoming an issue quickly without super-low-latency network and storage. We are using Ceph-based SSD volumes for the Kubernetes Master node disks, which is extremely fast (likely 10x or better than anything people recommend for etcd), but network latency is always going to be higher with VMs on OpenStack with DVR than bare metal with VLANs due to all of the abstractions.

Do you know who maintains the etcd images for Magnum here? Is there an easy way to create a newer image? https://hub.docker.com/r/openstackmagnum/etcd/tags/

Eric

From: Feilong Wang [mailto:feilong@catalyst.net.nz]
Sent: Monday, January 13, 2020 3:39 PM
To: openstack-discuss@lists.openstack.org
Subject: Re: [magnum][kolla] etcd wal sync duration issue

Hi Eric, That issue looks familiar to me. There are a few questions I'd like to ask before answering whether you should upgrade to Train: 1. Are you using the default v3.2.7 version of etcd? 2. Did you try to reproduce this with DevStack, using the Fedora CoreOS driver? There the etcd version would be 3.2.26. I asked the above questions because I saw the same error when I used Fedora Atomic with etcd v3.2.7, and I cannot reproduce it with Fedora CoreOS + etcd 3.2.26.
Just to clarify: this etcd is not provided by Kolla nor installed by Kolla-Ansible.

-yoctozepto

On Mon, 13 Jan 2020 at 22:54, Eric K. Miller <emiller@genesishosting.com> wrote:
Hi Feilong,
Thanks for responding! I am, indeed, using the default v3.2.7 version for etcd, which is the only available image.
I did not try to reproduce with any other driver (we have never used DevStack, honestly, only Kolla-Ansible deployments). I did see a number of people indicating similar issues with etcd versions in the 3.3.x range, so I didn't think of it being an etcd issue, but then again most issues seem to be a result of people using HDDs and not SSDs, which makes sense.
Interesting that you saw the same issue, though. We haven't tried Fedora CoreOS, but I think we would need Train for this.
Everything I read about etcd indicates that it is extremely latency sensitive, due to the fact that it replicates all changes to all nodes and sends an fsync to Linux each time, so data is always guaranteed to be stored. I can see this becoming an issue quickly without super-low-latency network and storage. We are using Ceph-based SSD volumes for the Kubernetes Master node disks, which is extremely fast (likely 10x or better than anything people recommend for etcd), but network latency is always going to be higher with VMs on OpenStack with DVR than bare metal with VLANs due to all of the abstractions.
Do you know who maintains the etcd images for Magnum here? Is there an easy way to create a newer image? https://hub.docker.com/r/openstackmagnum/etcd/tags/
Eric
From: Feilong Wang [mailto:feilong@catalyst.net.nz] Sent: Monday, January 13, 2020 3:39 PM To: openstack-discuss@lists.openstack.org Subject: Re: [magnum][kolla] etcd wal sync duration issue
Hi Eric, That issue looks familiar to me. There are a few questions I'd like to ask before answering whether you should upgrade to Train: 1. Are you using the default v3.2.7 version of etcd? 2. Did you try to reproduce this with DevStack, using the Fedora CoreOS driver? There the etcd version would be 3.2.26. I asked the above questions because I saw the same error when I used Fedora Atomic with etcd v3.2.7, and I cannot reproduce it with Fedora CoreOS + etcd 3.2.26.
Hi Eric,

If you're using SSD, then I think the IO performance should be OK. You can use https://github.com/etcd-io/etcd/tree/master/tools/benchmark to verify and confirm whether that's the root cause (a rough example invocation is sketched at the end of this message, below the quoted text). Meanwhile, you can review the config of the etcd cluster deployed by Magnum. I'm not an expert on etcd, so TBH I can't see anything wrong with the config. Most of them are just default configurations.

As for the etcd image, it's built from https://github.com/projectatomic/atomic-system-containers/tree/master/etcd or you can refer to CERN's repo https://gitlab.cern.ch/cloud/atomic-system-containers/blob/cern-qa/etcd/

*Spyros*, any comments?

On 14/01/20 10:52 AM, Eric K. Miller wrote:
Hi Feilong,
Thanks for responding! I am, indeed, using the default v3.2.7 version for etcd, which is the only available image.
I did not try to reproduce with any other driver (we have never used DevStack, honestly, only Kolla-Ansible deployments). I did see a number of people indicating similar issues with etcd versions in the 3.3.x range, so I didn't think of it being an etcd issue, but then again most issues seem to be a result of people using HDDs and not SSDs, which makes sense.
Interesting that you saw the same issue, though. We haven't tried Fedora CoreOS, but I think we would need Train for this.
Everything I read about etcd indicates that it is extremely latency sensitive, due to the fact that it replicates all changes to all nodes and sends an fsync to Linux each time, so data is always guaranteed to be stored. I can see this becoming an issue quickly without super-low-latency network and storage. We are using Ceph-based SSD volumes for the Kubernetes Master node disks, which is extremely fast (likely 10x or better than anything people recommend for etcd), but network latency is always going to be higher with VMs on OpenStack with DVR than bare metal with VLANs due to all of the abstractions.
Do you know who maintains the etcd images for Magnum here? Is there an easy way to create a newer image? https://hub.docker.com/r/openstackmagnum/etcd/tags/
Eric
From: Feilong Wang [mailto:feilong@catalyst.net.nz] Sent: Monday, January 13, 2020 3:39 PM To: openstack-discuss@lists.openstack.org Subject: Re: [magnum][kolla] etcd wal sync duration issue
Hi Eric, That issue looks familiar to me. There are a few questions I'd like to ask before answering whether you should upgrade to Train: 1. Are you using the default v3.2.7 version of etcd? 2. Did you try to reproduce this with DevStack, using the Fedora CoreOS driver? There the etcd version would be 3.2.26. I asked the above questions because I saw the same error when I used Fedora Atomic with etcd v3.2.7, and I cannot reproduce it with Fedora CoreOS + etcd 3.2.26.
-- Cheers & Best regards, Feilong Wang (王飞龙) ------------------------------------------------------ Senior Cloud Software Engineer Tel: +64-48032246 Email: flwang@catalyst.net.nz Catalyst IT Limited Level 6, Catalyst House, 150 Willis Street, Wellington ------------------------------------------------------
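For reference, a rough sketch of what running that benchmark tool against a Magnum-deployed etcd could look like. The endpoint, connection counts, and sizes below are placeholders rather than values from this cluster, and the TLS flags for the cluster's client certificate/key/CA also need to be passed (their exact names depend on the etcd version, so they are omitted here):

# benchmark is built from the etcd source tree (tools/benchmark)
benchmark --endpoints=https://<master-0-ip>:2379 --conns=10 --clients=10 \
put --key-size=8 --val-size=256 --total=10000

The reported write latency distribution can then be compared against the ~1s WAL sync threshold etcd warns about.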
Hi Feilong,

Before I was able to use the benchmark tool you mentioned, we saw some other slowdowns with Ceph (all flash). It appears that something must have crashed somewhere since we had to restart a couple things, after which etcd has been performing fine and no more health issues being reported by Magnum.

So, it looks like it wasn't etcd related after all.

However, while researching, I found that etcd's fsync on every write (so it guarantees a write cache flush for each write) apparently creates some havoc with some SSDs, where the SSD performs a full cache flush of multiple caches. This article explains it a LOT better: https://yourcmc.ru/wiki/Ceph_performance (scroll to the "Drive cache is slowing you down" section)

It seems that the optimal configuration for etcd would be to use local drives in each node and be sure that the write cache is disabled in the SSDs - as opposed to using Ceph volumes, which already adds network latency, but can create even more latency for synchronizations due to Ceph's replication. (A quick sketch of checking and disabling the drive write cache is included at the end of this message, below the quoted text.)

Eric

From: feilong [mailto:feilong@catalyst.net.nz]
Sent: Wednesday, January 15, 2020 2:36 PM
To: Eric K. Miller; openstack-discuss@lists.openstack.org
Cc: Spyros Trigazis
Subject: Re: [magnum][kolla] etcd wal sync duration issue

Hi Eric,

If you're using SSD, then I think the IO performance should be OK. You can use https://github.com/etcd-io/etcd/tree/master/tools/benchmark to verify and confirm whether that's the root cause. Meanwhile, you can review the config of the etcd cluster deployed by Magnum. I'm not an expert on etcd, so TBH I can't see anything wrong with the config. Most of them are just default configurations.

As for the etcd image, it's built from https://github.com/projectatomic/atomic-system-containers/tree/master/etcd or you can refer to CERN's repo https://gitlab.cern.ch/cloud/atomic-system-containers/blob/cern-qa/etcd/

Spyros, any comments?

On 14/01/20 10:52 AM, Eric K. Miller wrote:

Hi Feilong,

Thanks for responding! I am, indeed, using the default v3.2.7 version for etcd, which is the only available image.

I did not try to reproduce with any other driver (we have never used DevStack, honestly, only Kolla-Ansible deployments). I did see a number of people indicating similar issues with etcd versions in the 3.3.x range, so I didn't think of it being an etcd issue, but then again most issues seem to be a result of people using HDDs and not SSDs, which makes sense.

Interesting that you saw the same issue, though. We haven't tried Fedora CoreOS, but I think we would need Train for this.

Everything I read about etcd indicates that it is extremely latency sensitive, due to the fact that it replicates all changes to all nodes and sends an fsync to Linux each time, so data is always guaranteed to be stored. I can see this becoming an issue quickly without super-low-latency network and storage. We are using Ceph-based SSD volumes for the Kubernetes Master node disks, which is extremely fast (likely 10x or better than anything people recommend for etcd), but network latency is always going to be higher with VMs on OpenStack with DVR than bare metal with VLANs due to all of the abstractions.

Do you know who maintains the etcd images for Magnum here? Is there an easy way to create a newer image? https://hub.docker.com/r/openstackmagnum/etcd/tags/

Eric

From: Feilong Wang [mailto:feilong@catalyst.net.nz]
Sent: Monday, January 13, 2020 3:39 PM
To: openstack-discuss@lists.openstack.org
Subject: Re: [magnum][kolla] etcd wal sync duration issue

Hi Eric, That issue looks familiar to me. There are a few questions I'd like to ask before answering whether you should upgrade to Train: 1. Are you using the default v3.2.7 version of etcd? 2. Did you try to reproduce this with DevStack, using the Fedora CoreOS driver? There the etcd version would be 3.2.26. I asked the above questions because I saw the same error when I used Fedora Atomic with etcd v3.2.7, and I cannot reproduce it with Fedora CoreOS + etcd 3.2.26.

-- Cheers & Best regards, Feilong Wang (王飞龙) ------------------------------------------------------ Senior Cloud Software Engineer Tel: +64-48032246 Email: flwang@catalyst.net.nz Catalyst IT Limited Level 6, Catalyst House, 150 Willis Street, Wellington ------------------------------------------------------
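As a concrete version of the write-cache point above (a sketch only, with placeholder device names): this applies to the physical SATA drives backing etcd or the Ceph OSDs, not to the VM's virtual disk, and SAS drives typically need sdparm or a vendor tool instead:

# check whether the drive's volatile write cache is enabled
hdparm -W /dev/sdX

# disable the volatile write cache (usually not persistent across reboots,
# so it needs to be reapplied via udev, rc.local, or similar)
hdparm -W 0 /dev/sdX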
Hi Eric,

Thanks for sharing the article. As for the etcd volumes, you can disable them by simply not setting the etcd_volume_size label. Just FYI. (A quick sketch of the label usage is included at the end of this message, below the quoted text.)

On 17/01/20 6:00 AM, Eric K. Miller wrote:
Hi Feilong,
Before I was able to use the benchmark tool you mentioned, we saw some other slowdowns with Ceph (all flash). It appears that something must have crashed somewhere since we had to restart a couple things, after which etcd has been performing fine and no more health issues being reported by Magnum.
So, it looks like it wasn't etcd related after all.
However, while researching, I found that etcd's fsync on every write (so it guarantees a write cache flush for each write) apparently creates some havoc with some SSDs, where the SSD performs a full cache flush of multiple caches. This article explains it a LOT better: https://yourcmc.ru/wiki/Ceph_performance (scroll to the "Drive cache is slowing you down" section)
It seems that the optimal configuration for etcd would be to use local drives in each node and be sure that the write cache is disabled in the SSDs - as opposed to using Ceph volumes, which already adds network latency, but can create even more latency for synchronizations due to Ceph's replication.
Eric
From: feilong [mailto:feilong@catalyst.net.nz] Sent: Wednesday, January 15, 2020 2:36 PM To: Eric K. Miller; openstack-discuss@lists.openstack.org Cc: Spyros Trigazis Subject: Re: [magnum][kolla] etcd wal sync duration issue
Hi Eric,
If you're using SSD, then I think the IO performance should be OK. You can use https://github.com/etcd-io/etcd/tree/master/tools/benchmark to verify and confirm whether that's the root cause. Meanwhile, you can review the config of the etcd cluster deployed by Magnum. I'm not an expert on etcd, so TBH I can't see anything wrong with the config. Most of them are just default configurations.
As for the etcd image, it's built from https://github.com/projectatomic/atomic-system-containers/tree/master/etcd or you can refer to CERN's repo https://gitlab.cern.ch/cloud/atomic-system-containers/blob/cern-qa/etcd/
*Spyros*, any comments?
On 14/01/20 10:52 AM, Eric K. Miller wrote:
Hi Feilong,
Thanks for responding! I am, indeed, using the default v3.2.7 version for etcd, which is the only available image.
I did not try to reproduce with any other driver (we have never used DevStack, honestly, only Kolla-Ansible deployments). I did see a number of people indicating similar issues with etcd versions in the 3.3.x range, so I didn't think of it being an etcd issue, but then again most issues seem to be a result of people using HDDs and not SSDs, which makes sense.
Interesting that you saw the same issue, though. We haven't tried Fedora CoreOS, but I think we would need Train for this.
Everything I read about etcd indicates that it is extremely latency sensitive, due to the fact that it replicates all changes to all nodes and sends an fsync to Linux each time, so data is always guaranteed to be stored. I can see this becoming an issue quickly without super-low-latency network and storage. We are using Ceph-based SSD volumes for the Kubernetes Master node disks, which is extremely fast (likely 10x or better than anything people recommend for etcd), but network latency is always going to be higher with VMs on OpenStack with DVR than bare metal with VLANs due to all of the abstractions.
Do you know who maintains the etcd images for Magnum here? Is there an easy way to create a newer image?
https://hub.docker.com/r/openstackmagnum/etcd/tags/
Eric
From: Feilong Wang [mailto:feilong@catalyst.net.nz]
Sent: Monday, January 13, 2020 3:39 PM
To: openstack-discuss@lists.openstack.org
Subject: Re: [magnum][kolla] etcd wal sync duration issue
Hi Eric,
That issue looks familiar to me. There are a few questions I'd like to ask before answering whether you should upgrade to Train:
1. Are you using the default v3.2.7 version of etcd?
2. Did you try to reproduce this with DevStack, using the Fedora CoreOS driver? There the etcd version would be 3.2.26.
I asked the above questions because I saw the same error when I used Fedora Atomic with etcd v3.2.7, and I cannot reproduce it with Fedora CoreOS + etcd 3.2.26.
-- Cheers & Best regards, Feilong Wang (王飞龙) ------------------------------------------------------ Senior Cloud Software Engineer Tel: +64-48032246 Email: flwang@catalyst.net.nz Catalyst IT Limited Level 6, Catalyst House, 150 Willis Street, Wellington ------------------------------------------------------
-- Cheers & Best regards, Feilong Wang (王飞龙) Head of R&D Catalyst Cloud - Cloud Native New Zealand -------------------------------------------------------------------------- Tel: +64-48032246 Email: flwang@catalyst.net.nz Level 6, Catalyst House, 150 Willis Street, Wellington --------------------------------------------------------------------------
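To illustrate the label mentioned above: the size value here is only a placeholder, and the other labels are the ones already used in this thread's template.

# etcd data on a dedicated Cinder volume (which, in a setup like Eric's, adds Ceph/network latency):
openstack coe cluster template create \
... \
--labels kube_tag=v1.15.7,cloud_provider_tag=v1.15.0,etcd_volume_size=20 \
...

# etcd data kept on the master's local root disk: simply omit etcd_volume_size from --labels.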
participants (4)
- Eric K. Miller
- feilong
- Feilong Wang
- Radosław Piliszek