On Thu, 2018-12-13 at 07:45 -0600, Ben Nemec wrote:
On 12/13/18 6:39 AM, Michał Dulko wrote:
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as the backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting in the logs like [1]. This later leads to various K8s errors [2], [3], and eventually to missed notifications from the API, which causes failures in the Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
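A quick way to sanity-check the disk-latency theory independently of etcd is to time small write+fdatasync calls on the same filesystem the etcd WAL lives on. Below is a minimal sketch of that idea; the data-directory path is an assumption about the DevStack layout, so adjust it to wherever etcd actually keeps its data on the gate VM.

    import os
    import statistics
    import time

    # Assumption: DevStack's usual data dir; point this at the real etcd
    # data directory on the gate VM if it differs.
    DATA_DIR = "/opt/stack/data/etcd"
    SAMPLES = 200
    PAYLOAD = b"\0" * 8192  # roughly the size of a small WAL append

    def fsync_latencies(path, samples=SAMPLES):
        """Time write+fdatasync pairs, the operation behind etcd's warnings."""
        probe = os.path.join(path, ".fsync-probe")
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT, 0o600)
        latencies = []
        try:
            for _ in range(samples):
                start = time.monotonic()
                os.write(fd, PAYLOAD)
                os.fdatasync(fd)
                latencies.append(time.monotonic() - start)
        finally:
            os.close(fd)
            os.unlink(probe)
        return sorted(latencies)

    if __name__ == "__main__":
        lat = fsync_latencies(DATA_DIR)
        p99 = lat[int(len(lat) * 0.99) - 1]
        print("avg %.1f ms, p99 %.1f ms"
              % (statistics.mean(lat) * 1000, p99 * 1000))
        # etcd's hardware guidance wants p99 WAL fsync latency well under
        # ~10 ms; values in the hundreds of ms would match the warnings in [1].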
This seems to be mostly happening on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about the hosts running the gate VMs.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
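For reference, here is a rough sketch of the kind of IO-priority bump that patch is aiming at, done from the outside with psutil rather than whatever mechanism the patch itself uses (that detail, and the use of psutil at all, are assumptions made purely for illustration; it also needs to run as root or as the user owning the etcd process):

    import psutil  # third-party package, assumed available just for this sketch

    def boost_etcd_io_priority():
        """Give any running etcd process the best-effort IO class, top priority."""
        for proc in psutil.process_iter(["name"]):
            if proc.info["name"] == "etcd":
                # IOPRIO_CLASS_RT would be stronger, but it needs CAP_SYS_ADMIN
                # and can starve other IO on the gate VM.
                proc.ionice(psutil.IOPRIO_CLASS_BE, value=0)
                print("re-prioritized etcd pid %d" % proc.pid)

    if __name__ == "__main__":
        boost_etcd_io_priority()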
Any ideas of what I can try next? I think we're the only project that puts this much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
Do you by any chance use grpcio to talk to etcd? If so, it's possible you are hitting https://bugs.launchpad.net/python-tooz/+bug/1808046
In tooz that presents as random timeouts and everything taking a lot longer than it should.
Seems like it's something else. We don't call etcd from Python through any library; in our gates it's only Kubernetes talking to etcd.
Thanks,
Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[4] https://review.openstack.org/#/c/624730/