On Thu, 2018-12-13 at 07:45 -0600, Ben Nemec wrote:
On 12/13/18 6:39 AM, Michał Dulko wrote:
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as the backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting in the logs like [1]. This later leads to various K8s errors [2], [3], and eventually to missed notifications from the API, which causes failures in the Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
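A quick way to sanity-check the disk-latency theory independently of etcd is to time small write+fdatasync calls on the same filesystem the etcd WAL lives on. Below is a minimal sketch of that idea; the data-directory path is an assumption about the DevStack layout, so adjust it to wherever etcd actually keeps its data on the gate VM.

    import os
    import statistics
    import time

    # Assumption: DevStack's usual data dir; point this at the real etcd
    # data directory on the gate VM if it differs.
    DATA_DIR = "/opt/stack/data/etcd"
    SAMPLES = 200
    PAYLOAD = b"\0" * 8192  # roughly the size of a small WAL append

    def fsync_latencies(path, samples=SAMPLES):
        """Time write+fdatasync pairs, the operation behind etcd's warnings."""
        probe = os.path.join(path, ".fsync-probe")
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT, 0o600)
        latencies = []
        try:
            for _ in range(samples):
                start = time.monotonic()
                os.write(fd, PAYLOAD)
                os.fdatasync(fd)
                latencies.append(time.monotonic() - start)
        finally:
            os.close(fd)
            os.unlink(probe)
        return sorted(latencies)

    if __name__ == "__main__":
        lat = fsync_latencies(DATA_DIR)
        p99 = lat[int(len(lat) * 0.99) - 1]
        print("avg %.1f ms, p99 %.1f ms"
              % (statistics.mean(lat) * 1000, p99 * 1000))
        # etcd's hardware guidance wants p99 WAL fsync latency well under
        # ~10 ms; values in the hundreds of ms would match the warnings in [1].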
This seems to be mostly happening on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about the hosts running the gate VMs.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
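For reference, here is a rough sketch of the kind of IO-priority bump that patch is aiming at, done from the outside with psutil rather than whatever mechanism the patch itself uses (that detail, and the use of psutil at all, are assumptions made purely for illustration; it also needs to run as root or as the user owning the etcd process):

    import psutil  # third-party package, assumed available just for this sketch

    def boost_etcd_io_priority():
        """Give any running etcd process the best-effort IO class, top priority."""
        for proc in psutil.process_iter(["name"]):
            if proc.info["name"] == "etcd":
                # IOPRIO_CLASS_RT would be stronger, but it needs CAP_SYS_ADMIN
                # and can starve other IO on the gate VM.
                proc.ionice(psutil.IOPRIO_CLASS_BE, value=0)
                print("re-prioritized etcd pid %d" % proc.pid)

    if __name__ == "__main__":
        boost_etcd_io_priority()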
Any ideas of what I can try next? I think we're the only project that puts this much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
Do you by any chance use grpcio to talk to etcd? If so, it's possible you are hitting https://bugs.launchpad.net/python-tooz/+bug/1808046
In tooz that presents as random timeouts and everything taking a lot longer than it should.
Seems like it's something else. We don't call etcd from Python through any library; in our gates it's only Kubernetes talking to etcd.
Thanks,
Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[4] https://review.openstack.org/#/c/624730/