On 12/13/18 6:39 AM, Michał Dulko wrote:
> Hi,
>
> In Kuryr-Kubernetes we're using the DevStack-installed etcd as a
> backend store for Kubernetes that we run on our gates. For some time we
> can see its degraded performance manifesting like this [1] in the logs.
> Later on this leads to various K8s errors [2], [3], up to missing
> notifications from the API, which causes failures in Kuryr-Kubernetes
> tempest tests. From what I've seen those etcd warnings normally mean
> that disk latency is high.
>
> This seems to be mostly happening on OVH and RAX hosts. I've looked at
> this with OVH folks and there isn't anything immediately alarming about
> their hosts running gate VM's.
>
> Upgrading the etcd version doesn't seem to help, as well as patch [4]
> which increases IO priority for etcd process.
>
> Any ideas of what I can try next? I think we're the only project that
> puts so much pressure on the DevStack's etcd. Help would really be
> welcomed, getting rid of this issue will greatly increase our gates
> stability.

Do you by any chance use grpcio to talk to etcd? If so, it's possible
you are hitting https://bugs.launchpad.net/python-tooz/+bug/1808046

In tooz that presents as random timeouts and everything taking a lot
longer than it should.

>
> Thanks,
> Michał
>
> [1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-etcd.txt.gz#_Dec_12_17_19_33_618619
> [2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-api.txt.gz#_Dec_12_17_20_19_772688
> [3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-scheduler.txt.gz#_Dec_12_17_18_59_045347
> [4] https://review.openstack.org/#/c/624730/
>
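
Separately, if you want to rule disk latency in or out on a given host, a rough way is to time fsync calls on the filesystem etcd writes to. The following is only a minimal sketch, not anything taken from etcd or DevStack: the function names, the sample count, and the use of the system temp directory are all my own assumptions, and you would want to point it at whatever directory etcd actually uses for its data on the gate VM.

```python
import os
import statistics
import tempfile
import time


def measure_fsync_latency(directory, samples=50, payload_size=4096):
    """Return a list of per-write fsync latencies, in seconds.

    `directory` is an assumption: pass the directory that lives on the
    same filesystem as the etcd data you care about.
    """
    payload = b"\0" * payload_size
    latencies = []
    # The scratch file is created inside `directory` so the writes hit
    # the filesystem under test; it is deleted when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        fd = scratch.fileno()
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write down to stable storage
            latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce raw latencies to median, approximate p99 and max, in ms."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median_ms": statistics.median(ordered) * 1000,
        "p99_ms": ordered[p99_index] * 1000,
        "max_ms": ordered[-1] * 1000,
    }


if __name__ == "__main__":
    # Placeholder location; replace with the directory etcd writes to.
    print(summarize(measure_fsync_latency(tempfile.gettempdir())))
```

Running this on an OVH or RAX node and on a node from a provider where the job passes should at least show whether the sync times differ noticeably. It only exercises small sequential writes from a single process, so treat the numbers as a hint and not as a benchmark.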