[dev] [infra] [devstack] [qa] [kuryr] DevStack's etcd performance on gate VM's
Clark Boylan
cboylan at sapwetik.org
Thu Dec 13 17:06:17 UTC 2018
On Thu, Dec 13, 2018, at 4:39 AM, Michał Dulko wrote:
> Hi,
>
> In Kuryr-Kubernetes we're using the DevStack-installed etcd as a
> backend store for Kubernetes that we run on our gates. For some time we've
> been seeing degraded performance that manifests like this [1] in the logs.
> Later on this leads to various K8s errors [2], [3], up to missing
> notifications from the API, which causes failures in Kuryr-Kubernetes
> tempest tests. From what I've seen those etcd warnings normally mean
> that disk latency is high.
>
> This seems to be mostly happening on OVH and RAX hosts. I've looked at
> this with OVH folks and there isn't anything immediately alarming about
> their hosts running gate VM's.
That's interesting, because we've been working with amorin at OVH on debugging similar IO problems, and I think we both agree something is happening. We've disabled the BHS1 region, as the vast majority of related failures were there, but kept GRA1 up and running, which is where your example is from. My understanding is that a memory issue of some sort was found on the compute hypervisors (which could affect disk throughput if there isn't memory available for caching or if swap is using up the available disk IO). We are currently waiting on amorin's go-ahead to turn BHS1 back on once this is corrected.
>
> Neither upgrading the etcd version nor patch [4], which increases the IO
> priority of the etcd process, seems to help.
>
> Any ideas of what I can try next? I think we're the only project that
> puts so much pressure on DevStack's etcd. Help would really be
> welcome; getting rid of this issue would greatly increase our gate
> stability.
It wouldn't surprise me if others aren't using etcd much. One thing that may help is to use the dstat data [5] from these failed jobs to rule out resource contention from within the job (CPU, IO(PS), memory, etc.). One thing we've found while debugging these slower nodes is that they often expose real bugs in our software by making them cost more. We should double-check there isn't anything obvious like that happening here too.
I've been loading the CSV file into https://lamada.eu/dstat-graph/, which renders it for human consumption, but there are other tools out there for this too.
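If a quick scripted check is preferable to a graph, something like the following rough sketch could flag samples with high I/O wait in the dstat CSV. It assumes the file has a column-header row containing a `wai` column (as dstat's CPU group normally does); the function name `find_high_iowait` and the 20% threshold are made up for illustration and would need adjusting to the actual log.

```python
import csv
import io


def find_high_iowait(csv_text, threshold=20.0, column="wai"):
    """Return (sample_index, value) pairs where `column` exceeds `threshold`.

    Assumes a dstat-style CSV: some preamble rows, then a row of column
    names that includes `column`, then one numeric row per sample.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))

    # The column-header row is taken to be the first row naming the column.
    header_idx = next((i for i, row in enumerate(rows) if column in row), None)
    if header_idx is None:
        raise ValueError(f"column {column!r} not found in CSV")
    col = rows[header_idx].index(column)

    hits = []
    for n, row in enumerate(rows[header_idx + 1:]):
        if len(row) <= col:
            continue  # short or blank row
        try:
            value = float(row[col])
        except ValueError:
            continue  # non-numeric cell, e.g. a repeated header
        if value > threshold:
            hits.append((n, value))
    return hits


if __name__ == "__main__":
    # Hypothetical local copy of the dstat CSV log from a failed job.
    with open("dstat-csv_log.txt", newline="") as f:
        for index, wait in find_high_iowait(f.read()):
            print(f"sample {index}: iowait {wait}%")
```

Sustained high I/O wait that lines up with the etcd warnings would point toward disk contention; if it stays low, the slowness is more likely coming from somewhere else.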
>
> Thanks,
> Michał
>
> [1]
> http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-etcd.txt.gz#_Dec_12_17_19_33_618619
> [2]
> http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-api.txt.gz#_Dec_12_17_20_19_772688
> [3]
> http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-scheduler.txt.gz#_Dec_12_17_18_59_045347
> [4] https://review.openstack.org/#/c/624730/
>
>
[5] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/dstat-csv_log.txt