[dev] [infra] [devstack] [qa] [kuryr] DevStack's etcd performance on gate VMs
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as a backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting like this [1] in the logs. Later on this leads to various K8s errors [2], [3], up to missing notifications from the API, which causes failures in Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
This seems to be happening mostly on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about their hosts running the gate VMs.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
Any ideas of what I can try next? I think we're the only project that puts so much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
Thanks, Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [4] https://review.openstack.org/#/c/624730/
On 12/13/18 6:39 AM, Michał Dulko wrote:
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as a backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting like this [1] in the logs. Later on this leads to various K8s errors [2], [3], up to missing notifications from the API, which causes failures in Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
This seems to be happening mostly on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about their hosts running the gate VMs.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
Any ideas of what I can try next? I think we're the only project that puts so much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
Do you by any chance use grpcio to talk to etcd? If so, it's possible you are hitting https://bugs.launchpad.net/python-tooz/+bug/1808046 In tooz that presents as random timeouts and everything taking a lot longer than it should.
Thanks, Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [4] https://review.openstack.org/#/c/624730/
On Thu, 2018-12-13 at 07:45 -0600, Ben Nemec wrote:
On 12/13/18 6:39 AM, Michał Dulko wrote:
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as a backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting like this [1] in the logs. Later on this leads to various K8s errors [2], [3], up to missing notifications from the API, which causes failures in Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
This seems to be happening mostly on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about their hosts running the gate VMs.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
Any ideas of what I can try next? I think we're the only project that puts so much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
Do you by any chance use grpcio to talk to etcd? If so, it's possible you are hitting https://bugs.launchpad.net/python-tooz/+bug/1808046
In tooz that presents as random timeouts and everything taking a lot longer than it should.
Seems like it's something else. We don't call etcd from Python using any library; it's only Kubernetes that talks to etcd in our gates.
Thanks, Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [4] https://review.openstack.org/#/c/624730/
On Thu, Dec 13, 2018, at 4:39 AM, Michał Dulko wrote:
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as a backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting like this [1] in the logs. Later on this leads to various K8s errors [2], [3], up to missing notifications from the API, which causes failures in Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
This seems to be happening mostly on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about their hosts running the gate VMs.
That's interesting because we've been working with amorin at OVH on debugging similar IO problems and I think we both agree something is happening. We've disabled the BHS1 region as the vast majority of related failures were there, but kept GRA1 up and running, which is where your example is from. My understanding is that a memory issue of some sort was found on the compute hypervisors (which could affect disk throughput if there isn't memory available for caching or if swap is using up available disk IO). We are currently waiting on amorin's go-ahead to turn BHS1 back on after this is corrected.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
Any ideas of what I can try next? I think we're the only project that puts so much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
It wouldn't surprise me if others aren't using etcd much. One thing that may help is to use the dstat data [5] from these failed jobs to rule out resource contention from within the job (cpu, io(ps), memory, etc). One thing we've found debugging these slower nodes is that it often exposes real bugs in our software by making them cost more. We should double check there isn't anything obvious like that happening here too. I've been putting the csv file in https://lamada.eu/dstat-graph/ and that renders it for human consumption. But there are other tools out there for this too.
Thanks, Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [4] https://review.openstack.org/#/c/624730/
[5] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
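[Editor's note: not part of the thread. As a companion to the suggestion above about using the dstat data [5] to rule out in-job resource contention, here is a minimal sketch of scanning such a CSV for IO-wait spikes. The "wai" column name, the 30% threshold and the header-detection heuristic are assumptions about this particular dstat invocation, not something stated in the thread.]

#!/usr/bin/env python3
# Minimal sketch: scan a dstat CSV for IO-wait spikes to help rule out
# resource contention inside the job itself.
import csv
import sys


def load_dstat(path):
    """Return the data rows of a dstat CSV, skipping dstat's preamble."""
    with open(path, newline="") as f:
        lines = f.read().splitlines()
    # dstat prints a few banner lines before the row of column names;
    # we look for the CPU columns ("usr", "idl") to find that row (heuristic).
    for i, line in enumerate(lines):
        if "usr" in line and "idl" in line:
            header = next(csv.reader([line]))
            return list(csv.DictReader(lines[i + 1:], fieldnames=header))
    raise ValueError("could not find the dstat header row")


def report_iowait_spikes(path, threshold=30.0):
    """Print every sample whose total CPU IO-wait exceeds the threshold."""
    for n, row in enumerate(load_dstat(path)):
        try:
            iowait = float(row.get("wai") or 0)  # "wai" is an assumed column
        except ValueError:
            continue
        if iowait > threshold:
            print("sample %d: iowait %.0f%% > %.0f%%" % (n, iowait, threshold))


if __name__ == "__main__":
    report_iowait_spikes(sys.argv[1])

[If the IO-wait column stays flat while etcd still reports slow applies, that would point back at the hypervisor's disk rather than at contention inside the job.]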
On Thu, 2018-12-13 at 09:06 -0800, Clark Boylan wrote:
On Thu, Dec 13, 2018, at 4:39 AM, Michał Dulko wrote:
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as a backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting like this [1] in the logs. Later on this leads to various K8s errors [2], [3], up to missing notifications from the API, which causes failures in Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
This seems to be happening mostly on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about their hosts running the gate VMs.
That's interesting because we've been working with amorin at OVH on debugging similar IO problems and I think we both agree something is happening. We've disabled the BHS1 region as the vast majority of related failures were there, but kept GRA1 up and running, which is where your example is from. My understanding is that a memory issue of some sort was found on the compute hypervisors (which could affect disk throughput if there isn't memory available for caching or if swap is using up available disk IO). We are currently waiting on amorin's go-ahead to turn BHS1 back on after this is corrected.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
Any ideas of what I can try next? I think we're the only project that puts so much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
It wouldn't surprise me if others aren't using etcd much. One thing that may help is to use the dstat data [5] from these failed jobs to rule out resource contention from within the job (cpu, io(ps), memory, etc). One thing we've found debugging these slower nodes is that it often exposes real bugs in our software by making them cost more. We should double check there isn't anything obvious like that happening here too.
There are multiple points where we're experiencing this, but mostly in interaction with the K8s API, so of course a software bug in there is possible. We've seen it earlier, but it seems the failures became more frequent when we updated the K8s version.
I've been putting the csv file in https://lamada.eu/dstat-graph/ and that renders it for human consumption. But there are other tools out there for this too.
I've rendered it as well, but besides finding a bug in dstat [6] that made it show kuryr-daemon using 2 TB of RAM, I didn't notice anything special, especially in the IO operations. In one of the most recent failures there's only a spike in processes and paging [7] around the time we see the fatal failure. Any ideas on how I should proceed? From the number of rechecks we're doing you can easily see that this is hitting us hard.
Thanks, Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [4] https://review.openstack.org/#/c/624730/
[5] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[6] https://github.com/dagwieers/dstat/pull/162 [7] https://i.imgur.com/B8HQM8t.png
It's been a while, but I finally have some more info. The issue still persists. We've tracked the etcd problem down to high fsync duration on GRA1 hosts by looking at the etcd_disk_wal_fsync_duration_seconds* metrics. It seems that on other clouds we rarely see an fsync take longer than 1 second, but on GRA1 ~1.5% of all fsyncs take longer than that. I'm still gathering more info and working with the OVH folks to find the root cause.
Meanwhile I've tested a simple idea: putting the etcd data directory on a RAM disk. Unsurprisingly this seems to help, so I'll be preparing the patch [1] to be mergeable and could use some reviews once that's done.
[1] https://review.openstack.org/#/c/626885/
On Thu, 2018-12-20 at 16:18 +0100, Michał Dulko wrote:
On Thu, 2018-12-13 at 09:06 -0800, Clark Boylan wrote:
On Thu, Dec 13, 2018, at 4:39 AM, Michał Dulko wrote:
Hi,
In Kuryr-Kubernetes we're using the DevStack-installed etcd as a backend store for the Kubernetes cluster we run on our gates. For some time we've been seeing degraded performance, manifesting like this [1] in the logs. Later on this leads to various K8s errors [2], [3], up to missing notifications from the API, which causes failures in Kuryr-Kubernetes tempest tests. From what I've seen, those etcd warnings normally mean that disk latency is high.
This seems to be happening mostly on OVH and RAX hosts. I've looked at this with the OVH folks and there isn't anything immediately alarming about their hosts running the gate VMs.
That's interesting because we've been working with amorin at OVH on debugging similar IO problems and I think we both agree something is happening. We've disabled the BHS1 region as the vast majority of related failures were there, but kept GRA1 up and running, which is where your example is from. My understanding is that a memory issue of some sort was found on the compute hypervisors (which could affect disk throughput if there isn't memory available for caching or if swap is using up available disk IO). We are currently waiting on amorin's go-ahead to turn BHS1 back on after this is corrected.
Upgrading the etcd version doesn't seem to help, and neither does patch [4], which increases the IO priority of the etcd process.
Any ideas of what I can try next? I think we're the only project that puts so much pressure on DevStack's etcd. Help would really be welcome; getting rid of this issue would greatly improve our gate stability.
It wouldn't surprise me if others aren't using etcd much. One thing that may help is to use the dstat data [5] from these failed jobs to rule out resource contention from within the job (cpu, io(ps), memory, etc). One thing we've found debugging these slower nodes is that it often exposes real bugs in our software by making them cost more. We should double check there isn't anything obvious like that happening here too.
There are multiple points where we're experiencing this, but mostly in interaction with the K8s API, so of course a software bug in there is possible. We've seen it earlier, but it seems the failures became more frequent when we updated the K8s version.
I've been putting the csv file in https://lamada.eu/dstat-graph/ and that renders it for human consumption. But there are other tools out there for this too.
I've rendered it as well, but besides finding a bug in dstat [6] that made it show kuryr-daemon using 2 TB of RAM, I didn't notice anything special, especially in the IO operations. In one of the most recent failures there's only a spike in processes and paging [7] around the time we see the fatal failure.
Any ideas on how I should proceed? From the number of rechecks we're doing you can easily see that this is hitting us hard.
Thanks, Michał
[1] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [2] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [3] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-... [4] https://review.openstack.org/#/c/624730/
[5] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-...
[6] https://github.com/dagwieers/dstat/pull/162 [7] https://i.imgur.com/B8HQM8t.png
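[Editor's note: not part of the thread. Since the final diagnosis above leans on the etcd_disk_wal_fsync_duration_seconds histogram, here is a minimal sketch of estimating the share of slow WAL fsyncs from etcd's Prometheus /metrics endpoint. The metrics URL is an assumption (etcd serves /metrics on its client port by default, but the DevStack deployment may differ), and the available bucket boundaries depend on the etcd version.]

#!/usr/bin/env python3
# Minimal sketch: estimate the fraction of etcd WAL fsyncs slower than ~1s
# from the cumulative etcd_disk_wal_fsync_duration_seconds histogram.
import re
import urllib.request

METRICS_URL = "http://localhost:2379/metrics"  # assumption, adjust as needed


def slow_fsync_fraction(threshold=1.0):
    """Fraction of WAL fsyncs slower than roughly `threshold` seconds."""
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    buckets = {}
    total = None
    for line in text.splitlines():
        m = re.match(r'etcd_disk_wal_fsync_duration_seconds_bucket'
                     r'\{le="([^"]+)"\} (\S+)', line)
        if m and m.group(1) != "+Inf":
            buckets[float(m.group(1))] = float(m.group(2))
            continue
        m = re.match(r'etcd_disk_wal_fsync_duration_seconds_count (\S+)', line)
        if m:
            total = float(m.group(1))
    if not total or not buckets:
        raise RuntimeError("fsync histogram not found in the metrics output")
    # The histogram is cumulative; use the smallest bucket boundary at or
    # above the threshold as an approximation of "slower than ~1s".
    boundary = min(le for le in buckets if le >= threshold)
    return (total - buckets[boundary]) / total


if __name__ == "__main__":
    print("fsyncs slower than ~1s: %.2f%%" % (slow_fsync_fraction() * 100))

[Comparing the output on a GRA1 node with a node from another provider would let you repeat the ~1.5% comparison quoted above, within the granularity of the histogram buckets.]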