[dev] [infra] [devstack] [qa] [kuryr] DevStack's etcd performance on gate VMs

Michał Dulko mdulko at redhat.com
Fri Jan 18 12:43:01 UTC 2019


It's been a while, but I finally have some more info. The issue still
persists. We've tracked the etcd problem down to high fsync durations on
GRA1 hosts by looking at the etcd_disk_wal_fsync_duration_seconds*
metrics. It seems that on other clouds we rarely get fsync durations
higher than 1 second, but on GRA1 ~1.5% of all fsyncs take longer than
that.
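
For reference, the check is roughly along these lines (just a sketch; it
assumes etcd's metrics endpoint is reachable at
http://localhost:2379/metrics, which may differ from the actual listen
URL in a given DevStack run, and uses the le="1.024" histogram bucket as
the closest boundary to one second):

# Sketch: estimate the fraction of etcd WAL fsyncs slower than ~1 second
# from the Prometheus-style histogram etcd exposes on /metrics.
# The URL below is an assumption; adjust it to the actual listen address.
import re
import urllib.request

METRICS_URL = "http://localhost:2379/metrics"

def slow_fsync_fraction(url=METRICS_URL, boundary="1.024"):
    text = urllib.request.urlopen(url).read().decode()
    # Cumulative count of fsyncs at or below the chosen bucket boundary.
    under = float(re.search(
        r'etcd_disk_wal_fsync_duration_seconds_bucket\{le="%s"\} (\S+)'
        % re.escape(boundary), text).group(1))
    # Total number of fsyncs observed.
    total = float(re.search(
        r'etcd_disk_wal_fsync_duration_seconds_count (\S+)', text).group(1))
    return 1.0 - under / total if total else 0.0

print("fsyncs slower than ~1s: %.2f%%" % (100 * slow_fsync_fraction()))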

I'm still gathering more info and working with OVH folks to find the
root cause.

Meanwhile I've tested a simple idea: putting the etcd data directory on
a RAM disk. Unsurprisingly, this seems to help, so I'll be getting the
patch [1] into mergeable shape and could use some reviews once that's
done.

[1] https://review.openstack.org/#/c/626885/
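
To give an idea of what the approach boils down to (this is only a
conceptual sketch, not the patch itself; the /opt/stack/data/etcd path
and the tmpfs size are assumptions), it's essentially mounting a tmpfs
over the etcd data directory before etcd starts:

# Conceptual sketch only: put the etcd data directory on a RAM-backed
# filesystem so WAL fsyncs never touch the slow disk.
# The path and size below are assumptions, not verified DevStack settings.
import subprocess

ETCD_DATA_DIR = "/opt/stack/data/etcd"  # assumed DevStack data location
RAMDISK_SIZE = "512m"                   # assumed to be plenty for a gate run

def mount_etcd_ramdisk(data_dir=ETCD_DATA_DIR, size=RAMDISK_SIZE):
    # Make sure the mount point exists, then shadow it with a tmpfs.
    subprocess.run(["sudo", "mkdir", "-p", data_dir], check=True)
    subprocess.run(
        ["sudo", "mount", "-t", "tmpfs", "-o", "size=" + size,
         "tmpfs", data_dir],
        check=True)

if __name__ == "__main__":
    mount_etcd_ramdisk()

Losing the data on reboot doesn't matter for throwaway gate VMs, which
is why a RAM disk is acceptable there.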

On Thu, 2018-12-20 at 16:18 +0100, Michał Dulko wrote:
> On Thu, 2018-12-13 at 09:06 -0800, Clark Boylan wrote:
> > On Thu, Dec 13, 2018, at 4:39 AM, Michał Dulko wrote:
> > > Hi,
> > > 
> > > In Kuryr-Kubernetes we're using the DevStack-installed etcd as the
> > > backend store for the Kubernetes cluster we run on our gates. For
> > > some time we've been seeing degraded performance from it,
> > > manifesting like this [1] in the logs. Later on this leads to
> > > various K8s errors [2], [3], up to missing notifications from the
> > > API, which causes failures in Kuryr-Kubernetes tempest tests. From
> > > what I've seen, those etcd warnings normally mean that disk latency
> > > is high.
> > > 
> > > This seems to be mostly happening on OVH and RAX hosts. I've looked at
> > > this with OVH folks and there isn't anything immediately alarming about
> > > their hosts running gate VMs.
> > 
> > That's interesting because we've been working with amorin at OVH on
> > debugging similar IO problems, and I think we both agree something is
> > happening. We've disabled the BHS1 region as the vast majority of
> > related failures were there, but kept GRA1 up and running, which is
> > where your example is from. My understanding is that a memory issue
> > of some sort was found on the compute hypervisors (which could affect
> > disk throughput if there isn't memory available for caching or if
> > swap is using up the available disk IO). We are currently waiting on
> > amorin's go-ahead to turn BHS1 back on after this is corrected.
> > 
> > > Upgrading the etcd version doesn't seem to help, and neither does
> > > patch [4], which increases the IO priority of the etcd process.
> > > 
> > > Any ideas on what I can try next? I think we're the only project that
> > > puts so much pressure on DevStack's etcd. Help would really be
> > > welcome; getting rid of this issue would greatly increase our gate
> > > stability.
> > 
> > It wouldn't surprise me if others aren't using etcd much. One thing
> > that may help is to use the dstat data [5] from these failed jobs to
> > rule out resource contention from within the job (cpu, io(ps),
> > memory, etc.). Another thing we've found while debugging these slower
> > nodes is that they often expose real bugs in our software by making
> > them cost more. We should double-check there isn't anything obvious
> > like that happening here too.
> 
> There are multiple places where we're experiencing this, but mostly in
> interactions with the K8s API, so of course a software bug in there is
> possible. We've seen it before, but it seems the failures became more
> frequent when we updated the K8s version.
> 
> > I've been putting the csv file in https://lamada.eu/dstat-graph/ and
> > that renders it for human consumption. But there are other tools out
> > there for this too.
> 
> I've rendered it as well, but besides finding a bug in dstat [6] that
> made it show kuryr-daemon using 2 TB of RAM, I didn't notice anything
> special, especially in IO operations. In one of the most recent
> failures there's only a spike in processes and paging [7] around the
> time we see the fatal failure.
> 
> Any ideas how I should proceed? From the number of rechecks we're doing
> you can easily see that this is hitting us hard.
> 
> > > Thanks,
> > > Michał
> > > 
> > > [1] 
> > > http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-etcd.txt.gz#_Dec_12_17_19_33_618619
> > > [2] 
> > > http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-api.txt.gz#_Dec_12_17_20_19_772688
> > > [3] 
> > > http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-scheduler.txt.gz#_Dec_12_17_18_59_045347
> > > [4] https://review.openstack.org/#/c/624730/
> > > 
> > > 
> > 
> > [5] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/dstat-csv_log.txt
> > 
> 
> [6] https://github.com/dagwieers/dstat/pull/162
> [7] https://i.imgur.com/B8HQM8t.png




