[dev] [infra] [devstack] [qa] [kuryr] DevStack's etcd performance on gate VM's

Michał Dulko mdulko at redhat.com
Thu Dec 20 15:18:38 UTC 2018


On Thu, 2018-12-13 at 09:06 -0800, Clark Boylan wrote:
> On Thu, Dec 13, 2018, at 4:39 AM, Michał Dulko wrote:
> > Hi,
> > 
> > In Kuryr-Kubernetes we're using the DevStack-installed etcd as the
> > backend store for the Kubernetes we run on our gates. For some time
> > we've been seeing its performance degrade, manifesting like this [1]
> > in the logs. Later on this leads to various K8s errors [2], [3], up
> > to missing notifications from the API, which causes failures in the
> > Kuryr-Kubernetes tempest tests. From what I've seen, those etcd
> > warnings normally mean that disk latency is high.
> > 
> > This seems to be mostly happening on OVH and RAX hosts. I've looked
> > into this with the OVH folks and there isn't anything immediately
> > alarming about their hosts running gate VMs.
> 
> That's interesting because we've been working with amorin at OVH over
> debugging similar IO problems and I think we both agree something is
> happening. We've disabled the BHS1 region as the vast majority of
> related failures were there, but kept GRA1 up and running which is
> where your example is from. My understanding is that a memory issue
> of some sort was found on the compute hypervisors (which could affect
> disk throughput if there isn't memory for caching available or if
> swap is using up available disk IO). We are currently waiting on
> amorin's go ahead to turn BHS1 back on after this is corrected.
> 
> > Upgrading the etcd version doesn't seem to help, and neither does
> > patch [4], which increases the IO priority of the etcd process.
> > 
> > Any ideas on what I can try next? I think we're the only project
> > that puts so much pressure on DevStack's etcd. Help would really be
> > welcome; getting rid of this issue would greatly improve our gate
> > stability.
> 
> It wouldn't surprise me if others aren't using etcd much. One thing
> that may help is to use the dstat data [5] from these failed jobs to
> rule out resource contention from within the job (cpu, io(ps),
> memory, etc). One thing we've found debugging these slower nodes is
> that it often exposes real bugs in our software by making them cost
> more. We should double check there isn't anything obvious like that
> happening here too.

There are multiple places where we're experiencing this, but mostly in
interaction with the K8s API, so of course a software bug there is
possible. We've seen it before, but it seems the failures became more
frequent when we updated the K8s version.

> I've been putting the csv file in https://lamada.eu/dstat-graph/ and
> that renders it for human consumption. But there are other tools out
> there for this too.

I've rendered it as well, but besides finding a bug in dstat [6] that
made it show kuryr-daemon using 2 TB of RAM, I didn't notice anything
special, especially in IO operations. In one of the most recent
failures there's only a spike in processes and paging [7] around the
time we see the fatal failure.
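
For the record, a rough way to pull numbers out of the dstat CSV [5]
instead of eyeballing the graphs is something like the sketch below.
The column names ("read", "writ", "in", "out") are assumptions based
on the usual dstat flags, and since dstat reuses names between
sections it only grabs the first match:

    # Summarize a few disk/paging columns from a dstat CSV log.
    import csv
    import sys

    def load_dstat(path):
        with open(path) as f:
            rows = list(csv.reader(f))
        # dstat CSVs start with a few metadata lines, then a category
        # row, then the column-name row, then one sample per line.
        for i, row in enumerate(rows):
            if "usr" in row or "writ" in row:
                return row, rows[i + 1:]
        raise ValueError("column-name row not found")

    def column(header, data, name):
        idx = header.index(name)  # first match only
        vals = []
        for r in data:
            try:
                vals.append(float(r[idx]))
            except (IndexError, ValueError):
                pass
        return vals

    header, data = load_dstat(sys.argv[1])
    for name in ("read", "writ", "in", "out"):
        if name in header:
            vals = column(header, data, name)
            if vals:
                print("%-5s max=%14.0f avg=%14.0f"
                      % (name, max(vals), sum(vals) / len(vals)))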

Any ideas on how I should proceed? From the number of rechecks we're
doing you can easily see that this is hitting us hard.
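
In the meantime, one way to at least quantify this across runs would
be to count and bucket the latency warnings in the etcd screen log,
e.g. with something like the sketch below (the regex assumes the usual
"took too long (<duration>)" wording, so it may need adjusting for
other warning variants):

    # Count the "took too long" warnings in an etcd screen log and
    # report how bad the reported latencies get.
    import re
    import sys

    PATTERN = re.compile(r"took too long \((?P<value>[\d.]+)(?P<unit>ms|s)\)")

    durations_ms = []
    with open(sys.argv[1], errors="replace") as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                value = float(match.group("value"))
                if match.group("unit") == "s":
                    value *= 1000
                durations_ms.append(value)

    if durations_ms:
        durations_ms.sort()
        print("warnings: %d" % len(durations_ms))
        print("median:   %.0f ms" % durations_ms[len(durations_ms) // 2])
        print("worst:    %.0f ms" % durations_ms[-1])
    else:
        print("no 'took too long' warnings found")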

> > Thanks,
> > Michał
> > 
> > [1] 
> > http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-etcd.txt.gz#_Dec_12_17_19_33_618619
> > [2] 
> > http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-api.txt.gz#_Dec_12_17_20_19_772688
> > [3] 
> > http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/screen-kubernetes-scheduler.txt.gz#_Dec_12_17_18_59_045347
> > [4] https://review.openstack.org/#/c/624730/
> > 
> > 
> 
> [5] http://logs.openstack.org/49/624749/1/check/kuryr-kubernetes-tempest-daemon-octavia/4a47162/controller/logs/dstat-csv_log.txt
> 

[6] https://github.com/dagwieers/dstat/pull/162
[7] https://i.imgur.com/B8HQM8t.png



