State of the Gate (placement?)
Matt Riedemann
mriedemos at gmail.com
Fri Nov 1 13:30:59 UTC 2019
On 11/1/2019 5:34 AM, Chris Dent wrote:
> Instead it's "just being slow" which could be any of:
>
> * something in the auth process (which if other services are also
> being slow could be a unifying thing)
> * the database server having a sad (do we keep a slow query log?)
Not by default, but it can be enabled in devstack, like in [1]. The
resulting log file is so big I can't open it in my browser, though, so
hitting this kind of issue with that enabled is going to be like finding
a needle in a haystack, I think.
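For what it's worth, here is a rough sketch of flipping it on by hand
(these are just the stock MySQL knobs, not necessarily what [1] does,
and the credentials/path are guesses):

  # Assumes passwordless root access to the devstack mysql; adjust as needed.
  mysql -uroot -e "SET GLOBAL slow_query_log = 'ON';"
  mysql -uroot -e "SET GLOBAL long_query_time = 1;"
  mysql -uroot -e "SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';"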
> * the vm/vms has/have a noisy neighbor
This is my guess as to the culprit. Like Slawek said elsewhere, he's
seen this in other APIs that are otherwise really fast.
>
> Do we know anything about cpu and io stats during the slow period?
We have peakmem_tracker [2], which shows a jump around the time the
slowness was noticed:
Oct 31 16:51:43.132081 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [iscsid (pid:18546)]=34352KB; [dmeventd (pid:18039)]=119732KB; [ovs-vswitchd (pid:3626)]=691960KB
Oct 31 16:51:43.133012 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: ]]]
Oct 31 16:55:23.220099 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [[[
Oct 31 16:55:23.221255 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: Thu Oct 31 16:55:23 UTC 2019
I don't know what that means though.
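If anyone wants to dig further, something like this (an untested
sketch; the file name comes from [2], the process is just an example)
should pull a single process's memory samples out of that log to see
what actually jumped:

  # Untested sketch: print ovs-vswitchd's memory samples from the
  # peakmem_tracker log; swap in whatever process looks interesting.
  zcat screen-peakmem_tracker.txt.gz | \
    grep -o '\[ovs-vswitchd (pid:[0-9]*)\]=[0-9]*KB' | \
    cut -d= -f2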
We also have dstat [3] (which I find hard as hell to read - do people
throw that into a nice graphical tool to massage the data for
inspection?), which shows a jump as well:
Oct 31 16:53:31.515005 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:31| 19 7 12 61 0|5806M 509M 350M 1194M| 612B 0 | 0 0 | 0 0 |1510 2087 |17.7 9.92 5.82|3.0 13 2.0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 451B 0%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Oct 31 16:55:08.604634 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:32| 20 7 12 61 0|5806M 509M 350M 1194M|1052B 0 | 0 0 | 0 0 |1495 2076 |17.7 9.92 5.82|4.0 13 0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 446B 0.1%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Looks like around the time of the slowness mysqld is the top-consuming
process, which is probably not surprising.
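On the graphical tool question: dstat can also write CSV with
--output, which would be a lot easier to feed into a spreadsheet or
plotting tool than the screen log. A sketch (the column flags, interval
and file name here are just an example, not what dstat.sh actually
runs):

  # Example only: log cpu/mem/net/disk and top-process stats to CSV
  # every 5 seconds alongside the normal screen output.
  dstat --time --cpu --mem --net --disk --top-cpu --top-mem \
    --output dstat-csv.log 5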
>
> We've (sorry, let me rephrase, I've) known for the entire 5.5 years
> I've been involved in OpenStack that the CI resources are way
> over-subscribed and controller nodes in the wild are typically way
> under-specified. Yes, our code should be robust in the face of
> that, but...
Looking at logstash, this mostly hits on OVH and INAP nodes. Question
for infra: do we know if those are more oversubscribed than others,
like the RAX or VEXXHOST nodes?
[1] https://review.opendev.org/#/c/691995/
[2]
https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcdn.com/692206/3/check/tempest-integrated-compute/35ecb4a/controller/logs/screen-peakmem_tracker.txt.gz
[3]
https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcdn.com/692206/3/check/tempest-integrated-compute/35ecb4a/controller/logs/screen-dstat.txt.gz
--
Thanks,
Matt