On Fri, Nov 1, 2019, at 6:30 AM, Matt Riedemann wrote:
On 11/1/2019 5:34 AM, Chris Dent wrote:
Instead it's "just being slow" which could be any of:
* something in the auth process (which if other services are also being slow could be a unifying thing)
* the database server having a sad (do we keep a slow query log?)
Not by default, but it can be enabled in devstack, like [1]. The resulting log file is so big I can't open it in my browser, though, so even with that enabled, hitting this kind of issue is going to be like finding a needle in a haystack I think.
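If we do end up enabling it again, something like this might help find the needle. It's only a rough sketch: it assumes the stock MySQL slow log format (a "# Query_time:" comment line followed by the statement text), and the file name is just a placeholder for a locally downloaded copy of the log:

    # Rough sketch: surface the N slowest statements from a MySQL slow query log.
    # Assumes the stock log format ("# Query_time: ..." comment lines followed by
    # the SQL text); the file name below is just a placeholder.
    import heapq
    import re

    QUERY_TIME = re.compile(r'# Query_time: (\d+\.\d+)')

    def slowest(path, top=20):
        entries = []
        qtime = None
        sql = []
        with open(path) as f:
            for line in f:
                m = QUERY_TIME.match(line)
                if m:
                    if qtime is not None:
                        entries.append((qtime, ' '.join(sql)))
                    qtime = float(m.group(1))
                    sql = []
                elif qtime is not None and not line.startswith('#'):
                    sql.append(line.strip())
        if qtime is not None:
            entries.append((qtime, ' '.join(sql)))
        return heapq.nlargest(top, entries)

    if __name__ == '__main__':
        for qtime, sql in slowest('mysql-slow.log'):
            print('%8.3fs  %s' % (qtime, sql[:120]))

pt-query-digest would do a nicer job of aggregating if it's available on the node, but the above avoids pulling in extra tooling.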
* the vm/vms has/have a noisy neighbor
This is my guess as to the culprit. Like Slawek said elsewhere, he's seen this in other APIs that are otherwise really fast.
Do we know anything about cpu and io stats during the slow period?
We have peakmemtracker [2], which shows a jump during the slow period that was noticed:
Oct 31 16:51:43.132081 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [iscsid (pid:18546)]=34352KB; [dmeventd (pid:18039)]=119732KB; [ovs-vswitchd (pid:3626)]=691960KB
Oct 31 16:51:43.133012 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: ]]]
Oct 31 16:55:23.220099 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [[[
Oct 31 16:55:23.221255 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: Thu Oct 31 16:55:23 UTC 2019
I don't know what that means though.
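In case it helps anyone else stare at it, here's a rough sketch of pulling the per-process numbers back out of that output and printing the biggest deltas between the last two snapshots. The record layout ("[name (pid:NNN)]=NNNKB" entries between "[[[" / "]]]" markers) is inferred from the snippet above, and the input file name is a placeholder:

    # Rough sketch: pull the per-process numbers out of memory_tracker.sh output
    # and print the biggest deltas between the last two snapshots. The record
    # layout is inferred from the journal snippet; the file name is a placeholder.
    import re

    PROC = re.compile(r'\[(\S+) \(pid:(\d+)\)\]=(\d+)KB')

    def snapshots(path):
        snaps, current = [], {}
        with open(path) as f:
            for line in f:
                if '[[[' in line:
                    current = {}
                elif ']]]' in line:
                    snaps.append(current)
                else:
                    for name, pid, kb in PROC.findall(line):
                        current[(name, pid)] = int(kb)
        return snaps

    if __name__ == '__main__':
        snaps = snapshots('memory_tracker.txt')
        before, after = snaps[-2], snaps[-1]
        deltas = sorted(((after[k] - before.get(k, 0), k) for k in after),
                        reverse=True)
        for delta, (name, pid) in deltas[:15]:
            print('%+10d KB  %s (pid %s)' % (delta, name, pid))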
We also have dstat [3] (which I find hard as hell to read - do people throw that into a nice graphical tool to massage that data for inspection?) which shows a jump as well:
I put the dstat csv files in https://lamada.eu/dstat-graph/ and that works reasonably well.
Oct 31 16:53:31.515005 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:31| 19 7 12 61 0|5806M 509M 350M 1194M| 612B 0 | 0 0 | 0 0 |1510 2087 |17.7 9.92 5.82|3.0 13 2.0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 451B 0%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Oct 31 16:55:08.604634 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:32| 20 7 12 61 0|5806M 509M 350M 1194M|1052B 0 | 0 0 | 0 0 |1495 2076 |17.7 9.92 5.82|4.0 13 0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 446B 0.1%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Looks like around the time of the slowness mysqld is the top consuming process, which is probably not surprising.
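For massaging the dstat csv locally, pandas plus matplotlib also works reasonably well. This is only a sketch: the number of banner rows at the top of the csv and the exact column names depend on the dstat version, and the file name is a placeholder, so all of those may need adjusting:

    # Rough sketch: graph the dstat --output csv instead of reading the journal
    # lines. The banner/category rows at the top of the csv and the exact column
    # names vary by dstat version, so treat these as guesses.
    import csv
    import pandas as pd
    import matplotlib.pyplot as plt

    PATH = 'dstat-csv.log'

    # Find the row that actually names the columns ("usr", "sys", ...), skipping
    # the banner and category rows dstat writes first.
    with open(PATH) as f:
        for header_row, row in enumerate(csv.reader(f)):
            if 'usr' in row:
                break

    df = pd.read_csv(PATH, skiprows=header_row)

    fig, axes = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
    df[['usr', 'sys', 'idl', 'wai']].astype(float).plot(ax=axes[0], title='cpu %')
    df[['1m', '5m', '15m']].astype(float).plot(ax=axes[1], title='load average')
    fig.savefig('dstat.png')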
We've (sorry, let me rephrase, I've) known for the entire 5.5 years I've been involved in OpenStack that the CI resources are way over-subscribed and controller nodes in the wild are typically way under-specified. Yes, our code should be robust in the face of that, but...
Looking at logstash this mostly hits on OVH and INAP nodes. Question to infra: do we know if those are more oversubscribed than some others like RAX or VEXXHOST nodes?
I believe both OVH and INAP give us dedicated hypervisors. This means that we will end up being our own noisy neighbors there. I don't know what level we oversubscribe at but amorin (OVH) and benj_ (INAP) can probably provide more info. INAP was also recently turned back on. It had been offline for redeployment and that was completed and added back to the pool. Possible that more than just the openstack version has changed? OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
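If someone wants to check the IO theory on a held node, a quick-and-dirty sampler like this would show whether we flatline at a fixed IOPs ceiling during the slow window. It's just a sketch using psutil, and the device name and interval are placeholders:

    # Rough sketch: sample block device IO counters every few seconds so an IOPs
    # throttle shows up as a flat ceiling. psutil only; the device name and the
    # interval are placeholders.
    import time
    import psutil

    DEV = 'vda'        # guess; adjust to whatever the test node actually has
    INTERVAL = 5.0

    prev = psutil.disk_io_counters(perdisk=True)[DEV]
    while True:
        time.sleep(INTERVAL)
        cur = psutil.disk_io_counters(perdisk=True)[DEV]
        print('%s  read %.0f iops  write %.0f iops' % (
            time.strftime('%H:%M:%S'),
            (cur.read_count - prev.read_count) / INTERVAL,
            (cur.write_count - prev.write_count) / INTERVAL))
        prev = cur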
[1] https://review.opendev.org/#/c/691995/
[2] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...
[3] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...