On Fri, Nov 1, 2019, at 6:30 AM, Matt Riedemann wrote:
On 11/1/2019 5:34 AM, Chris Dent wrote:
Instead it's "just being slow" which could be any of:
* something in the auth process (which if other services are also being slow could be a unifying thing)
* the database server having a sad (do we keep a slow query log?)
Not by default, but it can be enabled in devstack, like [1]. The resulting log file is so big I can't open it in my browser, though, so even with that enabled, hitting this kind of issue is going to be like finding a needle in a haystack I think.
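If we do end up enabling it again, something like this might help find the needle. It's only a rough sketch: it assumes the stock MySQL slow log format (a "# Query_time:" comment line followed by the statement text), and the file name is just a placeholder for a locally downloaded copy of the log:

    # Rough sketch: surface the N slowest statements from a MySQL slow query log.
    # Assumes the stock log format ("# Query_time: ..." comment lines followed by
    # the SQL text); the file name below is just a placeholder.
    import heapq
    import re

    QUERY_TIME = re.compile(r'# Query_time: (\d+\.\d+)')

    def slowest(path, top=20):
        entries = []
        qtime = None
        sql = []
        with open(path) as f:
            for line in f:
                m = QUERY_TIME.match(line)
                if m:
                    if qtime is not None:
                        entries.append((qtime, ' '.join(sql)))
                    qtime = float(m.group(1))
                    sql = []
                elif qtime is not None and not line.startswith('#'):
                    sql.append(line.strip())
        if qtime is not None:
            entries.append((qtime, ' '.join(sql)))
        return heapq.nlargest(top, entries)

    if __name__ == '__main__':
        for qtime, sql in slowest('mysql-slow.log'):
            print('%8.3fs  %s' % (qtime, sql[:120]))

pt-query-digest would do a nicer job of aggregating if it's available on the node, but the above avoids pulling in extra tooling.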
* the vm/vms has/have a noisy neighbor
This is my guess as to the culprit. Like Slawek said elsewhere, he's seen this in other APIs that are otherwise really fast.
Do we know anything about cpu and io stats during the slow period?
We have peakmemtracker [2], which shows a jump during the slow period that was noticed:
Oct 31 16:51:43.132081 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [iscsid (pid:18546)]=34352KB; [dmeventd (pid:18039)]=119732KB; [ovs-vswitchd (pid:3626)]=691960KB
Oct 31 16:51:43.133012 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: ]]]
Oct 31 16:55:23.220099 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [[[
Oct 31 16:55:23.221255 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: Thu Oct 31 16:55:23 UTC 2019
I don't know what that means though.
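In case it helps anyone else stare at it, here's a rough sketch of pulling the per-process numbers back out of that output and printing the biggest deltas between the last two snapshots. The record layout ("[name (pid:NNN)]=NNNKB" entries between "[[[" / "]]]" markers) is inferred from the snippet above, and the input file name is a placeholder:

    # Rough sketch: pull the per-process numbers out of memory_tracker.sh output
    # and print the biggest deltas between the last two snapshots. The record
    # layout is inferred from the journal snippet; the file name is a placeholder.
    import re

    PROC = re.compile(r'\[(\S+) \(pid:(\d+)\)\]=(\d+)KB')

    def snapshots(path):
        snaps, current = [], {}
        with open(path) as f:
            for line in f:
                if '[[[' in line:
                    current = {}
                elif ']]]' in line:
                    snaps.append(current)
                else:
                    for name, pid, kb in PROC.findall(line):
                        current[(name, pid)] = int(kb)
        return snaps

    if __name__ == '__main__':
        snaps = snapshots('memory_tracker.txt')
        before, after = snaps[-2], snaps[-1]
        deltas = sorted(((after[k] - before.get(k, 0), k) for k in after),
                        reverse=True)
        for delta, (name, pid) in deltas[:15]:
            print('%+10d KB  %s (pid %s)' % (delta, name, pid))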
We also have dstat [3] (which I find hard as hell to read - do people throw that into a nice graphical tool to massage that data for inspection?) which shows a jump as well:
I put the dstat csv files in https://lamada.eu/dstat-graph/ and that works reasonably well.
Oct 31 16:53:31.515005 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:31| 19 7 12 61 0|5806M 509M 350M 1194M| 612B 0 | 0 0 | 0 0 |1510 2087 |17.7 9.92 5.82|3.0 13 2.0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 451B 0%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Oct 31 16:55:08.604634 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:32| 20 7 12 61 0|5806M 509M 350M 1194M|1052B 0 | 0 0 | 0 0 |1495 2076 |17.7 9.92 5.82|4.0 13 0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 446B 0.1%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Looks like around the time of the slowness mysqld is the top consuming process, which is probably not surprising.
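For massaging the dstat csv locally, pandas plus matplotlib also works reasonably well. This is only a sketch: the number of banner rows at the top of the csv and the exact column names depend on the dstat version, and the file name is a placeholder, so all of those may need adjusting:

    # Rough sketch: graph the dstat --output csv instead of reading the journal
    # lines. The banner/category rows at the top of the csv and the exact column
    # names vary by dstat version, so treat these as guesses.
    import csv
    import pandas as pd
    import matplotlib.pyplot as plt

    PATH = 'dstat-csv.log'

    # Find the row that actually names the columns ("usr", "sys", ...), skipping
    # the banner and category rows dstat writes first.
    with open(PATH) as f:
        for header_row, row in enumerate(csv.reader(f)):
            if 'usr' in row:
                break

    df = pd.read_csv(PATH, skiprows=header_row)

    fig, axes = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
    df[['usr', 'sys', 'idl', 'wai']].astype(float).plot(ax=axes[0], title='cpu %')
    df[['1m', '5m', '15m']].astype(float).plot(ax=axes[1], title='load average')
    fig.savefig('dstat.png')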
We've (sorry, let me rephrase, I've) known for the entire 5.5 years I've been involved in OpenStack that the CI resources are way over-subscribed and controller nodes in the wild are typically way under-specified. Yes, our code should be robust in the face of that, but...
Looking at logstash this mostly hits on OVH and INAP nodes. Question to infra: do we know if those are more oversubscribed than some others like RAX or VEXXHOST nodes?
I believe both OVH and INAP give us dedicated hypervisors. This means that we will end up being our own noisy neighbors there. I don't know what level we oversubscribe at but amorin (OVH) and benj_ (INAP) can probably provide more info. INAP was also recently turned back on. It had been offline for redeployment and that was completed and added back to the pool. Possible that more than just the openstack version has changed? OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
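If someone wants to check the IO theory on a held node, a quick-and-dirty sampler like this would show whether we flatline at a fixed IOPs ceiling during the slow window. It's just a sketch using psutil, and the device name and interval are placeholders:

    # Rough sketch: sample block device IO counters every few seconds so an IOPs
    # throttle shows up as a flat ceiling. psutil only; the device name and the
    # interval are placeholders.
    import time
    import psutil

    DEV = 'vda'        # guess; adjust to whatever the test node actually has
    INTERVAL = 5.0

    prev = psutil.disk_io_counters(perdisk=True)[DEV]
    while True:
        time.sleep(INTERVAL)
        cur = psutil.disk_io_counters(perdisk=True)[DEV]
        print('%s  read %.0f iops  write %.0f iops' % (
            time.strftime('%H:%M:%S'),
            (cur.read_count - prev.read_count) / INTERVAL,
            (cur.write_count - prev.write_count) / INTERVAL))
        prev = cur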
[1] https://review.opendev.org/#/c/691995/
[2] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...
[3] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...