On 11/4/19 16:58, Clark Boylan wrote:
On Mon, Nov 4, 2019, at 7:37 PM, Chris Dent wrote:
On Fri, 1 Nov 2019, Matt Riedemann wrote:
On 11/1/2019 9:55 AM, Clark Boylan wrote:
OVH controls the disk IOPS that we get pretty aggressively as well. Possible it is an IO thing?
Yeah, so looking at the dstat output in that graph (thanks for pointing out that site, really nice) we basically have 0 I/O from 16:53 to 16:55, so uh, that's probably not good.
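(For context on what "the dstat output" is: the CI jobs have historically captured dstat on the test node, and a roughly equivalent invocation, sketched here rather than copied from any job definition, prints per-second CPU, disk throughput, and I/O request columns:

    dstat --time --cpu --disk --io 1

A multi-second stretch of zeros in the disk and io columns is the kind of stall being described above.)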
What happens in a case like this? Is there an official procedure for "hey, can you give us more IO?" or (if that's not an option) "can you give us less CPU?". Is that something that is automated, or is it something that is monitored and alarmed on? "INAP ran out of IO X times in the last N hours, light the beacons!"
Typically we try to work with the clouds to properly root cause the issue. From there we can figure out what the best fix may be. They are running our software after all, and there is a good chance the problems are in OpenStack.
I'm in Shanghai at the moment, but if others want to reach out feel free. benj_ and mgagne are at INAP and amorin has been helpful at OVH. The test node logs include a hostid in them somewhere which can be used to identify hypervisors if necessary.
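(If someone does need to chase down a specific hypervisor: hostId is Nova's hashed per-project host identifier, so wherever it ends up in the logs, a plain grep over a downloaded job log directory, e.g.

    grep -ri 'hostid' . | head

is usually enough to dig it out; which files it lands in varies by job, so treat the path as illustrative.)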
Just wanted to throw this out there to the ML in case anyone has any thoughts: since we know that I/O is overloaded in these cases, would it make any sense to have infra/tempest use a flavor which sets disk I/O quotas [1] to help prevent any one process from getting starved out?

I agree that properly troubleshooting the root cause is necessary, and maybe adding limits would not be desirable out of concern that they could hide issues.

-melanie

[1] https://docs.openstack.org/nova/latest/user/flavors.html#extra-specs-disk-tu...
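(To make that concrete: the limits in [1] are ordinary flavor extra specs in the quota: namespace, so wiring them into a flavor would look roughly like the following; the flavor name and numbers are made up for illustration, not a recommendation:

    openstack flavor set io-limited \
        --property quota:disk_total_iops_sec=500 \
        --property quota:disk_total_bytes_sec=104857600

With the libvirt driver those extra specs get translated into iotune settings on the guest's disks, which is what would cap any single instance's I/O.)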