On 11/4/2019 6:58 PM, Clark Boylan wrote:
Typically we try to work with the clouds to properly root cause the issue. Then from there we can figure out what the best fix may be. They are running our software after all and there is a good chance the problems are in openstack.
I'm in Shanghai at the moment, but if others want to reach out, feel free. benj_ and mgagne are at inap and amorin has been helpful at ovh. The test node logs include a hostid in them somewhere which can be used to identify hypervisors if necessary.

I noticed this today [1]. That doesn't always result in failed jobs, but I correlated it to a timeout failure in a nova functional job [2], and those normally don't have these types of problems. Note the correlation to when it spikes: midnight and noon, it looks like. The dip on 11/2 and 11/3 was the weekend. And it's mostly OVH nodes. So they must have some kind of cron or something that hits at those times?

Anecdotally, I'll also note that it seems like the gate is much more stable this week while the summit is happening. We're actually able to merge some changes in nova, which is kind of amazing given the last month or so of rechecks we've had to do.

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Function%20'nova.servicegroup.drivers.db.DbDriver._report_state'%20run%20outlasted%20interval%20by%5C%22&from=7d
[2] https://zuul.opendev.org/t/openstack/build/63001bbd58c244cea70c995f1ebf61fb/...

--

Thanks,

Matt