On 11/1/2019 9:55 AM, Clark Boylan wrote:
> INAP was also recently turned back on. It had been offline for redeployment and that was completed and added back to the pool. Possible that more than just the openstack version has changed?
>
> OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?

Related to slow nodes, I noticed this failed recently: a synchronous RPC call from nova-api to nova-compute timed out after 60 seconds [1]. Looking at MessagingTimeout errors in the nova-api logs shows they are mostly on INAP and OVH nodes [2], so there seems to be a pattern emerging of those slow nodes causing issues. There are ways we could work around this a bit on the nova side [3], but I'm not sure how much we want to make parts of nova super resilient to very slow nodes when real-life operations would probably need to know about this kind of thing to scale up/out their control plane.

[1] https://zuul.opendev.org/t/openstack/build/ef0196fe84804b44ac106d011c8c29ea/...
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22MessagingTimeout%5C%22%20AND%20tags%3A%5C%22screen-n-api.txt%5C%22&from=7d
[3] https://review.opendev.org/#/c/692550/

--

Thanks,

Matt
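
For anyone who wants to check the per-provider pattern outside the logstash dashboard, here is a minimal sketch of the tally. It assumes the log hits have already been exported as (provider, message) pairs. That record shape, the function name, and the sample records are all hypothetical and invented for illustration; they are not real results from the query in [2].

```python
from collections import Counter


def count_timeouts_by_provider(records):
    """Count MessagingTimeout hits per provider.

    `records` is an iterable of (provider, message) pairs. This shape is
    an assumption for the sketch, not the actual logstash export format.
    """
    counts = Counter()
    for provider, message in records:
        if "MessagingTimeout" in message:
            counts[provider] += 1
    return counts


# Made-up sample records, only to show the call; not real log data.
sample = [
    ("provider-a", "MessagingTimeout: Timed out waiting for a reply"),
    ("provider-a", "MessagingTimeout: Timed out waiting for a reply"),
    ("provider-b", "MessagingTimeout: Timed out waiting for a reply"),
    ("provider-c", "GET /servers/detail returned 200"),
]

for provider, hits in count_timeouts_by_provider(sample).most_common():
    print(provider, hits)
```

A count like this only shows where the timeouts cluster. It does not say whether the cause is disk IO, CPU, or something else on those nodes.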