Re: State of the Gate (placement?)

4 Nov 2019


On 11/1/2019 9:55 AM, Clark Boylan wrote:
...
INAP was also recently turned back on. It had been offline for redeployment and that was completed and added back to the pool. Possible that more than just the OpenStack version has changed?
OVH controls the disk IOPS that we get pretty aggressively as well. Possible it is an IO thing?
Related to slow nodes, I noticed this failed recently. It's a 
synchronous RPC call from nova-api to nova-compute that timed out after 
60 seconds [1]. Looking at MessagingTimeout errors in the nova-api logs 
shows they are mostly on INAP and OVH nodes [2], so there seems to be a 
pattern emerging with those slow nodes causing issues. There are 
ways we could work around this a bit on the nova side [3], but I'm not 
sure how much we want to make parts of nova super resilient to very slow 
nodes when real-life operators would probably need to know about this 
kind of thing so they can scale up/out their control plane.

[1] https://zuul.opendev.org/t/openstack/build/ef0196fe84804b44ac106d011c8c29ea/...
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22MessagingTimeout%5C%22%20AND%20tags%3A%5C%22screen-n-api.txt%5C%22&from=7d
[3] https://review.opendev.org/#/c/692550/
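
For anyone not familiar with the failure mode, here is a rough sketch of what a synchronous call with a reply timeout looks like from the caller's side. This is plain standard-library Python, not nova or oslo.messaging code, and it is not the change proposed in [3]. The names call_with_timeout, slow_reply and CallTimeout are made up for illustration, and the timeouts are scaled down from 60 seconds so it runs quickly.

```python
import concurrent.futures
import time


class CallTimeout(Exception):
    """Hypothetical stand-in for a messaging timeout (not the oslo.messaging class)."""


def slow_reply(delay):
    """Pretend to be a remote service that takes `delay` seconds to answer."""
    time.sleep(delay)
    return "ok"


def call_with_timeout(func, timeout, *args):
    """Run func(*args) in a worker thread and block for at most `timeout` seconds.

    The caller is stuck waiting until the reply arrives or the timeout
    expires, which is what makes a slow peer visible as an error on the
    calling side.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError as exc:
        raise CallTimeout("no reply within %.2f seconds" % timeout) from exc
    finally:
        # Do not block on the worker; a late reply is simply discarded.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # A fast peer answers well inside the timeout.
    print(call_with_timeout(slow_reply, 1.0, 0.01))
    # A slow peer does not, so the caller sees a timeout instead of a result.
    try:
        call_with_timeout(slow_reply, 0.05, 0.3)
    except CallTimeout as err:
        print("timed out:", err)
```

Raising the timeout makes the second call succeed, but it only hides the fact that the peer is slow, which is the trade-off I'm unsure about above.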

-- 

Thanks,

Matt

Matt Riedemann