[Openstack-operators] [nova] nova-compute automatically disabling itself?

Matt Riedemann mriedemos at gmail.com
Wed Feb 7 00:44:41 UTC 2018


On 2/6/2018 2:14 PM, Chris Apsey wrote:
> but we would rather have intermittent build failures rather than compute 
> nodes falling over in the future.

Note that once a compute has a successful build, the consecutive build 
failures counter is reset. So if your limit is the default (10) and you 
have 10 failures in a row, the compute service is auto-disabled. But if 
you have, say, 5 failures and then a success, the counter is reset to 0.
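
To make the counter behavior above concrete, here is a minimal Python 
sketch. It is not nova's actual code: the class name BuildFailureTracker 
and its methods are hypothetical, and it assumes that a service, once 
auto-disabled, stays disabled until an operator re-enables it.

```python
class BuildFailureTracker:
    """Hypothetical model of a consecutive-build-failure counter.

    Illustration only; this is not how nova implements it internally.
    """

    def __init__(self, threshold=10):
        # threshold mirrors the default limit of 10 mentioned above
        self.threshold = threshold
        self.consecutive_failures = 0
        self.enabled = True

    def record_build(self, succeeded):
        """Update the counter after one build attempt on this host."""
        if succeeded:
            # Any successful build resets the counter to zero.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            # Assumption: the service stays disabled until an operator
            # re-enables it; a later success does not flip it back.
            self.enabled = False


tracker = BuildFailureTracker()

# 5 failures followed by a pass: counter goes back to 0, still enabled.
for _ in range(5):
    tracker.record_build(False)
tracker.record_build(True)
print(tracker.consecutive_failures, tracker.enabled)  # 0 True

# 10 failures in a row: the service is auto-disabled.
for _ in range(10):
    tracker.record_build(False)
print(tracker.consecutive_failures, tracker.enabled)  # 10 False
```

The point of the sketch is that only an unbroken run of failures trips 
the limit; a single success in between starts the count over.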

Obviously if you're doing a pack-first scheduling strategy rather than 
spreading instances across the deployment, a burst of failures could 
easily disable a compute, especially if that host is overloaded like you 
saw. I'm not sure whether rescheduling is helping you or not. That would 
be useful information, since we consider needing to reschedule off a 
failed compute host a bad thing. At the Forum in Boston when this idea 
came up, it was specifically for the case where operators in the room 
didn't want a bad compute to become a "black hole" in their deployment, 
causing lots of reschedules until they get that one fixed.

-- 

Thanks,

Matt


