Open Stack

Thu Jun 7 14:32:20 UTC 2018

On 2/6/2018 6:44 PM, Matt Riedemann wrote:
> On 2/6/2018 2:14 PM, Chris Apsey wrote:
>> but we would rather have intermittent build failures rather than 
>> compute nodes falling over in the future.
> 
> Note that once a compute has a successful build, the consecutive build 
> failures counter is reset. So if your limit is the default (10) and you 
> have 10 failures in a row, the compute service is auto-disabled. But if 
> you have say 5 failures and then a pass, it's reset to 0 failures.
> 
> Obviously if you're doing a pack-first scheduling strategy rather than 
> spreading instances across the deployment, a burst of failures could 
> easily disable a compute, especially if that host is overloaded like you 
> saw. I'm not sure if rescheduling is helping you or not - that would be 
> useful information since we consider the need to reschedule off a failed 
> compute host as a bad thing. At the Forum in Boston when this idea came 
> up, it was specifically for the case that operators in the room didn't 
> want a bad compute to become a "black hole" in their deployment causing 
> lots of reschedules until they get that one fixed.

Just an update on this. There is a change merged in Rocky [1] which is 
also going through backports to Queens and Pike. If you've already 
disabled the "consecutive_build_service_disable_threshold" config option 
then it's a no-op. If you haven't, 
"consecutive_build_service_disable_threshold" is now used to count build 
failures, but the compute service is no longer auto-disabled when the 
configured threshold (10 by default) is met. The build failure count is 
then used by a new weigher (enabled by default) to sort hosts with build 
failures to the back of the list of candidate hosts for new builds. Once 
there is a successful build on a given host, the failure count is reset. 
The idea here is that hosts which are failing are given lower priority 
during scheduling.
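
To make the counting and ordering behavior concrete, here is a rough 
sketch in Python. This is not the nova code: the class and method names 
are made up for illustration, and the real weigher feeds into the 
scheduler's overall weighing alongside the other weighers rather than 
doing a plain sort. It only models the two rules described above: a 
failure bumps the host's count, a success resets it to 0, and hosts with 
more failures sort toward the back of the candidate list.

```python
class BuildFailureTracker:
    """Toy model of per-host consecutive build failure counting.

    Hypothetical names throughout; this only illustrates the idea.
    """

    def __init__(self):
        # Maps host name -> consecutive build failures since last success.
        self.failed_builds = {}

    def record_build(self, host, succeeded):
        """Reset the count on success, increment it on failure."""
        if succeeded:
            self.failed_builds[host] = 0
        else:
            self.failed_builds[host] = self.failed_builds.get(host, 0) + 1

    def order_candidates(self, hosts):
        """Return hosts with the fewest recent failures first.

        sorted() is stable, so hosts with equal counts keep the
        relative order they came in with.
        """
        return sorted(hosts, key=lambda h: self.failed_builds.get(h, 0))


tracker = BuildFailureTracker()
for _ in range(3):
    tracker.record_build("compute1", succeeded=False)
tracker.record_build("compute2", succeeded=False)

# compute1 has 3 failures, compute2 has 1, compute3 has none.
before = tracker.order_candidates(["compute1", "compute2", "compute3"])

# One successful build puts compute1 back at 0 failures.
tracker.record_build("compute1", succeeded=True)
after = tracker.order_candidates(["compute1", "compute2", "compute3"])
```

In this toy model the failing host is never taken out of the pool; it 
just drops to the back until it has a successful build again.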

[1] https://review.openstack.org/#/c/572195/

-- 

Thanks,

Matt

[Openstack-operators] [nova] nova-compute automatically disabling itself?
