[Openstack-operators] [nova] nova-compute automatically disabling itself?

Matt Riedemann mriedemos at gmail.com
Wed Jan 31 21:40:36 UTC 2018


On 1/31/2018 3:16 PM, Chris Apsey wrote:
> All,
> 
> Running into a strange issue I haven't seen before.
> 
> Randomly, the nova-compute services on compute nodes are disabling 
> themselves (as if someone ran openstack compute service set --disable 
> hostX nova-compute).  When this happens, the node continues to report 
> itself as 'up' - the service is just disabled.  As a result, if enough 
> of these occur, we get scheduling errors due to lack of available 
> resources (which makes sense).  Re-enabling them works just fine and 
> they continue on as if nothing happened.  I looked through the logs and 
> I can find the API calls where we re-enable the services (PUT 
> /v2.1/os-services/enable), but I do not see any API calls where the 
> services are getting disabled initially.
> 
> Is anyone aware of any cases where compute nodes will automatically 
> disable their nova-compute service on their own, or has anyone seen this 
> before and might know a root cause?  We have plenty of spare vcpus and 
> RAM on each node - like less than 25% utilization (both in absolute 
> terms and in terms of applied ratios).
> 
> We're seeing follow-on errors regarding rmq messages getting lost and 
> vif-plug failures, but we think those are a symptom, not a cause.
> 
> Currently running pike on Xenial.
> 
> ---
> v/r
> 
> Chris Apsey
> bitskrieg at bitskrieg.net
> https://www.bitskrieg.net
> 


This is actually a feature added in Pike:

https://review.openstack.org/#/c/463597/

This came up in discussion with operators at the Forum in Boston.

The vif-plug failures are likely the reason those computes are getting 
disabled.

There is a config option "consecutive_build_service_disable_threshold" 
which you can set to 0 to turn off the auto-disable behavior, as some 
have experienced issues with it:

https://bugs.launchpad.net/nova/+bug/1742102
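
To make the behavior concrete, here is a rough sketch of the counting 
logic as I understand it. This is not nova's actual code: the class and 
method names are made up for illustration, and the default of 10 is from 
memory, so check the config reference for your release. The idea is that 
consecutive failed builds on a host are counted, a successful build 
resets the count, and once the count reaches the threshold the service 
is disabled until an operator re-enables it. A threshold of 0 means 
never auto-disable.

```python
class BuildFailureTracker:
    """Hypothetical model of the auto-disable counter (not nova code)."""

    def __init__(self, threshold=10):
        # Stands in for consecutive_build_service_disable_threshold.
        # The default of 10 is an assumption; 0 turns the feature off.
        self.threshold = threshold
        self.consecutive_failures = 0
        self.disabled = False

    def record_build(self, succeeded):
        """Update the counter after one build attempt on this host."""
        if succeeded:
            # Any successful build resets the streak. It does not
            # re-enable an already disabled service.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.threshold > 0 and self.consecutive_failures >= self.threshold:
            # In this model the service stays disabled until an
            # operator enables it again by hand.
            self.disabled = True


if __name__ == "__main__":
    tracker = BuildFailureTracker(threshold=3)
    for _ in range(3):
        tracker.record_build(succeeded=False)
    print("disabled:", tracker.disabled)
```

In nova.conf the option should be under the [compute] section, e.g. 
consecutive_build_service_disable_threshold = 0 to opt out.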

-- 

Thanks,

Matt


