[Openstack-operators] [nova] nova-compute automatically disabling itself?

Chris Apsey bitskrieg at bitskrieg.net
Wed Jan 31 21:47:06 UTC 2018


That looks promising.  I'll report back to confirm the solution.

Thanks!

---
v/r

Chris Apsey
bitskrieg at bitskrieg.net
https://www.bitskrieg.net

On 2018-01-31 04:40 PM, Matt Riedemann wrote:
> On 1/31/2018 3:16 PM, Chris Apsey wrote:
>> All,
>> 
>> Running into a strange issue I haven't seen before.
>> 
>> Randomly, the nova-compute services on compute nodes are disabling 
>> themselves (as if someone ran openstack compute service set --disable 
>> hostX nova-compute).  When this happens, the node continues to report 
>> itself as 'up' - the service is just disabled.  As a result, if enough 
>> of these occur, we get scheduling errors due to lack of available 
>> resources (which makes sense).  Re-enabling them works just fine and 
>> they continue on as if nothing happened.  I looked through the logs 
>> and I can find the API calls where we re-enable the services (PUT 
>> /v2.1/os-services/enable), but I do not see any API calls where the 
>> services are getting disabled initially.
>> 
>> Is anyone aware of any cases where compute nodes will automatically 
>> disable their nova-compute service on their own, or has anyone seen 
>> this before and might know a root cause?  We have plenty of spare 
>> vcpus and RAM on each node - like less than 25% utilization (both in 
>> absolute terms and in terms of applied ratios).
>> 
>> We're seeing follow-on errors regarding rmq messages getting lost and 
>> vif-plug failures, but we think those are a symptom, not a cause.
>> 
>> Currently running pike on Xenial.
>> 
>> ---
>> v/r
>> 
>> Chris Apsey
>> bitskrieg at bitskrieg.net
>> https://www.bitskrieg.net
>> 
> 
> 
> This is actually a feature added in Pike:
> 
> https://review.openstack.org/#/c/463597/
> 
> This came up in discussion with operators at the Forum in Boston.
> 
> The vif-plug failures are likely the reason those computes are getting 
> disabled.
> 
> There is a config option "consecutive_build_service_disable_threshold"
> which you can set to 0 to turn off the auto-disable behavior, as some
> have experienced issues with it:
> 
> https://bugs.launchpad.net/nova/+bug/1742102
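
For anyone finding this thread later, here is a minimal sketch of what that
setting might look like in nova.conf on each compute node. It assumes the
option lives in the [compute] group, that the default is 10 consecutive
failed builds, and that 0 turns the behavior off; that is how I recall the
Pike configuration reference describing it, so please verify against the
docs for your release before relying on it.

```ini
# /etc/nova/nova.conf on the compute node (path is the usual default; yours may differ)
[compute]
# Assumed semantics: number of consecutive failed builds after which
# nova-compute disables its own service. 0 = never auto-disable.
consecutive_build_service_disable_threshold = 0
```

The nova-compute service would need a restart to pick this up. Turning the
feature off only hides the symptom: if builds keep failing on a host (for
example because of the vif-plug failures mentioned above), the scheduler
will keep sending instances to it, so the underlying failures still need
to be tracked down.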


