[Openstack-operators] [nova] nova-compute automatically disabling itself?

Chris Apsey bitskrieg at bitskrieg.net
Wed Jan 31 21:16:16 UTC 2018


Running into a strange issue I haven't seen before.

Randomly, the nova-compute services on compute nodes are disabling 
themselves (as if someone ran openstack compute service set --disable 
hostX nova-compute).  When this happens, the node continues to report 
itself as 'up' - the service is just disabled.  As a result, if enough 
of these occur, we get scheduling errors due to lack of available 
resources (which makes sense).  Re-enabling them works just fine and 
they continue on as if nothing happened.  I looked through the logs and 
I can find the API calls where we re-enable the services (PUT 
/v2.1/os-services/enable), but I do not see any API calls where the 
services are getting disabled initially.
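
For anyone wanting to check for the same condition, here is a minimal 
sketch that picks out nova-compute services reported as 'up' but 
'disabled'.  It assumes the service list has been exported as JSON (for 
example with openstack compute service list -f json) and that the 
entries carry "Binary", "Host", "Status" and "State" keys; the sample 
data and the function name up_but_disabled are hypothetical.

```python
import json

# Hypothetical sample shaped like the JSON export of the compute service
# list; host names and values are made up for illustration only.
SAMPLE = """
[
  {"Binary": "nova-compute", "Host": "hostA", "Status": "enabled", "State": "up"},
  {"Binary": "nova-compute", "Host": "hostB", "Status": "disabled", "State": "up"},
  {"Binary": "nova-compute", "Host": "hostC", "Status": "disabled", "State": "down"},
  {"Binary": "nova-scheduler", "Host": "ctl1", "Status": "enabled", "State": "up"}
]
"""


def up_but_disabled(services):
    """Return hosts whose nova-compute service is up yet disabled."""
    return [
        s["Host"]
        for s in services
        if s.get("Binary") == "nova-compute"
        and s.get("State") == "up"
        and s.get("Status") == "disabled"
    ]


if __name__ == "__main__":
    # Assumed key names; adjust if your client version labels them differently.
    print(up_but_disabled(json.loads(SAMPLE)))
```

Run periodically, something like this would at least show when a node 
flips to disabled, which narrows the window of logs to search.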

Is anyone aware of any cases where compute nodes will automatically 
disable their nova-compute service on their own, or has anyone seen this 
before and might know a root cause?  We have plenty of spare vCPUs and 
RAM on each node - less than 25% utilization (both in absolute 
terms and in terms of applied ratios).

We're seeing follow-on errors regarding RabbitMQ messages getting lost 
and vif-plug failures, but we think those are a symptom, not a cause.

Currently running Pike on Xenial.


Chris Apsey
bitskrieg at bitskrieg.net

