[Openstack-operators] [nova] nova-compute automatically disabling itself?
Chris Apsey
bitskrieg at bitskrieg.net
Tue Feb 6 20:14:46 UTC 2018
All,
That was the core issue. Setting
consecutive_build_service_disable_threshold = 0 in nova.conf (on
controllers and compute nodes) solved it. The auto-disable was being
triggered by neutron dropping requests (and/or responses) for
vif-plugging because CPU usage on the neutron endpoints was pegged at
100% for too long. We increased our rpc_response_timeout value, and the
issue appears to be resolved for the time being. We could probably
safely remove the consecutive_build_service_disable_threshold override
at this point, but we would rather have intermittent build failures
than compute nodes falling over in the future.
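
As an illustration only, the change described above might look roughly
like the following nova.conf fragment. The [compute] group is where the
option is defined in Pike. The rpc_response_timeout line is an
assumption: the message does not say which service's configuration was
changed or what value was used, so it is left commented out with a
placeholder value.

```ini
[compute]
# 0 means "never auto-disable the nova-compute service after
# consecutive build failures" (the Pike default is 10).
consecutive_build_service_disable_threshold = 0

[DEFAULT]
# Assumption: illustrative value only. The oslo.messaging default is
# 60 seconds; the value actually used is not stated in this thread.
# rpc_response_timeout = 120
```

The affected services need a restart to pick up the new values.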
On a slightly related note, we are seeing that the neutron endpoints
use noticeably more CPU time than in the past with a similar workload
(we run linuxbridge with vxlan). We believe this is tied to our
application of KPTI for Meltdown mitigation across the various hosts in
our cluster (the timeline matches). Has anyone else experienced similar
impacts, or can anyone suggest something to try to lessen the impact?
---
v/r
Chris Apsey
bitskrieg at bitskrieg.net
https://www.bitskrieg.net
On 2018-01-31 04:47 PM, Chris Apsey wrote:
> That looks promising. I'll report back to confirm the solution.
>
> Thanks!
>
> ---
> v/r
>
> Chris Apsey
> bitskrieg at bitskrieg.net
> https://www.bitskrieg.net
>
> On 2018-01-31 04:40 PM, Matt Riedemann wrote:
>> On 1/31/2018 3:16 PM, Chris Apsey wrote:
>>> All,
>>>
>>> Running into a strange issue I haven't seen before.
>>>
>>> Randomly, the nova-compute services on compute nodes are disabling
>>> themselves (as if someone ran openstack compute service set --disable
>>> hostX nova-compute). When this happens, the node continues to report
>>> itself as 'up' - the service is just disabled. As a result, if
>>> enough of these occur, we get scheduling errors due to lack of
>>> available resources (which makes sense). Re-enabling them works just
>>> fine and they continue on as if nothing happened. I looked through
>>> the logs and I can find the API calls where we re-enable the services
>>> (PUT /v2.1/os-services/enable), but I do not see any API calls where
>>> the services are getting disabled initially.
>>>
>>> Is anyone aware of any cases where compute nodes will automatically
>>> disable their nova-compute service on their own, or has anyone seen
>>> this before and might know a root cause? We have plenty of spare
>>> vcpus and RAM on each node - like less than 25% utilization (both in
>>> absolute terms and in terms of applied ratios).
>>>
>>> We're seeing follow-on errors regarding rmq messages getting lost and
>>> vif-plug failures, but we think those are a symptom, not a cause.
>>>
>>> Currently running Pike on Xenial.
>>>
>>> ---
>>> v/r
>>>
>>> Chris Apsey
>>> bitskrieg at bitskrieg.net
>>> https://www.bitskrieg.net
>>>
>>> _______________________________________________
>>> OpenStack-operators mailing list
>>> OpenStack-operators at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
>> This is actually a feature added in Pike:
>>
>> https://review.openstack.org/#/c/463597/
>>
>> This came up in discussion with operators at the Forum in Boston.
>>
>> The vif-plug failures are likely the reason those computes are getting
>> disabled.
>>
>> There is a config option "consecutive_build_service_disable_threshold"
>> which you can set to disable the auto-disable behavior as some have
>> experienced issues with it:
>>
>> https://bugs.launchpad.net/nova/+bug/1742102