[Openstack-operators] What to do when a compute node dies?

Jay Pipes jaypipes at gmail.com
Mon Mar 30 22:57:38 UTC 2015


On 03/30/2015 06:42 PM, Chris Friesen wrote:
> On 03/30/2015 02:47 PM, Jay Pipes wrote:
>> On 03/30/2015 10:42 AM, Chris Friesen wrote:
>>> On 03/29/2015 09:26 PM, Mike Dorman wrote:
>>>> Hi all,
>>>>
>>>> I’m curious about how people deal with failures of compute
>>>> nodes, as in total failure when the box is gone for good.
>>>> (Mainly care about KVM HV, but also interested in more general
>>>> cases as well.)
>>>>
>>>> The particular situation we’re looking at: how end users could
>>>> identify or be notified of VMs that no longer exist, because
>>>> their hypervisor is dead.  As I understand it, Nova will still
>>>> believe VMs are running, and really has no way to know anything
>>>> has changed (other than the nova-compute instance has dropped
>>>> off.)
>>>>
>>>> I understand failure detection is a tricky thing.  But it
>>>> seems like there must be something a little better than this.
>>>
>>> This is a timely question...I was wondering if it might make
>>> sense to upstream one of the changes we've made locally.
>>>
>>> We have an external entity monitoring the health of compute
>>> nodes. When one of them goes down we automatically take action
>>> regarding the instances that had been running on it.
>>>
>>> Normally nova won't let you evacuate an instance until the
>>> compute node is detected as "down", but that takes 60 sec
>>> typically and our software knows the compute node is gone within
>>> a few seconds.
>>
>> Any external monitoring solution that detects the compute node is
>> "down" could issue a call to `nova evacuate $HOST`.
>>
>> The question I have for you is what does your software consider as
>> a "downed" node? Is it some heartbeat-type stuff in network
>> connectivity? A watchdog in KVM? Some proactive monitoring of disk
>> or memory faults? Some combination? Something entirely different?
>> :)
>
> Combination of the above.  A local entity monitors "critical stuff"
> on the compute node, and heartbeats with a control node via one or
> more network links.

OK.

>>> The change we made was to patch nova to allow the health monitor
>>> to explicitly tell nova that the node is to be considered "down"
>>> (so that instances can be evacuated without delay).
>>
>> Why was it necessary to modify Nova for this? The external
>> monitoring script could easily do: `nova service-disable $HOST
>> nova-compute` and that immediately takes the compute node out of
>> service and enables evacuation.
>
> Disabling the service is not sufficient.  compute.api.API.evacuate()
> throws an exception if servicegroup.api.API.service_is_up(service)
> is true.

servicegroup.api.API.service_is_up() returns whether the service has been
disabled in the database (when using the DB servicegroup driver), and
disabling it there is what `nova service-disable $HOST nova-compute` does.
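To make the suggestion concrete, here is a minimal sketch of what an
external monitor's "host is dead" hook might look like. The function names
(build_commands, handle_dead_host) and the runner parameter are hypothetical,
made up for illustration. It uses `nova host-evacuate`, which is an
assumption on my part for the per-host step, since `nova evacuate` itself
takes a server rather than a host. Whether the evacuation is accepted right
away depends on what service_is_up() reports in your deployment.

```python
import subprocess


def build_commands(host, binary="nova-compute"):
    """Return the nova CLI invocations a monitor might issue for a dead host."""
    return [
        # Take the compute service out of scheduling.
        ["nova", "service-disable", host, binary],
        # Ask nova to evacuate every instance that was on the host
        # (assumption: per-host evacuation via host-evacuate).
        ["nova", "host-evacuate", host],
    ]


def handle_dead_host(host, dry_run=True, runner=subprocess.check_call):
    """Print the commands (dry run) or execute them in order via runner."""
    commands = build_commands(host)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            runner(cmd)  # check_call raises if a command exits non-zero
    return commands
```

Calling handle_dead_host("compute-01") only prints the two commands; pass
dry_run=False once the credentials (OS_* environment variables) are in place.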

>>> When the external monitoring entity detects that the compute node
>>> is back, it tells nova the node may be considered "up" (if nova
>>> agrees that it's "up").
>>
>> You mean `nova service-disable $HOST nova-compute`?
>>
>>> Is this ability to tell nova that a compute node is "down"
>>> something that would be of interest to others?
>>
>> Unless I'm mistaken, `nova service-disable $HOST nova-compute`
>> already exists and does this?
>
> No, what we have is basically a way to cause
> servicegroup.api.API.service_is_up() to return false. That causes
> the correct status to be displayed in the "State" column in the
> output of "nova service-list" and allows evacuation to proceed.

That's exactly what `nova service-disable $HOST nova-compute` does.

What servicegroup driver are you using?
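If it helps to check, the driver is normally selected by the
servicegroup_driver option in the [DEFAULT] section of nova.conf, and it
falls back to "db" when unset. A rough sketch for reading it follows; the
helper name and the sample config text are hypothetical, and it assumes the
file parses as plain INI.

```python
import configparser


def servicegroup_driver(conf_text, default="db"):
    """Return the servicegroup_driver value from nova.conf-style text."""
    # strict=False tolerates repeated options; interpolation=None keeps
    # any '%' characters in other option values from raising errors.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.defaults().get("servicegroup_driver", default)


sample = "[DEFAULT]\nservicegroup_driver = zk\n"
print(servicegroup_driver(sample))
```

For a real node, pass in the contents of the nova.conf that the running
services actually use.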

Best,
-jay


