[Openstack-operators] What to do when a compute node dies?

Chris Friesen chris.friesen at windriver.com
Mon Mar 30 22:42:36 UTC 2015


On 03/30/2015 02:47 PM, Jay Pipes wrote:
> On 03/30/2015 10:42 AM, Chris Friesen wrote:
>> On 03/29/2015 09:26 PM, Mike Dorman wrote:
>>> Hi all,
>>>
>>> I’m curious about how people deal with failures of compute nodes,
>>> as in total failure when the box is gone for good.  (Mainly care
>>> about KVM HV, but also interested in more general cases as well.)
>>>
>>> The particular situation we’re looking at: how end users could
>>> identify or be notified of VMs that no longer exist, because their
>>> hypervisor is dead.  As I understand it, Nova will still believe
>>> VMs are running, and really has no way to know anything has changed
>>> (other than the nova-compute instance has dropped off.)
>>>
>>> I understand failure detection is a tricky thing.  But it seems
>>> like there must be something a little better than this.
>>
>> This is a timely question...I was wondering if it might make sense to
>> upstream one of the changes we've made locally.
>>
>> We have an external entity monitoring the health of compute nodes.
>> When one of them goes down we automatically take action regarding the
>> instances that had been running on it.
>>
>> Normally nova won't let you evacuate an instance until the compute
>> node is detected as "down", but that takes 60 sec typically and our
>> software knows the compute node is gone within a few seconds.
>
> Any external monitoring solution that detects the compute node is "down" could
> issue a call to `nova evacuate $HOST`.
>
> The question I have for you is what does your software consider as a "downed"
> node? Is it some heartbeat-type stuff in network connectivity? A watchdog in
> KVM? Some proactive monitoring of disk or memory faults? Some combination?
> Something entirely different? :)

A combination of the above.  A local entity monitors "critical stuff" on the
compute node and exchanges heartbeats with a control node over one or more
network links.

>> The change we made was to patch nova to allow the health monitor to
>> explicitly tell nova that the node is to be considered "down" (so
>> that instances can be evacuated without delay).
>
> Why was it necessary to modify Nova for this? The external monitoring script
> could easily do: `nova service-disable $HOST nova-compute` and that immediately
> takes the compute node out of service and enables evacuation.

Disabling the service is not sufficient.  compute.api.API.evacuate() throws an
exception if servicegroup.api.API.service_is_up(service) returns true, and
disabling the service does not change that result.
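
To illustrate, here is a minimal Python sketch of that guard.  It is a
simplified model, not the actual nova code: FakeServiceGroup, ServiceStillUp,
the dict-based service record and the explicit "now" argument are all
hypothetical stand-ins, and the 60-second threshold just mirrors the typical
detection delay mentioned above.

```python
class ServiceStillUp(Exception):
    """Raised when evacuation is requested while the service looks alive."""


class FakeServiceGroup:
    """Hypothetical stand-in for the servicegroup API.

    Liveness is judged only by the age of the last heartbeat.
    """

    def __init__(self, service_down_time=60):
        # Assumed threshold in seconds before a silent service counts as down.
        self.service_down_time = service_down_time

    def service_is_up(self, service, now):
        return (now - service["last_heartbeat"]) <= self.service_down_time


def evacuate(servicegroup, service, now):
    """Refuse to evacuate while the source service is still considered up."""
    # Note that service["disabled"] is never consulted here.
    if servicegroup.service_is_up(service, now):
        raise ServiceStillUp(service["host"])
    return "evacuating instances from %s" % service["host"]


if __name__ == "__main__":
    sg = FakeServiceGroup()
    svc = {"host": "compute-1", "disabled": True, "last_heartbeat": 100}
    try:
        evacuate(sg, svc, now=105)   # heartbeat is only 5 seconds old
    except ServiceStillUp as exc:
        print("refused, service still up on", exc)
    print(evacuate(sg, svc, now=161))  # heartbeat is now 61 seconds old
```

In this model a disabled service whose last heartbeat is 5 seconds old is
still refused; the call only goes through once the heartbeat is older than
the threshold.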

>> When the external
>> monitoring entity detects that the compute node is back, it tells
>> nova the node may be considered "up" (if nova agrees that it's
>> "up").
>
> You mean `nova service-disable $HOST nova-compute`?
>
>> Is this ability to tell nova that a compute node is "down" something
>>  that would be of interest to others?
>
> Unless I'm mistaken, `nova service-disable $HOST nova-compute` already exists
> that does this?

No, what we have is basically a way to force
servicegroup.api.API.service_is_up() to return false.  That makes "nova
service-list" show the correct status in its "State" column and allows
evacuation to proceed.
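
As a rough sketch of the idea (not our actual patch, and not nova code), the
override can be modelled like this in Python.  OverridableServiceGroup,
force_down() and clear_forced_down() are hypothetical names, and the 60-second
heartbeat threshold is an assumption used only for illustration.

```python
class OverridableServiceGroup:
    """Hypothetical liveness check with an externally controlled override."""

    def __init__(self, service_down_time=60):
        # Assumed threshold in seconds before a silent service counts as down.
        self.service_down_time = service_down_time
        self._forced_down = set()

    def force_down(self, host):
        """Called by the external monitor when it knows the node is gone."""
        self._forced_down.add(host)

    def clear_forced_down(self, host):
        """Called by the external monitor when the node is back."""
        self._forced_down.discard(host)

    def service_is_up(self, service, now):
        # The override wins immediately, without waiting for heartbeats
        # to age out.
        if service["host"] in self._forced_down:
            return False
        # Otherwise fall back to the normal heartbeat-age check.
        return (now - service["last_heartbeat"]) <= self.service_down_time


if __name__ == "__main__":
    sg = OverridableServiceGroup()
    svc = {"host": "compute-1", "last_heartbeat": 100}
    print(sg.service_is_up(svc, now=103))  # fresh heartbeat, no override
    sg.force_down("compute-1")
    print(sg.service_is_up(svc, now=103))  # override in effect
```

Clearing the override only hands the decision back to the heartbeat check, so
the node shows as "up" again only if the normal check agrees.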

Chris



