Open Stack

Mon Mar 30 20:47:46 UTC 2015

On 03/30/2015 10:42 AM, Chris Friesen wrote:
> On 03/29/2015 09:26 PM, Mike Dorman wrote:
>> Hi all,
>>
>> I’m curious about how people deal with failures of compute nodes,
>> as in total failure when the box is gone for good.  (Mainly care
>> about KVM HV, but also interested in more general cases as well.)
>>
>> The particular situation we’re looking at: how end users could
>> identify or be notified of VMs that no longer exist, because their
>> hypervisor is dead.  As I understand it, Nova will still believe
>> VMs are running, and really has no way to know anything has changed
>> (other than the nova-compute instance has dropped off.)
>>
>> I understand failure detection is a tricky thing.  But it seems
>> like there must be something a little better than this.
>
> This is a timely question...I was wondering if it might make sense to
> upstream one of the changes we've made locally.
>
> We have an external entity monitoring the health of compute nodes.
> When one of them goes down we automatically take action regarding the
> instances that had been running on it.
>
> Normally nova won't let you evacuate an instance until the compute
> node is detected as "down", but that takes 60 sec typically and our
> software knows the compute node is gone within a few seconds.

Any external monitoring solution that detects the compute node is "down" 
could issue a call to `nova host-evacuate $HOST`.

The question I have for you is what does your software consider as a 
"downed" node? Is it some heartbeat-type stuff in network connectivity? 
A watchdog in KVM? Some proactive monitoring of disk or memory faults? 
Some combination? Something entirely different? :)

> The change we made was to patch nova to allow the health monitor to
> explicitly tell nova that the node is to be considered "down" (so
> that instances can be evacuated without delay).

Why was it necessary to modify Nova for this? The external monitoring 
script could easily do: `nova service-disable $HOST nova-compute` and 
that immediately takes the compute node out of service and enables 
evacuation.
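
For illustration, here is a minimal sketch of what such a monitoring hook might look like in Python. It is only a sketch under assumptions: the function names (`build_down_commands`, `build_up_commands`, `handle_host_down`) are hypothetical, the `nova` CLI is assumed to be installed with credentials already in the environment, and `nova host-evacuate` is assumed to be available in the installed python-novaclient. How the monitor decides a host is dead is left out entirely.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes one argv list and executes it; injectable so the
# command sequence can be inspected without a live cloud.
Runner = Callable[[Sequence[str]], None]


def build_down_commands(host: str) -> List[List[str]]:
    """Return the nova CLI calls, in order, for a compute host deemed dead."""
    return [
        # Take the nova-compute service on this host out of service.
        ["nova", "service-disable", host, "nova-compute"],
        # Assumption: host-evacuate exists in the installed client and
        # evacuates every instance that was running on the host.
        ["nova", "host-evacuate", host],
    ]


def build_up_commands(host: str) -> List[List[str]]:
    """Return the nova CLI call to put a recovered host back in service."""
    return [["nova", "service-enable", host, "nova-compute"]]


def run_checked(argv: Sequence[str]) -> None:
    """Default runner: execute the command, raising if it exits non-zero."""
    subprocess.run(list(argv), check=True)


def handle_host_down(host: str, runner: Runner = run_checked) -> List[List[str]]:
    """Run the 'host is down' sequence and return the commands issued."""
    commands = build_down_commands(host)
    for argv in commands:
        runner(argv)
    return commands
```

A monitor would call `handle_host_down("compute-01")` once it has decided the node is gone. Passing a custom `runner` (for example `print`) gives a dry run. Whether nova accepts the evacuation immediately after the disable call is something to verify on your own deployment; the sketch does not check it.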

> When the external monitoring entity detects that the compute node is
> back, it tells nova the node may be considered "up" (if nova agrees
> that it's "up").

You mean `nova service-enable $HOST nova-compute`?

> Is this ability to tell nova that a compute node is "down" something
> that would be of interest to others?

Unless I'm mistaken, `nova service-disable $HOST nova-compute` already 
exists and does this?

Best,
-jay

[Openstack-operators] What to do when a compute node dies?
