[Openstack-operators] What to do when a compute node dies?

Chris Friesen chris.friesen at windriver.com
Mon Mar 30 14:42:50 UTC 2015


On 03/29/2015 09:26 PM, Mike Dorman wrote:
> Hi all,
>
> I’m curious about how people deal with failures of compute nodes, as in total
> failure when the box is gone for good.  (Mainly care about KVM HV, but also
> interested in more general cases as well.)
>
> The particular situation we’re looking at: how end users could identify or be
> notified of VMs that no longer exist, because their hypervisor is dead.  As I
> understand it, Nova will still believe VMs are running, and really has no way to
> know anything has changed (other than the nova-compute instance has dropped off.)
>
> I understand failure detection is a tricky thing.  But it seems like there must
> be something a little better than this.

This is a timely question. I have been wondering whether it would make sense to
upstream one of the changes we've made locally.

We have an external entity monitoring the health of compute nodes.  When one of 
them goes down we automatically take action regarding the instances that had 
been running on it.

Normally nova won't let you evacuate an instance until the compute node is
detected as "down", but that typically takes about 60 seconds, whereas our
software knows the compute node is gone within a few seconds.
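
To make the delay concrete, here is a minimal sketch of timeout-based detection.
It assumes a service counts as "down" once its last heartbeat is older than a
fixed timeout (60 seconds here, matching the typical figure above). All names
are hypothetical and this is not nova's actual code.

```python
from datetime import datetime, timedelta

# Assumed timeout: a service with no heartbeat for this long is "down".
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(last_heartbeat, now):
    """Return True while the last heartbeat is still within the timeout."""
    return (now - last_heartbeat) <= SERVICE_DOWN_TIME


def can_evacuate(last_heartbeat, now):
    """Evacuation is only permitted once the node is considered down."""
    return not is_up(last_heartbeat, now)
```

Under these assumptions, a node that died five seconds ago still looks "up", so
evacuation is refused until the timeout has elapsed.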

The change we made was to patch nova to allow the health monitor to explicitly
tell nova that the node is to be considered "down" (so that instances can be
evacuated without delay).  When the external monitoring entity detects that the
compute node is back, it tells nova that the node may be considered "up" again;
nova then treats it as "up" only if nova itself agrees that it's "up".
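
The behaviour described above could be sketched roughly as follows. This is an
illustration only, not our actual patch: the `forced_down` flag, the class, and
the function names are all hypothetical, and the 60-second timeout is assumed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed timeout for the ordinary heartbeat-based check.
SERVICE_DOWN_TIME = timedelta(seconds=60)


@dataclass
class ComputeService:
    last_heartbeat: datetime
    forced_down: bool = False  # hypothetical flag set by the external monitor

    def heartbeat_ok(self, now):
        """The ordinary check: heartbeat seen within the timeout."""
        return (now - self.last_heartbeat) <= SERVICE_DOWN_TIME

    def is_up(self, now):
        """Up only if the monitor has not forced it down AND the heartbeat is fresh."""
        return not self.forced_down and self.heartbeat_ok(now)


def monitor_reports_down(svc):
    """External monitor says the node is gone: mark it down immediately."""
    svc.forced_down = True


def monitor_reports_up(svc):
    """External monitor says the node is back: clear the override only."""
    svc.forced_down = False
```

Clearing the flag does not by itself make the service "up"; the heartbeat check
still has to pass, which is the "if nova agrees" part.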

Is this ability to tell nova that a compute node is "down" something that would 
be of interest to others?

Chris


