[Openstack-operators] What to do when a compute node dies?
Chris Friesen
chris.friesen at windriver.com
Mon Mar 30 14:42:50 UTC 2015
On 03/29/2015 09:26 PM, Mike Dorman wrote:
> Hi all,
>
> I’m curious about how people deal with failures of compute nodes, as in total
> failure when the box is gone for good. (Mainly care about KVM HV, but also
> interested in more general cases as well.)
>
> The particular situation we’re looking at: how end users could identify or be
> notified of VMs that no longer exist, because their hypervisor is dead. As I
> understand it, Nova will still believe VMs are running, and really has no way to
> know anything has changed (other than the nova-compute instance has dropped off.)
>
> I understand failure detection is a tricky thing. But it seems like there must
> be something a little better than this.

This is a timely question. I was wondering whether it would make sense to
upstream one of the changes we've made locally.

We have an external entity monitoring the health of compute nodes. When one of
them goes down we automatically take action regarding the instances that had
been running on it.

Normally nova won't let you evacuate an instance until the compute node is
detected as "down". That typically takes 60 seconds, whereas our software knows
the compute node is gone within a few seconds.
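
To illustrate the delay, here is a minimal sketch of a timeout-based liveness
check. It is not nova's actual code: the names is_up and SERVICE_DOWN_TIME are
made up for this example, and the 60-second threshold is simply the typical
figure mentioned above.

```python
from datetime import datetime, timedelta

# Assumed threshold, taken from the "typically 60 seconds" figure above.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Treat a node as up while its last heartbeat is within down_time."""
    return (now - last_heartbeat) <= down_time


now = datetime(2015, 3, 30, 14, 42, 50)

# A node whose last heartbeat was 5 seconds ago still counts as up,
# even if it actually died right after sending it.
print(is_up(now - timedelta(seconds=5), now))

# Only once the heartbeat is older than the threshold is the node "down".
print(is_up(now - timedelta(seconds=61), now))
```

With a check like this, a node that has just died keeps looking "up" until the
full timeout has elapsed, which is the window an external monitor can shorten.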

The change we made was to patch nova so that the health monitor can explicitly
tell nova that the node is to be considered "down" (so that instances can be
evacuated without delay). When the external monitoring entity detects that the
compute node is back, it tells nova the node may be considered "up" again, and
nova then reports it as "up" only if its own checks agree.
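
The intended semantics can be sketched roughly as follows. This is only an
illustration of the idea, not the patch itself: NodeState, forced_down,
heartbeat_fresh, mark_down, mark_up and effective_state are all hypothetical
names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    heartbeat_fresh: bool       # nova's own view, based on recent heartbeats
    forced_down: bool = False   # hypothetical flag set by the external monitor


def mark_down(node):
    """External monitor says the node is gone: treat it as down immediately."""
    node.forced_down = True


def mark_up(node):
    """External monitor says the node is back: stop overriding nova's view."""
    node.forced_down = False


def effective_state(node):
    """The node is 'up' only if it is not forced down AND nova agrees."""
    if node.forced_down:
        return "down"
    return "up" if node.heartbeat_fresh else "down"


node = NodeState(heartbeat_fresh=True)
mark_down(node)
print(effective_state(node))   # forced down, even though the heartbeat looks fresh

mark_up(node)
node.heartbeat_fresh = False
print(effective_state(node))   # monitor allows "up", but nova does not agree yet
```

The point of the asymmetry is that the monitor can force "down" on its own,
while "up" still needs nova's own check to pass.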

Is this ability to tell nova that a compute node is "down" something that would
be of interest to others?

Chris