[Openstack-operators] What to do when a compute node dies?
Jay Pipes
jaypipes at gmail.com
Mon Mar 30 20:47:46 UTC 2015
On 03/30/2015 10:42 AM, Chris Friesen wrote:
> On 03/29/2015 09:26 PM, Mike Dorman wrote:
>> Hi all,
>>
>> I’m curious about how people deal with failures of compute nodes,
>> as in total failure when the box is gone for good. (Mainly care
>> about KVM HV, but also interested in more general cases as well.)
>>
>> The particular situation we’re looking at: how end users could
>> identify or be notified of VMs that no longer exist, because their
>> hypervisor is dead. As I understand it, Nova will still believe
>> VMs are running, and really has no way to know anything has changed
>> (other than the nova-compute instance has dropped off.)
>>
>> I understand failure detection is a tricky thing. But it seems
>> like there must be something a little better than this.
>
> This is a timely question...I was wondering if it might make sense to
> upstream one of the changes we've made locally.
>
> We have an external entity monitoring the health of compute nodes.
> When one of them goes down we automatically take action regarding the
> instances that had been running on it.
>
> Normally nova won't let you evacuate an instance until the compute
> node is detected as "down", but that takes 60 sec typically and our
> software knows the compute node is gone within a few seconds.

Any external monitoring solution that detects the compute node is "down"
could issue a call to `nova host-evacuate $HOST` (plain `nova evacuate`
takes a server, not a host).

The question I have for you is: what does your software consider a
"downed" node? Is it heartbeat-type checks on network connectivity?
A watchdog in KVM? Some proactive monitoring of disk or memory faults?
Some combination? Something entirely different? :)

> The change we made was to patch nova to allow the health monitor to
> explicitly tell nova that the node is to be considered "down" (so
> that instances can be evacuated without delay).

Why was it necessary to modify Nova for this? The external monitoring
script could simply run `nova service-disable $HOST nova-compute`, which
immediately takes the compute node out of service and enables
evacuation.

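For illustration, the sequence described above could be wrapped in a small script that a monitor calls once it has decided a node is dead. This is only a sketch: the `handle_dead_host` and `run` helpers and the `NOVA` and `DRY_RUN` variables are made up for this example, and it assumes the `nova` client is installed with admin credentials in the environment. It uses `nova host-evacuate`, which takes a host name, rather than `nova evacuate`, which takes a single server.

```shell
#!/bin/sh
# Sketch only: actions an external monitor might take for a dead compute node.
# NOVA and DRY_RUN are illustrative knobs for this example, not novaclient options.
NOVA="${NOVA:-nova}"

run() {
    # In dry-run mode, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

handle_dead_host() {
    host="$1"
    if [ -z "$host" ]; then
        echo "usage: handle_dead_host HOST" >&2
        return 2
    fi
    # Take the nova-compute service on this host out of scheduling first.
    run "$NOVA" service-disable "$host" nova-compute || return 1
    # Then ask nova to evacuate every instance that was on the host.
    run "$NOVA" host-evacuate "$host"
}
```

Running `DRY_RUN=1 handle_dead_host compute-01` prints the two commands without touching the cloud, which is a cheap way to check the wiring before letting a monitor call it for real. Whether nova accepts the evacuation immediately after the disable may depend on how your release decides that a service is down, so test this against your own deployment.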
> When the external
> monitoring entity detects that the compute node is back, it tells
> nova the node may be considered "up" (if nova agrees that it's
> "up").

You mean `nova service-enable $HOST nova-compute`?

> Is this ability to tell nova that a compute node is "down" something
> that would be of interest to others?

Unless I'm mistaken, `nova service-disable $HOST nova-compute` already
exists and does exactly this?

Best,
-jay