[Openstack-operators] What to do when a compute node dies?

Chris Friesen chris.friesen at windriver.com
Mon Mar 30 23:30:35 UTC 2015


On 03/30/2015 04:57 PM, Jay Pipes wrote:
> On 03/30/2015 06:42 PM, Chris Friesen wrote:
>> On 03/30/2015 02:47 PM, Jay Pipes wrote:
>>> On 03/30/2015 10:42 AM, Chris Friesen wrote:
>>>> On 03/29/2015 09:26 PM, Mike Dorman wrote:
>>>>> Hi all,
>>>>>
>>>>> I’m curious about how people deal with failures of compute
>>>>> nodes, as in total failure when the box is gone for good.  (I
>>>>> mainly care about KVM hypervisors, but I'm interested in more
>>>>> general cases as well.)
>>>>>
>>>>> The particular situation we’re looking at: how end users could
>>>>> identify or be notified of VMs that no longer exist, because
>>>>> their hypervisor is dead.  As I understand it, Nova will still
>>>>> believe the VMs are running, and really has no way to know that
>>>>> anything has changed (other than that the nova-compute service
>>>>> has dropped off).
>>>>>
>>>>> I understand failure detection is a tricky thing.  But it
>>>>> seems like there must be something a little better than this.
>>>>
>>>> This is a timely question... I was wondering if it might make
>>>> sense to upstream one of the changes we've made locally.
>>>>
>>>> We have an external entity monitoring the health of compute
>>>> nodes. When one of them goes down we automatically take action
>>>> regarding the instances that had been running on it.
>>>>
>>>> Normally nova won't let you evacuate an instance until the
>>>> compute node is detected as "down", but that typically takes 60
>>>> seconds, whereas our software knows the compute node is gone
>>>> within a few seconds.
>>>
>>> Any external monitoring solution that detects the compute node is
>>> "down" could issue a call to `nova evacuate $HOST`.
>>>
>>> The question I have for you is what does your software consider as
>>> a "downed" node? Is it some heartbeat-type stuff in network
>>> connectivity? A watchdog in KVM? Some proactive monitoring of disk
>>> or memory faults? Some combination? Something entirely different?
>>> :)
>>
>> Combination of the above.  A local entity monitors "critical stuff"
>> on the compute node, and heartbeats with a control node via one or
>> more network links.
>
> OK.
>
>>>> The change we made was to patch nova to allow the health monitor
>>>> to explicitly tell nova that the node is to be considered "down"
>>>> (so that instances can be evacuated without delay).
>>>
>>> Why was it necessary to modify Nova for this? The external
>>> monitoring script could easily do: `nova service-disable $HOST
>>> nova-compute` and that immediately takes the compute node out of
>>> service and enables evacuation.
>>
>> Disabling the service is not sufficient.  compute.api.API.evacuate()
>> throws an exception if servicegroup.api.API.service_is_up(service)
>> is true.
>
> servicegroup.api.service_is_up() returns whether the service has been disabled
> in the database (when using the DB servicegroup driver). Which is what `nova
> service-disable $HOST nova-compute` does.

I must be missing something.

It seems to me that servicegroup.drivers.db.DbDriver.is_up() returns whether the 
database row for the service has been updated for any reason within the last 60 
seconds. (Assuming the default CONF.service_down_time.)
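
To make that concrete, here's a rough standalone sketch of what the DB
driver's check amounts to (not the actual nova code; I'm assuming plain
datetime values and skipping the string/timezone handling):

    import datetime

    SERVICE_DOWN_TIME = 60  # mirrors the CONF.service_down_time default

    def is_up(service_row):
        # 'updated_at' is bumped by *any* write to the service row --
        # including 'nova service-disable' -- not just by the periodic
        # status reports from nova-compute.
        last_heartbeat = service_row['updated_at'] or service_row['created_at']
        elapsed = (datetime.datetime.utcnow() - last_heartbeat).total_seconds()
        return abs(elapsed) <= SERVICE_DOWN_TIME

So disabling or enabling a service writes the row, refreshes
'updated_at', and makes a dead service look alive for another 60
seconds.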

Incidentally, I've proposed https://review.openstack.org/163060 to change that 
logic so that it returns whether the service has sent in a status report within 
the last 60 seconds.  (As it stands, if you disable/enable a "down" service 
it'll be reported as "up" for the next 60 seconds.)
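
That change keys the "up" check off a dedicated heartbeat timestamp
instead.  Continuing the sketch above (the 'last_seen_up' column name
here is just illustrative; see the review for the actual change):

    def is_up_proposed(service_row):
        # Only a genuine status report from nova-compute refreshes this
        # timestamp, so an operator toggling disabled/enabled no longer
        # resets the liveness clock.
        last_report = service_row['last_seen_up'] or service_row['created_at']
        elapsed = (datetime.datetime.utcnow() - last_report).total_seconds()
        return abs(elapsed) <= SERVICE_DOWN_TIME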

> What servicegroup driver are you using?

The DB driver.

Chris


