[Openstack-operators] What to do when a compute node dies?

Jay Pipes jaypipes at gmail.com
Tue Mar 31 03:53:45 UTC 2015


On 03/30/2015 07:30 PM, Chris Friesen wrote:
> On 03/30/2015 04:57 PM, Jay Pipes wrote:
>> On 03/30/2015 06:42 PM, Chris Friesen wrote:
>>> On 03/30/2015 02:47 PM, Jay Pipes wrote:
>>>> On 03/30/2015 10:42 AM, Chris Friesen wrote:
>>>>> On 03/29/2015 09:26 PM, Mike Dorman wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I’m curious about how people deal with failures of compute
>>>>>>  nodes, as in total failure when the box is gone for good.
>>>>>>  (Mainly care about KVM HV, but also interested in more
>>>>>> general cases as well.)
>>>>>>
>>>>>> The particular situation we’re looking at: how end users
>>>>>> could identify or be notified of VMs that no longer exist,
>>>>>> because their hypervisor is dead.  As I understand it, Nova
>>>>>> will still believe VMs are running, and really has no way
>>>>>> to know anything has changed (other than the nova-compute
>>>>>> instance has dropped off.)
>>>>>>
>>>>>> I understand failure detection is a tricky thing.  But it
>>>>>> seems like there must be something a little better than
>>>>>> this.
>>>>>
>>>>> This is a timely question...I was wondering if it might make
>>>>>  sense to upstream one of the changes we've made locally.
>>>>>
>>>>> We have an external entity monitoring the health of compute
>>>>> nodes. When one of them goes down we automatically take
>>>>> action regarding the instances that had been running on it.
>>>>>
>>>>> Normally nova won't let you evacuate an instance until the
>>>>> compute node is detected as "down", but that takes 60 sec
>>>>> typically and our software knows the compute node is gone
>>>>> within a few seconds.
>>>>
>>>> Any external monitoring solution that detects the compute node
>>>> is "down" could issue a call to `nova evacuate $HOST`.
>>>>
>>>> The question I have for you is what does your software
>>>> consider as a "downed" node? Is it some heartbeat-type stuff in
>>>> network connectivity? A watchdog in KVM? Some proactive
>>>> monitoring of disk or memory faults? Some combination?
>>>> Something entirely different? :)
>>>
>>> Combination of the above.  A local entity monitors "critical
>>> stuff" on the compute node, and heartbeats with a control node
>>> via one or more network links.
>>
>> OK.
>>
>>>>> The change we made was to patch nova to allow the health
>>>>> monitor to explicitly tell nova that the node is to be
>>>>> considered "down" (so that instances can be evacuated
>>>>> without delay).
>>>>
>>>> Why was it necessary to modify Nova for this? The external
>>>> monitoring script could easily do: `nova service-disable $HOST
>>>>  nova-compute` and that immediately takes the compute node out
>>>>  of service and enables evacuation.
>>>
>>> Disabling the service is not sufficient.
>>> compute.api.API.evacuate() throws an exception if
>>> servicegroup.api.API.service_is_up(service) is true.
>>
>> servicegroup.api.service_is_up() returns whether the service has
>> been disabled in the database (when using the DB servicegroup
>> driver). Which is what `nova service-disable $HOST nova-compute`
>> does.
>
> I must be missing something.
>
> It seems to me that servicegroup.drivers.db.DbDriver.is_up() returns
> whether the database row for the service has been updated for any
> reason within the last 60 seconds. (Assuming the default
> CONF.service_down_time.)
>
> Incidentally, I've proposed https://review.openstack.org/163060 to
> change that logic so that it returns whether the service has sent in
> a status report in the last 60 seconds.  (As it stands currently if
> you disable/enable a "down" service it'll report that the service is
> "up" for the next 60 seconds.)
>
>> What servicegroup driver are you using?
>
> The DB driver.

You've hit upon a bug. In no way should a disabled service be considered
"up". Apologies. I checked the code and, indeed, there is no check for
whether the service record from the DB is disabled.
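For the archives, that check boils down to a timestamp comparison. A
simplified, self-contained paraphrase (not the verbatim source; the
60-second default for CONF.service_down_time is assumed):

    from datetime import datetime, timedelta

    SERVICE_DOWN_TIME = 60  # seconds; mirrors the CONF.service_down_time default

    def is_up(service_ref):
        # Only the age of the last status report matters here --
        # nothing looks at service_ref['disabled'].
        last_heartbeat = service_ref['updated_at'] or service_ref['created_at']
        elapsed = datetime.utcnow() - last_heartbeat
        return abs(elapsed) <= timedelta(seconds=SERVICE_DOWN_TIME)

One obvious way to close the gap would be a disabled check there (or in
the servicegroup API wrapper above it), though see the refactoring caveat
below.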

The servicegroup code needs to be refactored entirely to remove its
current coupling to the DB (yes, it uses the DB even if you aren't using
the DB servicegroup driver... don't ask :( ).

Here's the best (as in worst) part about this:

The DB driver's get_all() method -- which is used by, say, the scheduler
to grab the list of compute nodes it can schedule to -- *does* return only
the non-disabled hosts. You wouldn't know that from looking at the
servicegroup.drivers.db.DbDriver.get_all() code, though, since all it does
is call the DB API's service_get_all_by_topic() method:

http://git.openstack.org/cgit/openstack/nova/tree/nova/servicegroup/drivers/db.py#n91

However, look deeper into that DB API method and, lo and behold, there is
a hard-coded disabled=False filter in the SQLAlchemy query:

http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n479
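Abridged paraphrase of the code behind those two links (helpers, logging
and docstrings omitted), just to make the chain visible:

    # servicegroup/drivers/db.py -- get_all() essentially just delegates
    # to the DB API (abridged; the real method also filters through is_up()):
    def get_all(self, group_id):
        ctxt = context.get_admin_context()
        return db.service_get_all_by_topic(ctxt, group_id)

    # db/sqlalchemy/api.py -- and the DB API quietly drops disabled rows:
    def service_get_all_by_topic(context, topic):
        return model_query(context, models.Service, read_deleted="no").\
                    filter_by(disabled=False).\
                    filter_by(topic=topic).\
                    all()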

That filter is why I didn't think the servicegroup API's service_is_up()
method would ever be handed a disabled service: I'd never seen it happen,
because the scheduler code I'm familiar with only loops over non-disabled
hosts. But the evacuate code doesn't use the same code path, and thus you
saw the behaviour you did. :(
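For comparison, the check at the top of the evacuate path goes straight to
service_is_up() with whatever service record it finds, disabled or not
(roughly, an abridged paraphrase of nova/compute/api.py with error
handling, logging and the actual rebuild call omitted):

    def evacuate(self, context, instance, host, on_shared_storage,
                 admin_password=None):
        service = objects.Service.get_by_compute_host(context, instance.host)
        if self.servicegroup_api.service_is_up(service):
            # a disabled but recently heartbeating compute still trips this
            raise exception.ComputeServiceInUse(host=instance.host)
        ...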

I've filed a bug here:

https://bugs.launchpad.net/nova/+bug/1438503

Thanks,
-jay


