[openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Zane Bitter zbitter at redhat.com
Tue Mar 18 17:18:50 UTC 2014


On 18/03/14 12:42, Steven Dake wrote:
> On 03/18/2014 07:54 AM, Qiming Teng wrote:
>> Hi, Folks,
>>
>>    I have been trying to implement an HACluster resource type in Heat. I
>> haven't created a blueprint for this because I am not sure everything
>> will work as expected.
>>
>>    The basic idea is to extend the OS::Heat::ResourceGroup resource type
>> with inner resource types fixed to be OS::Nova::Server.  Properties for
>> this HACluster resource may include:
>>
>>    - init_size: initial number of Server instances;
>>    - min_size: minimal number of Server instances;
>>    - sig_handler: a reference to a sub-class of SignalResponder;
>>    - zones: a list of strings representing the availability zones, which
>>            could be the names of the racks where the Server can be booted;
>>    - recovery_action: a list of supported failure recovery actions, such
>>        as 'restart', 'remote-restart', 'migrate';
>>    - fencing_options: a dict specifying how to shut down the Server
>>        cleanly so that data consistency in storage and network is
>>        preserved;
>>    - resource_ref: a dict for defining the Server instances to be
>>        created.
>>
>>    Attributes of the HACluster may include:
>>    - refs: a list of resource IDs for the currently active Servers;
>>    - ips: a list of IP addresses for convenience.
>>
>>    Note that the 'remote-restart' action above is today referred to as
>> 'evacuate'.
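
For illustration only, the property list above could be modelled roughly
as follows. This is a sketch in plain Python, not Heat's real property
schema API; the class name, the field types and the validation rules are
all assumptions, and only the property names and the three action names
come from the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Action names taken from the proposal; everything else is hypothetical.
SUPPORTED_ACTIONS = {"restart", "remote-restart", "migrate"}


@dataclass
class HAClusterProperties:
    """Hypothetical container for the proposed HACluster properties."""
    init_size: int
    min_size: int
    sig_handler: str = ""  # assumed: name of a SignalResponder subclass
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=list)
    fencing_options: Dict[str, str] = field(default_factory=dict)
    resource_ref: Dict[str, object] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the sizes or actions are inconsistent."""
        if self.min_size < 0:
            raise ValueError("min_size must not be negative")
        if self.init_size < self.min_size:
            raise ValueError("init_size must be at least min_size")
        unknown = set(self.recovery_action) - SUPPORTED_ACTIONS
        if unknown:
            raise ValueError(
                "unsupported recovery actions: %s" % sorted(unknown))
```

The check that init_size is at least min_size is an assumption about the
intended semantics; the proposal does not state it.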
>>
>>    The most difficult issue here is to come up with a reliable VM failure
>> detection mechanism.  The service_group feature in Nova is only concerned
>> with the OpenStack services themselves, not the VMs.  Because
>> user-provided images can be used in our customer's cloud environment,
>> we cannot assume that agents inside the VMs will send heartbeat signals.
>>
>>    I have checked the 'instances' table in the Nova database; it seems
>> that the 'updated_at' column is only updated when a VM state change is
>> reported.  If 'heartbeat' messages are coming in from many VMs very
>> frequently, there could be a DB query performance/scalability issue,
>> right?
>>
>>    So, how can I detect VM failures reliably, so that I can notify Heat
>> to take the appropriate recovery action?
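
To make the scalability concern above concrete, here is a minimal sketch
of heartbeat bookkeeping that stays in memory instead of updating a
database row on every message. It is not Nova or Heat code:
HeartbeatTracker and its methods are hypothetical names, and it says
nothing about where the heartbeats come from, which is the harder part of
the question.

```python
import time
from typing import Dict, List, Optional


class HeartbeatTracker:
    """Hypothetical in-memory record of the last heartbeat seen per VM."""

    def __init__(self, timeout: float) -> None:
        # Seconds of silence after which a VM is reported as stale.
        self.timeout = timeout
        self._last_seen: Dict[str, float] = {}

    def record(self, vm_id: str, now: Optional[float] = None) -> None:
        """Remember that vm_id was heard from at time `now`."""
        self._last_seen[vm_id] = time.monotonic() if now is None else now

    def stale(self, now: Optional[float] = None) -> List[str]:
        """Return the sorted IDs of VMs silent for longer than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            vm_id for vm_id, seen in self._last_seen.items()
            if now - seen > self.timeout
        )
```

A caller would invoke record() for each incoming heartbeat and poll
stale() periodically, passing the result to whatever triggers recovery.
Note that a VM that never sent a heartbeat is not tracked at all, so it
would have to be registered some other way.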
> Qiming,
>
> Check out
>
> https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template
>
>
> You should be able to use the HARestarter resource and functionality to
> do healthchecking of a vm.

HARestarter is actually pretty problematic, both in a "causes major 
architectural headaches for Heat and will probably be deprecated very 
soon" sense and a "may do very unexpected things to your resources" 
sense. I wouldn't recommend it.

cheers,
Zane.

> It would be cool if Nova could grow a feature to actively look at the
> VM's state internally and determine whether it is healthy (e.g., look at
> its memory and see if the scheduler is running, things like that), but
> this would require individual support from each hypervisor for such
> functionality.
>
> Until that happens, health checking from within the VM seems like the
> only reasonable solution.
>
> Regards
> -steve
>
>> Regards,
>>    - Qiming
>>
>> Research Scientist
>> IBM Research - China
>> tengqim at cn dot ibm dot com
>>
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev