[openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Steven Dake sdake at redhat.com
Tue Mar 18 16:42:18 UTC 2014

On 03/18/2014 07:54 AM, Qiming Teng wrote:
> Hi, Folks,
>    I have been trying to implement an HACluster resource type in Heat. I
> haven't created a blueprint for this because I am not sure everything
> will work as expected.
>    The basic idea is to extend the OS::Heat::ResourceGroup resource type
> with inner resource types fixed to be OS::Nova::Server.  Properties for
> this HACluster resource may include:
>    - init_size: initial number of Server instances;
>    - min_size: minimal number of Server instances;
>    - sig_handler: a reference to a sub-class of SignalResponder;
>    - zones: a list of strings representing the availability zones, which
>            could be the names of the racks where the Servers can be booted;
>    - recovery_action: a list of supported failure recovery actions, such
>        as 'restart', 'remote-restart', 'migrate';
>    - fencing_options: a dict specifying how to shut down the Server
>        cleanly so that data consistency in storage and network is
>        preserved;
>    - resource_ref: a dict for defining the Server instances to be
>        created.
>    Attributes of the HACluster may include:
>    - refs: a list of resource IDs for the currently active Servers;
>    - ips: a list of IP addresses for convenience.
>    Note that the 'remote-restart' action above is today referred to as
> 'evacuate'.
>    The most difficult issue here is to come up with a reliable VM failure
> detection mechanism.  The service_group feature in Nova is only concerned
> with the OpenStack services themselves, not the VMs.  Considering that
> in our customer's cloud environment, user-provided images can be used,
> we cannot assume that agents in the VMs will send heartbeat signals.
>    I have checked the 'instances' table in the Nova database, and it
> seems that the 'updated_at' column is only updated when a VM state
> change is reported.  If 'heartbeat' messages are coming in from many
> VMs very frequently, there could be a DB query performance/scalability
> issue, right?
>    So, how can I detect VM failures reliably, so that I can notify Heat
> to take the appropriate recovery action?
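
For reference, the proposed properties could be modelled roughly as
below. This is only a sketch in plain Python, not Heat's property
schema API. The class name HAClusterProperties, the defaults, and the
validation rules are assumptions made for illustration; only the field
names come from the proposal above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Recovery actions named in the proposal ('remote-restart' is what is
# referred to as 'evacuate' today).
SUPPORTED_RECOVERY_ACTIONS = ("restart", "remote-restart", "migrate")


@dataclass
class HAClusterProperties:
    """Hypothetical container for the proposed HACluster properties."""
    init_size: int                  # initial number of Server instances
    min_size: int                   # minimal number of Server instances
    sig_handler: str                # reference to a SignalResponder sub-class
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=list)
    fencing_options: Dict[str, Any] = field(default_factory=dict)
    resource_ref: Dict[str, Any] = field(default_factory=dict)

    def validate(self) -> None:
        # Assumed rules; the proposal does not spell out any validation.
        if self.min_size < 0:
            raise ValueError("min_size must not be negative")
        if self.init_size < self.min_size:
            raise ValueError("init_size must be at least min_size")
        unknown = [a for a in self.recovery_action
                   if a not in SUPPORTED_RECOVERY_ACTIONS]
        if unknown:
            raise ValueError(
                "unsupported recovery actions: %s" % ", ".join(unknown))
```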

Check out


You should be able to use the HARestarter resource and functionality to 
do healthchecking of a vm.

It would be cool if nova could grow a feature to actively look at the
vm's state internally and determine if it was healthy (e.g., look at its
memory and see if the scheduler is running, things like that), but this
would require individual support from each hypervisor for such

Until that happens, healthchecking from within the vm seems like the 
only reasonable solution.
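
To make the in-VM approach a bit more concrete, here is a rough sketch
of the detection side: each guest reports a heartbeat, and a monitor
flags any server whose last heartbeat is older than a timeout.
Timestamps are held in memory rather than written to the Nova database
on every beat, which avoids the query load you were worried about. The
HeartbeatMonitor class, its method names, and the 30 second default are
all assumptions for illustration; this is not how HARestarter is
implemented.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical in-memory tracker of per-server heartbeats."""

    def __init__(self, timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout      # seconds of silence before a server is suspect
        self._clock = clock         # injectable so the logic can be tested
        self._last_seen: Dict[str, float] = {}

    def beat(self, server_id: str) -> None:
        # Called whenever a heartbeat arrives from the agent inside a VM.
        self._last_seen[server_id] = self._clock()

    def forget(self, server_id: str) -> None:
        # Stop tracking a server, e.g. after it was deleted on purpose.
        self._last_seen.pop(server_id, None)

    def failed(self) -> List[str]:
        # Servers whose most recent heartbeat is older than the timeout.
        now = self._clock()
        return sorted(sid for sid, seen in self._last_seen.items()
                      if now - seen > self.timeout)
```

A periodic task could call failed() and signal Heat for each ID it
returns. Note that a missed heartbeat only shows that the guest went
silent; a network partition looks the same as a crash, so fencing still
matters before restarting the server elsewhere.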


> Regards,
>    - Qiming
> Research Scientist
> IBM Research - China
> tengqim at cn dot ibm dot com
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
