[openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Qiming Teng tengqim at linux.vnet.ibm.com
Tue Mar 18 14:54:08 UTC 2014

Hi, Folks,

  I have been trying to implement a HACluster resource type in Heat. I
haven't created a blueprint for this because I am not sure everything
will work as expected.

  The basic idea is to extend the OS::Heat::ResourceGroup resource type
with inner resource types fixed to be OS::Nova::Server.  Properties for
this HACluster resource may include:

  - init_size: initial number of Server instances;
  - min_size: minimum number of Server instances;
  - sig_handler: a reference to a sub-class of SignalResponder;
  - zones: a list of strings representing the availability zones, which
          could be the names of the racks where the Servers can be booted;
  - recovery_action: a list of supported failure recovery actions, such
      as 'restart', 'remote-restart', 'migrate';
  - fencing_options: a dict specifying what to do to shut down the Server
      in a clean way so that data consistency in storage and network are
  - resource_ref: a dict for defining the Server instances to be

  Attributes of the HACluster may include:
  - refs: a list of resource IDs for the currently active Servers;
  - ips: a list of IP addresses for convenience.
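
  As a rough illustration only, the properties listed above could be
modelled as below.  This is plain Python, not Heat's actual property
schema API; the class name, the field types, and the validation rules
(for example, that init_size must not be below min_size) are assumptions
made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Recovery actions named in the proposal; anything else is rejected.
RECOVERY_ACTIONS = ("restart", "remote-restart", "migrate")


@dataclass
class HAClusterProperties:
    """Hypothetical container for the proposed HACluster properties."""
    init_size: int
    min_size: int
    sig_handler: str  # reference to a SignalResponder sub-class
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=list)
    fencing_options: Dict[str, object] = field(default_factory=dict)
    resource_ref: Dict[str, object] = field(default_factory=dict)

    def validate(self):
        """Raise ValueError on inconsistent values (assumed rules)."""
        if self.min_size < 0:
            raise ValueError("min_size must not be negative")
        if self.init_size < self.min_size:
            raise ValueError("init_size must be at least min_size")
        unknown = [a for a in self.recovery_action
                   if a not in RECOVERY_ACTIONS]
        if unknown:
            raise ValueError("unsupported recovery actions: %s" % unknown)


props = HAClusterProperties(
    init_size=3,
    min_size=2,
    sig_handler="my_signal_handler",  # placeholder reference
    zones=["rack-1", "rack-2"],
    recovery_action=["restart", "migrate"],
)
props.validate()
```

  The 'refs' and 'ips' attributes are left out of the sketch because
they would be computed from the running Servers rather than supplied by
the user.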

  Note that the 'remote-restart' action above is today referred to as

  The most difficult issue here is to come up with a reliable VM failure
detection mechanism.  The service_group feature in Nova only concerns
the OpenStack services themselves, not the VMs.  Since user-provided
images can be used in our customer's cloud environment, we cannot assume
that agents inside the VMs will send heartbeat signals.

  I have checked the 'instances' table in the Nova database, and it
seems that the 'updated_at' column is only updated when a VM state
change is reported.  If the 'heartbeat' messages are coming in from many VMs very
frequently, there could be a DB query performance/scalability issue,

  So, how can I detect VM failures reliably, so that I can notify Heat
to take the appropriate recovery action?
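
  To make the two concerns above concrete, here is a small sketch in
plain Python.  It does not touch Nova or its database; the function
names, the VM ids, and the numbers (10000 VMs, a 10 second interval, a
1 minute timeout) are all made up for illustration.  The first function
estimates the write load if every VM refreshed a row once per interval.
The second shows why a timestamp that only changes on state changes is a
poor heartbeat: a healthy but idle VM looks stale.

```python
from datetime import datetime, timedelta


def heartbeat_writes_per_second(num_vms, interval_seconds):
    """Average row updates per second if each VM reports once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_vms / interval_seconds


def stale_instances(last_update, now, timeout):
    """Return the ids whose last recorded update is older than `timeout`."""
    return sorted(vm for vm, ts in last_update.items() if now - ts > timeout)


now = datetime(2014, 3, 18, 15, 0, 0)

# Hypothetical snapshot of last-update timestamps.  vm-a changed state 20
# seconds ago; vm-b is running normally but has had no state change for an
# hour, so its timestamp is old even though nothing is wrong with it.
last_update = {
    "vm-a": now - timedelta(seconds=20),
    "vm-b": now - timedelta(hours=1),
}

suspects = stale_instances(last_update, now, timedelta(minutes=1))
rate = heartbeat_writes_per_second(10000, 10)

print("flagged as stale:", suspects)
print("row updates per second:", rate)
```

  With these made-up numbers, vm-b is flagged although it is healthy,
and a real per-VM heartbeat written to the database would cost about
1000 row updates per second on average.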

  - Qiming

Research Scientist
IBM Research - China
tengqim at cn dot ibm dot com
