[openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Amit Ugol augol at redhat.com
Wed Mar 19 06:06:51 UTC 2014

On Tue, Mar 18, 2014 at 10:54:08PM +0800, Qiming Teng wrote:
> Hi, Folks,
>   I have been trying to implement a HACluster resource type in Heat. I
> haven't created a BluePrint for this because I am not sure everything
> will work as expected.
>   The basic idea is to extend the OS::Heat::ResourceGroup resource type
> with inner resource types fixed to be OS::Nova::Server.  Properties for
> this HACluster resource may include:
>   - init_size: initial number of Server instances;
>   - min_size: minimal number of Server instances;
>   - sig_handler: a reference to a sub-class of SignalResponder;
>   - zones: a list of strings representing the availability zones, which
>           could be the names of the racks where the Servers can be booted;
>   - recovery_action: a list of supported failure recovery actions, such
>       as 'restart', 'remote-restart', 'migrate';
>   - fencing_options: a dict specifying how to shut down the Server
>       cleanly so that data consistency in storage and network is
>       preserved;
>   - resource_ref: a dict for defining the Server instances to be
>       created.
>   Attributes of the HACluster may include:
>   - refs: a list of resource IDs for the currently active Servers;
>   - ips: a list of IP addresses for convenience.
>   Note that the 'remote-restart' action above is today referred to as
> 'evacuate'.
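
To make the proposed property set concrete, here is a minimal sketch of it
as a plain Python data structure. This is not Heat's properties schema API.
The class name, the validation rules (non-negative sizes, init_size at least
min_size) and the treatment of sig_handler as an optional string reference
are all assumptions for illustration; only the property names and the three
recovery actions come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Recovery actions named in the proposal; 'remote-restart' is the action
# referred to today as 'evacuate'.
SUPPORTED_RECOVERY_ACTIONS = ("restart", "remote-restart", "migrate")


@dataclass
class HAClusterProperties:
    """Hypothetical container for the proposed HACluster properties."""
    init_size: int                      # initial number of Server instances
    min_size: int                       # minimal number of Server instances
    resource_ref: Dict[str, object]     # definition of the Servers to create
    sig_handler: Optional[str] = None   # assumed: a name referring to a handler
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=list)
    fencing_options: Dict[str, object] = field(default_factory=dict)

    def validate(self) -> None:
        # Assumed rules, not taken from the proposal.
        if self.min_size < 0:
            raise ValueError("min_size must not be negative")
        if self.init_size < self.min_size:
            raise ValueError("init_size must be at least min_size")
        unknown = [a for a in self.recovery_action
                   if a not in SUPPORTED_RECOVERY_ACTIONS]
        if unknown:
            raise ValueError(
                "unsupported recovery actions: %s" % ", ".join(unknown))
```

Calling validate() on an instance raises ValueError for an unknown recovery
action or an init_size below min_size, and returns None otherwise.
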
>   The most difficult issue here is to come up with a reliable VM failure
> detection mechanism.  The service_group feature in Nova is only concerned
> with the OpenStack services themselves, not the VMs.  Because user-provided
> images can be used in our customer's cloud environment, we cannot assume
> that agents inside the VMs will send heartbeat signals.
>   I have checked the 'instances' table in the Nova database, and it seems
> that the 'updated_at' column is only updated when a VM state change is
> reported.  If 'heartbeat' messages were coming in from many VMs very
> frequently, there could be a DB query performance/scalability issue,
> right?
>   So, how can I detect VM failures reliably, so that I can notify Heat
> to take the appropriate recovery action?

Monitoring depends on what the VM is doing. For instance, a VM that hosts
a web server is not monitored the same way as one that hosts an SQL server.
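
As a rough illustration of role-specific monitoring, the sketch below probes
a VM from the outside, so it needs no agent inside the guest. The role names,
the port numbers (80 for web, 3306 for SQL) and the function names are
assumptions made up for this example. A successful TCP connect only shows
that something is listening, not that the service is actually healthy.

```python
import socket
from typing import Callable, Dict


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Hypothetical role-to-check mapping; a real deployment would supply its
# own roles, ports and deeper checks (HTTP status, SQL ping, and so on).
CHECKS: Dict[str, Callable[[str], bool]] = {
    "web": lambda host: tcp_port_open(host, 80),
    "sql": lambda host: tcp_port_open(host, 3306),
}


def vm_looks_healthy(role: str, host: str) -> bool:
    """Run the check registered for this role; unknown roles raise KeyError."""
    return CHECKS[role](host)
```

A monitor could call vm_looks_healthy("web", address) periodically and signal
Heat only after several consecutive failures, to avoid reacting to a single
dropped probe.
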

You might also want to take a look here:
> Regards,
>   - Qiming
> Research Scientist
> IBM Research - China
> tengqim at cn dot ibm dot com
---end quoted text---

Best Regards,