[openstack-dev] [Nova][Heat] How to reliably detect VM failures?
Amit Ugol
augol at redhat.com
Wed Mar 19 06:06:51 UTC 2014
On Tue, Mar 18, 2014 at 10:54:08PM +0800, Qiming Teng wrote:
> Hi, Folks,
>
> I have been trying to implement a HACluster resource type in Heat. I
> haven't created a BluePrint for this because I am not sure everything
> will work as expected.
>
> The basic idea is to extend the OS::Heat::ResourceGroup resource type
> with inner resource types fixed to be OS::Nova::Server. Properties for
> this HACluster resource may include:
>
> - init_size: initial number of Server instances;
> - min_size: minimal number of Server instances;
> - sig_handler: a reference to a sub-class of SignalResponder;
> - zones: a list of strings representing the availability zones, which
> could be the names of the racks where the Servers can be booted;
> - recovery_action: a list of supported failure recovery actions, such
> as 'restart', 'remote-restart', 'migrate';
> - fencing_options: a dict specifying how to shut down the Server
> cleanly so that data consistency in storage and network is
> preserved;
> - resource_ref: a dict for defining the Server instances to be
> created.
>
> Attributes of the HACluster may include:
> - refs: a list of resource IDs for the currently active Servers;
> - ips: a list of IP addresses for convenience.
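To make the proposed property set concrete, here is a minimal sketch that
models it as a plain Python dataclass. This is not Heat code and does not use
Heat's properties schema; the class name HAClusterProperties, the validate()
method, and the rule that init_size must be at least min_size are all
assumptions made for illustration. sig_handler is left out because its type
depends on Heat internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Recovery actions named in the proposal above.
SUPPORTED_ACTIONS = ("restart", "remote-restart", "migrate")


@dataclass
class HAClusterProperties:
    """Hypothetical plain-Python model of the proposed properties (not Heat code)."""
    init_size: int
    min_size: int
    resource_ref: Dict[str, object]
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=list)
    fencing_options: Dict[str, object] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the values are inconsistent (assumed rules)."""
        if self.min_size < 0:
            raise ValueError("min_size must not be negative")
        if self.init_size < self.min_size:
            raise ValueError("init_size must be at least min_size")
        unknown = [a for a in self.recovery_action if a not in SUPPORTED_ACTIONS]
        if unknown:
            raise ValueError("unsupported recovery actions: %s" % ", ".join(unknown))
```

Calling validate() before creating any Server would reject, for example, an
init_size smaller than min_size or a recovery action outside the three listed.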
>
> Note that the 'remote-restart' action above is today referred to as
> 'evacuate'.
>
> The most difficult issue here is to come up with a reliable VM failure
> detection mechanism. The service_group feature in Nova is only concerned
> with the OpenStack services themselves, not the VMs. Considering that
> user-provided images can be used in our customer's cloud environment,
> we cannot assume that agents in the VMs will send heartbeat signals.
>
> I have checked the 'instances' table in the Nova database; it seems that
> the 'updated_at' column is only updated when a VM state change is
> reported. If 'heartbeat' messages were coming in from many VMs very
> frequently, there could be a DB query performance/scalability issue,
> right?
>
> So, how can I detect VM failures reliably, so that I can notify Heat
> to take the appropriate recovery action?
Hi,
Monitoring depends on what that VM is doing. For instance, a VM that hosts
a web server will not be monitored the same as an SQL server.
You might also want to take a look here:
http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Ceilometer::Alarm
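As a rough illustration of service-specific monitoring that needs no agent
inside the VM, the sketch below probes a TCP port from outside and reports a
failure only after several consecutive misses. The names tcp_probe and
FailureDetector and the default threshold of 3 are assumptions for this
example, not part of Nova, Heat, or Ceilometer. Note that an open port does
not prove the service is healthy, and a closed one may just be a network
problem.

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


class FailureDetector:
    """Hypothetical helper: flags a failure after N consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.probe = probe
        self.threshold = threshold
        self.misses = 0

    def check(self) -> bool:
        """Run one probe; return True once `threshold` probes in a row have failed."""
        if self.probe():
            self.misses = 0  # any success resets the streak
        else:
            self.misses += 1
        return self.misses >= self.threshold
```

A web server VM might be probed on its HTTP port and an SQL server VM on its
database port; the probe is just a callable, so each kind of VM can get its
own check. What to do once check() returns True (for instance, signalling
Heat) is left out here.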
>
> Regards,
> - Qiming
>
> Research Scientist
> IBM Research - China
> tengqim at cn dot ibm dot com
---end quoted text---
--
Best Regards,
Amit.