[openstack-dev] [Nova][Heat] How to reliably detect VM failures?
Zane Bitter
zbitter at redhat.com
Tue Mar 18 17:18:50 UTC 2014
On 18/03/14 12:42, Steven Dake wrote:
> On 03/18/2014 07:54 AM, Qiming Teng wrote:
>> Hi, Folks,
>>
>> I have been trying to implement a HACluster resource type in Heat. I
>> haven't created a BluePrint for this because I am not sure everything
>> will work as expected.
>>
>> The basic idea is to extend the OS::Heat::ResourceGroup resource type
>> with inner resource types fixed to be OS::Nova::Server. Properties for
>> this HACluster resource may include:
>>
>> - init_size: initial number of Server instances;
>> - min_size: minimal number of Server instances;
>> - sig_handler: a reference to a sub-class of SignalResponder;
>> - zones: a list of strings representing the availability zones, which
>> could be the names of the racks where the Servers can be booted;
>> - recovery_action: a list of supported failure recovery actions, such
>> as 'restart', 'remote-restart', 'migrate';
>> - fencing_options: a dict specifying how to shut down the Server
>> cleanly so that data consistency in storage and network is
>> preserved;
>> - resource_ref: a dict for defining the Server instances to be
>> created.
>>
>> Attributes of the HACluster may include:
>> - refs: a list of resource IDs for the currently active Servers;
>> - ips: a list of IP addresses for convenience.
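
For reference, the proposed properties could be modelled roughly as follows. This is only a sketch in plain Python, not Heat's actual property schema API: the class name HAClusterProps, the defaults, and the validation rules are assumptions for illustration. Only the property names and the three action names come from the proposal above; sig_handler and resource_ref are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Action names are taken from the proposal; everything else is illustrative.
SUPPORTED_ACTIONS = ("restart", "remote-restart", "migrate")


@dataclass
class HAClusterProps:
    """Hypothetical container for the proposed HACluster properties."""
    init_size: int
    min_size: int
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=lambda: ["restart"])
    fencing_options: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        # Assumed rules: sizes must be sane and actions must be known.
        if self.min_size < 0:
            raise ValueError("min_size must be non-negative")
        if self.init_size < self.min_size:
            raise ValueError("init_size must be at least min_size")
        unknown = [a for a in self.recovery_action
                   if a not in SUPPORTED_ACTIONS]
        if unknown:
            raise ValueError("unsupported recovery actions: %s" % unknown)
```

Calling validate() on HAClusterProps(init_size=1, min_size=2) would raise ValueError under these assumed rules.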
>>
>> Note that the 'remote-restart' action above is today referred to as
>> 'evacuate'.
>>
>> The most difficult issue here is to come up with a reliable VM failure
>> detection mechanism. The service_group feature in Nova is concerned only
>> with the OpenStack services themselves, not the VMs. Because user-provided
>> images can be used in our customer's cloud environment, we cannot assume
>> that agents inside the VMs will send heartbeat signals.
>>
>> I have checked the 'instances' table in the Nova database. It seems that
>> the 'updated_at' column is only updated when a VM state change is
>> reported. If 'heartbeat' messages were coming in from many VMs very
>> frequently, there could be a DB query performance/scalability issue,
>> right?
>>
>> So, how can I detect VM failures reliably, so that I can notify Heat
>> to take the appropriate recovery action?
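
To make the question concrete: one way to detect failures without an in-VM agent is to probe each server from outside and only declare a failure after several consecutive misses, which avoids reacting to a single lost probe. The sketch below assumes this approach; FailureDetector, the probe callable, and the threshold are all hypothetical, and the probe itself (a ping, a TCP connect, a status query) is left to the caller. Nothing here is existing Nova or Heat functionality.

```python
from typing import Callable, Dict, List


class FailureDetector:
    """Flags a server as failed after `threshold` consecutive failed probes."""

    def __init__(self, probe: Callable[[str], bool],
                 threshold: int = 3) -> None:
        self.probe = probe          # returns True if the server responded
        self.threshold = threshold
        self.misses: Dict[str, int] = {}

    def check(self, server_ids: List[str]) -> List[str]:
        """Probe every server once; return those that just hit the threshold."""
        failed = []
        for sid in server_ids:
            if self.probe(sid):
                self.misses[sid] = 0
                continue
            self.misses[sid] = self.misses.get(sid, 0) + 1
            # Report exactly once, on the probe that reaches the threshold,
            # so that a recovery action is not triggered repeatedly.
            if self.misses[sid] == self.threshold:
                failed.append(sid)
        return failed
```

The caller would run check() periodically and pass the returned IDs to whatever signals Heat. A successful probe resets the counter, so a server that recovers on its own is never reported.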
> Qiming,
>
> Check out
>
> https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template
>
>
> You should be able to use the HARestarter resource and functionality to
> do healthchecking of a vm.
HARestarter is actually pretty problematic, both in a "causes major
architectural headaches for Heat and will probably be deprecated very
soon" sense and a "may do very unexpected things to your resources"
sense. I wouldn't recommend it.
cheers,
Zane.
> It would be cool if nova could grow a feature to actively look at the
> vm's state internally and determine if it was healthy (e.g. look at its
> memory and see if the scheduler is running, things like that) but this
> would require individual support from each hypervisor for such
> functionality.
>
> Until that happens, healthchecking from within the vm seems like the
> only reasonable solution.
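
If health checks do run inside the VM, the receiving side could keep the last-seen timestamps in memory rather than writing each heartbeat to the database, which sidesteps the scalability concern raised earlier in the thread. A minimal sketch under that assumption follows; HeartbeatTracker and its methods are hypothetical names, and how heartbeats actually arrive (HTTP, a message queue) is out of scope.

```python
import time
from typing import Dict, List, Optional


class HeartbeatTracker:
    """Keeps last-seen times in memory and reports servers that went quiet."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout               # seconds of silence tolerated
        self.last_seen: Dict[str, float] = {}

    def beat(self, server_id: str, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected for testing."""
        self.last_seen[server_id] = time.monotonic() if now is None else now

    def stale(self, now: Optional[float] = None) -> List[str]:
        """Return IDs whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(sid for sid, ts in self.last_seen.items()
                      if now - ts > self.timeout)
```

Note that a server which has never sent a heartbeat is not tracked at all, so a VM that fails before its first report would need to be caught some other way.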
>
> Regards
> -steve
>
>> Regards,
>> - Qiming
>>
>> Research Scientist
>> IBM Research - China
>> tengqim at cn dot ibm dot com
>>
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev