[openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Qiming Teng tengqim at linux.vnet.ibm.com
Wed Mar 19 00:51:09 UTC 2014



On Tue, Mar 18, 2014 at 09:42:18AM -0700, Steven Dake wrote:
> On 03/18/2014 07:54 AM, Qiming Teng wrote:
> >Hi, Folks,
> >
> >   I have been trying to implement a HACluster resource type in Heat. I
> >haven't created a BluePrint for this because I am not sure everything
> >will work as expected.
> >
> >   The basic idea is to extend the OS::Heat::ResourceGroup resource type
> >with inner resource types fixed to be OS::Nova::Server.  Properties for
> >this HACluster resource may include:
> >
> >   - init_size: initial number of Server instances;
> >   - min_size: minimal number of Server instances;
> >   - sig_handler: a reference to a sub-class of SignalResponder;
> >   - zones: a list of strings representing the availability zones, which
> >           could be the names of the racks where the Servers can be booted;
> >   - recovery_action: a list of supported failure recovery actions, such
> >       as 'restart', 'remote-restart', 'migrate';
> >   - fencing_options: a dict specifying what to do to shut down the Server
> >       in a clean way so that data consistency in storage and network is
> >       preserved;
> >   - resource_ref: a dict for defining the Server instances to be
> >       created.
> >
> >   Attributes of the HACluster may include:
> >   - refs: a list of resource IDs for the currently active Servers;
> >   - ips: a list of IP addresses for convenience.
> >
> >   Note that the 'remote-restart' action above is today referred to as
> >'evacuate'.
> >
> >   The most difficult issue here is to come up with a reliable VM failure
> >detection mechanism.  The service_group feature in Nova is only concerned
> >with the OpenStack services themselves, not the VMs.  Since user-provided
> >images can be used in our customers' cloud environments, we cannot assume
> >that agents inside the VMs will send heartbeat signals.
> >
> >   I have checked the 'instances' table in the Nova database, and it seems
> >that the 'updated_at' column is only updated when a VM state change is
> >reported.  If 'heartbeat' messages were coming in from many VMs very
> >frequently, there could be a DB query performance/scalability issue,
> >right?
> >
> >   So, how can I detect VM failures reliably, so that I can notify Heat
> >to take the appropriate recovery action?
> Qiming,
> 
> Check out
> 
> https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template
> 
> You should be able to use the HARestarter resource and functionality
> to do healthchecking of a vm.
> 
> It would be cool if nova could grow a feature to actively look at
> the vm's state internally and determine if it was healthy (eg look
> at its memory and see if the scheduler is running, things like that)
> but this would require individual support from each hypervisor for
> such functionality.
> 
> Until that happens, healthchecking from within the vm seems like the
> only reasonable solution.
> 
> Regards
> -steve
> 

Yes, Steve, HARestarter is an option.  I have spent quite a few days
trying to make the template you mentioned work.  Since I was using RAW,
not HEAT_CFNTOOLS, as the user_data_format for Servers, I passed the CFN
credentials and BOTO configs, among other files, using CloudConfig.
To make heat-cfntools happy, I had to:

  - write the BOTO configs into /var/lib/heat-cfntools/cfn-boto-cfg
    because cfn-init hardcodes the BOTO_CONFIG environment variable.
  - provide AWS::CloudFormation::Init metadata to make cfn-init
    happy, even though I was not using EC2::Instance for the VM server.
  - provide fake AWS::StackName and AWS::Region values, since these are
    not working properly now.
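
As a rough sketch of the first item, the cloud-config user data could be
generated as below.  The helper name build_cloud_config, the keys, and the
endpoint hosts are made-up placeholders, and the [Boto] option names are an
assumption about what heat-cfntools reads, so they should be checked against
the installed version.  The body is emitted as JSON, which is a subset of
YAML, to avoid depending on a YAML library.

```python
import json

# Path taken from the list above.
BOTO_CFG_PATH = "/var/lib/heat-cfntools/cfn-boto-cfg"


def build_cloud_config(access_key, secret_key, cfn_host, watch_host):
    """Return cloud-config user data that writes the BOTO config file."""
    boto_cfg = "\n".join([
        "[Credentials]",
        "aws_access_key_id = %s" % access_key,
        "aws_secret_access_key = %s" % secret_key,
        "[Boto]",
        # Assumed option names; verify against your heat-cfntools version.
        "cfn_region_name = heat",
        "cfn_region_endpoint = %s" % cfn_host,
        "cloudwatch_region_name = heat",
        "cloudwatch_region_endpoint = %s" % watch_host,
        "",
    ])
    doc = {
        "write_files": [
            {"path": BOTO_CFG_PATH, "permissions": "0600", "content": boto_cfg},
        ],
    }
    # The "#cloud-config" first line marks the payload type for cloud-init.
    return "#cloud-config\n" + json.dumps(doc, indent=2)


if __name__ == "__main__":
    # Placeholder credentials and hosts, for illustration only.
    print(build_cloud_config("AKEY", "SKEY", "cfn.example.com", "watch.example.com"))
```

The resulting string would be passed as the user_data of the Server.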

The VM instance can now contact the CFN and CloudWatch endpoints,
correctly signal WaitCondition, and send other messages.  However, I
see this as a solution tightly bound to heat-cfntools: it is not generic
enough, and it may be deprecated some day soon.

Then, back to my original question.  What else can we do to
reliably detect VM failures?
----------------------------
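
To make the question concrete, here is a minimal sketch of one agentless
approach: probe each VM from outside and treat it as failed only after
several consecutive misses, so that a single lost probe does not trigger
recovery.  The FailureDetector class, the probe callable, and the threshold
of 3 are all hypothetical; none of this is an existing Nova or Heat API, and
what the probe should actually check is exactly the open question.

```python
class FailureDetector:
    """Flag a VM as failed after `threshold` consecutive failed probes."""

    def __init__(self, probe, threshold=3):
        # probe is a hypothetical callable: vm_id -> bool (True = healthy)
        self.probe = probe
        self.threshold = threshold
        self.misses = {}  # vm_id -> number of consecutive failed probes

    def check(self, vm_id):
        """Probe once; return True if the VM should now be treated as failed."""
        if self.probe(vm_id):
            # A healthy answer resets the miss counter.
            self.misses[vm_id] = 0
            return False
        self.misses[vm_id] = self.misses.get(vm_id, 0) + 1
        return self.misses[vm_id] >= self.threshold


if __name__ == "__main__":
    # Simulated probe results for one VM: healthy once, then three misses.
    results = iter([True, False, False, False])
    detector = FailureDetector(lambda vm_id: next(results))
    for _ in range(4):
        print(detector.check("vm-1"))
```

Only the last check in the simulated run reports a failure, which is the
point at which Heat would be notified to take a recovery action.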

We have noticed VM HA support in Windows Azure[1], CloudStack[2],
VMware vSphere[3], and even Linux-HA[4], for example.  It would be highly
desirable to have some support in OpenStack.  Our customers keep asking
for this feature, anyway.

Regards,
  Qiming

[1] http://www.windowsazure.com/en-us/documentation/articles/manage-availability-virtual-machines/
[2] http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.0.2/html/Admin_Guide/ha-enabled-vm.html
[3] http://www.vmware.com/products/vsphere/features-high-availability
[4] http://linux-ha.org/doc/man-pages/re-ra-VirtualDomain.html
