[openstack-dev] [Nova][Heat] How to reliably detect VM failures?
tengqim at linux.vnet.ibm.com
Wed Mar 19 00:51:09 UTC 2014
On Tue, Mar 18, 2014 at 09:42:18AM -0700, Steven Dake wrote:
> On 03/18/2014 07:54 AM, Qiming Teng wrote:
> >Hi, Folks,
> > I have been trying to implement a HACluster resource type in Heat. I
> >haven't created a BluePrint for this because I am not sure everything
> >will work as expected.
> > The basic idea is to extend the OS::Heat::ResourceGroup resource type
> >with inner resource types fixed to be OS::Nova::Server. Properties for
> >this HACluster resource may include:
> > - init_size: initial number of Server instances;
> > - min_size: minimal number of Server instances;
> > - sig_handler: a reference to a sub-class of SignalResponder;
> > - zones: a list of strings representing the availability zones, which
> > could be the names of the racks where the Servers can be booted;
> > - recovery_action: a list of supported failure recovery actions, such
> > as 'restart', 'remote-restart', 'migrate';
> > - fencing_options: a dict specifying how to shut down the Server
> > cleanly so that data consistency in storage and network is
> > preserved;
> > - resource_ref: a dict for defining the Server instances to be
> > created.
> > Attributes of the HACluster may include:
> > - refs: a list of resource IDs for the currently active Servers;
> > - ips: a list of IP addresses for convenience.
> > Note that the 'remote-restart' action above is today referred to as
> > The most difficult issue here is to come up with a reliable VM failure
> >detection mechanism. The service_group feature in Nova is concerned only
> >with the OpenStack services themselves, not the VMs. Considering that
> >user-provided images can be used in our customer's cloud environment,
> >we cannot assume that agents in the VMs will send heartbeat signals.
> > I have checked the 'instances' table in the Nova database; it seems that
> >the 'updated_at' column is only updated when a VM's state changes and is
> >reported. If 'heartbeat' messages were coming in from many VMs very
> >frequently, there could be a DB query performance/scalability issue.
> > So, how can I detect VM failures reliably, so that I can notify Heat
> >to take the appropriate recovery action?
> Check out
> You should be able to use the HARestarter resource and functionality
> to do healthchecking of a vm.
> It would be cool if nova could grow a feature to actively look at
> the vm's state internally and determine if it was healthy (e.g., look
> at its memory and see if the scheduler is running, things like that)
> but this would require individual support from each hypervisor for
> such functionality.
> Until that happens, healthchecking from within the vm seems like the
> only reasonable solution.
Yes, Steve, HARestarter is an option. I have spent quite a few days
getting the template you mentioned to work. Since I was using RAW, not
HEAT_CFNTOOLS, as the user_data_format for Servers, I passed the CFN
credentials, BOTO configs, and other files in using CloudConfig.
To make heat-cfntools happy, I had to:
- write the BOTO configs into /var/lib/heat-cfntools/cfn-boto-cfg
because cfn-init hardcodes the BOTO_CONFIG environment variable;
- provide AWS::CloudFormation::Init metadata to make cfn-init
happy, even though I was not using EC2::Instance for the VM server;
- provide fake AWS::StackName and AWS::Region values, since these are not
working properly now.
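
For illustration only, the first workaround could be scripted roughly as
below. This is a minimal sketch, not the template actually used: the
function name render_cloud_config, the 0600 permissions and the
placeholder config text are assumptions; only the target path comes from
the list above. It relies on cloud-init's write_files directive and
emits the body as JSON, which YAML parsers generally accept.

```python
import json

# Path taken from the workaround above; everything else is illustrative.
BOTO_CFG_PATH = "/var/lib/heat-cfntools/cfn-boto-cfg"


def render_cloud_config(boto_cfg_text, path=BOTO_CFG_PATH):
    """Return cloud-config user data that writes boto_cfg_text to path."""
    doc = {
        "write_files": [
            {
                "path": path,
                "permissions": "0600",  # assumption: keep credentials private
                "content": boto_cfg_text,
            }
        ]
    }
    # The first line marks the payload as cloud-config; the rest is JSON.
    return "#cloud-config\n" + json.dumps(doc, indent=2) + "\n"


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    placeholder = (
        "[Credentials]\n"
        "aws_access_key_id = PLACEHOLDER\n"
        "aws_secret_access_key = PLACEHOLDER\n"
    )
    print(render_cloud_config(placeholder))
```

The resulting string would be passed as the Server's user data; the real
boto settings for the CFN and CloudWatch endpoints still have to be
supplied by the template author.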
The VM instance can now contact the CFN endpoint and the CloudWatch
endpoint, and it correctly signals WaitCondition and sends other
messages. However, I see this as a solution that is tightly bound to
heat-cfntools, not generic enough, and one that may be deprecated some
day soon.
Then, back to my original question: what else can we do to reliably
detect VM failures?
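
To make the question concrete, here is a minimal sketch of timeout-based
detection, assuming some heartbeat source exists (an in-VM agent or a
hypervisor-level probe). The HeartbeatMonitor class and its methods are
hypothetical, not an existing Nova or Heat API. Timestamps are kept in
memory, so a heartbeat does not cost a database write.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per server and report overdue servers."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence before a server is suspect
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}     # server_id -> time of the last heartbeat

    def beat(self, server_id):
        """Record a heartbeat from server_id."""
        self.last_seen[server_id] = self.clock()

    def suspects(self):
        """Return the sorted IDs of servers silent for longer than timeout."""
        now = self.clock()
        return sorted(
            sid for sid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    fake_now = [0.0]
    monitor = HeartbeatMonitor(timeout=30, clock=lambda: fake_now[0])
    monitor.beat("vm-a")
    monitor.beat("vm-b")
    fake_now[0] = 20.0
    monitor.beat("vm-b")       # vm-b keeps reporting, vm-a goes silent
    fake_now[0] = 40.0
    print(monitor.suspects())  # vm-a is 40s overdue, vm-b only 20s
```

A missed heartbeat only shows that a server is unreachable, not that it
is dead, which is why fencing is still needed before any recovery
action.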
We have noticed VM HA support in Windows Azure, CloudStack,
VMware vSphere, and even Linux-HA, for example. It would be highly
desirable to have some support in OpenStack. Our customers keep asking
for this feature, anyway.