[openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Qiming Teng tengqim at linux.vnet.ibm.com
Thu Mar 20 02:38:50 UTC 2014

On Wed, Mar 19, 2014 at 12:08:30PM -0400, Zane Bitter wrote:
> On 19/03/14 02:07, Chris Friesen wrote:
> >On 03/18/2014 11:18 AM, Zane Bitter wrote:
> >>On 18/03/14 12:42, Steven Dake wrote:
> >
> >>>You should be able to use the HARestarter resource and functionality to
> >>>do healthchecking of a vm.
> >>
> >>HARestarter is actually pretty problematic, both in a "causes major
> >>architectural headaches for Heat and will probably be deprecated very
> >>soon" sense and a "may do very unexpected things to your resources"
> >>sense. I wouldn't recommend it.
> >
> >Could you elaborate?  What unexpected things might it do?  And what are
> >the alternatives?
> First of all, despite the name, it doesn't just restart but actually
> deletes the server that it's monitoring and recreates an entirely
> new one. It also deletes any resources which directly or indirectly
> depend on the server being monitored and recreates them too.
> The alternative is to use Ceilometer alarms and/or some external
> monitoring system and implement recovery yourself, since the
> strategy you want depends on both your application and the type of
> failure.
> Another avenue being explored in Heat is to have a general way of
> bringing a stack back into line with its template:
> https://blueprints.launchpad.net/heat/+spec/stack-convergence
> cheers,
> Zane.

Thanks, Zane.  Though I wasn't able to make the HA sample template work
in my environment (primarily due to some CloudWatch token authentication
failures), I did get some hands-on experience with how 'HARestarter'
actually does the VM 'restart' work.  From HARestarter's perspective, a
VM is just a resource that can be recreated.  This is simple and
effective, but too brutal a way to 'restart' VM servers. ;)

What I am trying to do is achieve a certain level of HA for VMs that
are treated as black boxes.  When something bad happens, a VM health
monitoring system should quickly detect it and report it to Heat, so
that Heat can decide, based on a user-specified policy, to

  1) reboot or rebuild the VM with the same identity, or
  2) evacuate (i.e. remote-restart) it on another host, or
  3) migrate it to another host.

The recovery actions above, for Heat, are just invocations of Nova APIs.
But I am not suggesting that VM failures should be handled in Nova
directly. IMHO, this level of orchestration should go to Heat.
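
To make the idea a bit more concrete, here is a rough Python sketch of
how a user-specified policy could map a detected failure to one of the
recovery actions listed above. The failure categories, the default
policy, and the `nova_actions` table of callables are all hypothetical
names made up for illustration; they are not existing Heat or Nova
interfaces, and the callables merely stand in for the real Nova API
invocations.

```python
from enum import Enum


class Failure(Enum):
    """Hypothetical failure categories reported by a health monitor."""
    GUEST_HUNG = "guest_hung"        # VM unresponsive, host looks healthy
    GUEST_CORRUPT = "guest_corrupt"  # VM state is damaged
    HOST_DOWN = "host_down"          # compute host has failed
    HOST_DEGRADED = "host_degraded"  # host is up but unhealthy


# Assumed default policy; a user would override entries as needed.
DEFAULT_POLICY = {
    Failure.GUEST_HUNG: "reboot",
    Failure.GUEST_CORRUPT: "rebuild",
    Failure.HOST_DOWN: "evacuate",
    Failure.HOST_DEGRADED: "migrate",
}

ALLOWED_ACTIONS = {"reboot", "rebuild", "evacuate", "migrate"}


def choose_action(failure, user_policy=None):
    """Return the recovery action name for a failure, honoring overrides."""
    policy = dict(DEFAULT_POLICY)
    policy.update(user_policy or {})
    action = policy[failure]
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unsupported recovery action: %s" % action)
    return action


def recover(server_id, failure, nova_actions, user_policy=None):
    """Pick an action and invoke the matching callable.

    nova_actions maps an action name to a callable taking a server id;
    in a real system each callable would wrap a Nova API call.
    """
    action = choose_action(failure, user_policy)
    nova_actions[action](server_id)
    return action
```

The point of keeping `choose_action` separate from `recover` is that the
policy decision stays in the orchestration layer, while the actual work
remains a plain Nova call.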

To avoid messing up data consistency or the network setup, some fencing
operations need to be done. Blueprints for this are either under review
or being implemented in Cinder and Neutron.

I don't think it is a good idea to rely on an external monitoring system
for VM failure detection. It means additional steps to set up,
additional software to upgrade, an additional chapter in the Operator's
Guide, etc.  We are evaluating whether Ceilometer can do a good job.

Regarding the stack convergence work, it is a good starting point. If I
may suggest something, I'd like to see a separation between two concerns:

  - the robustness of the Heat (engine) itself, including API retry ...

  - the health monitoring of the stack created by Heat, which can be
    done either via active status polling or reactive event handling
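
As a rough illustration of the second point, the Python sketch below
shows how active polling and reactive event handling could feed the same
status table, so the rest of the system does not care which path
reported a change. The `HealthMonitor` class, the `get_status` callable,
and the event dictionary layout are assumptions made for this example,
not an existing Heat or Ceilometer interface.

```python
class HealthMonitor:
    """Tracks the last known status per resource.

    Polling and event handling both go through _update, so a status
    change is reported once no matter which path sees it first.
    """

    def __init__(self, on_unhealthy):
        self.status = {}
        self.on_unhealthy = on_unhealthy  # callback(resource_id, status)

    def _update(self, resource_id, new_status):
        old_status = self.status.get(resource_id)
        self.status[resource_id] = new_status
        # Notify only on a transition into a non-ACTIVE status.
        if new_status != "ACTIVE" and new_status != old_status:
            self.on_unhealthy(resource_id, new_status)

    def poll(self, resource_ids, get_status):
        """Active path: ask for the status of each resource."""
        for resource_id in resource_ids:
            self._update(resource_id, get_status(resource_id))

    def handle_event(self, event):
        """Reactive path: consume a pushed notification (assumed layout)."""
        self._update(event["resource_id"], event["status"])
```

For example, if an event reports a server as ERROR and a later poll
still sees ERROR, the callback fires only once, for the event.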

In this context, a cluster of VMs can be monitored as a single entity in
the stack.  When a stack convergence check is performed, such a cluster
(say, with 2 members) can report, for example:

  - Green: supposed to have 2 members (Servers) running, and they are
    both active now.

  - Yellow: supposed to have 2 members, but one is reported as down,
    trying to recover now.

  - Red: both Servers seem to have been hijacked by aliens; some action
    is needed now, rebuild me (the cluster) if needed.
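
The three states above could be computed with something as simple as the
following Python sketch. The function name and the GREEN/YELLOW/RED
return values are made up for illustration; the only thing assumed about
members is that each reports a Nova-style status string, with "ACTIVE"
meaning healthy.

```python
def cluster_health(expected, member_statuses):
    """Summarize a cluster as GREEN, YELLOW, or RED.

    expected: number of members the cluster is supposed to have.
    member_statuses: iterable of per-member status strings.
    """
    active = sum(1 for status in member_statuses if status == "ACTIVE")
    if active >= expected:
        return "GREEN"   # every expected member is running
    if active > 0:
        return "YELLOW"  # degraded; member-level recovery can proceed
    return "RED"         # nothing is running; the cluster needs action
```

With 2 expected members, two ACTIVE members give GREEN, one ACTIVE and
one ERROR give YELLOW, and no ACTIVE member gives RED.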

It would be good to have stack convergence perform per-resource-type
monitoring and recovery actions.

Just some random thoughts for discussion.

