[openstack-dev] [heat] Re: deliver the vm-level HA to improve the business continuity with openstack

Qiming Teng tengqim at linux.vnet.ibm.com
Tue Apr 15 10:16:16 UTC 2014


What I saw in this thread are several topics:

1) Is VM HA really relevant (in a cloud)?

This is the most difficult question to answer, because it really depends
on who you are talking to and which user community you are facing.
IMHO, for most web-based applications that are born to run on the cloud,
a certain level of business resiliency has probably already been built
into the code, so the application or service can live happily when VMs
come and go.

For traditional business applications, the scenario may be quite
different.  These apps are migrated to the cloud for reasons like cost
savings, server consolidation, etc.  Quite a few companies are
evaluating OpenStack for their "private cloud" -- which is a weird term,
IMHO.

In addition to this, while we are looking into the 'utility' vision of
cloud, we can still ask ourselves: a) can we survive one month of power
outage or water outage, even though there is abundant supply elsewhere
on this planet? b) what are the costs we need to pay if we eventually
make it? c) do we want to pay for this?

My personal experience is that our customers really want this feature
(VM HA) for their private clouds.  The question they asked us was:

"
  Does OpenStack support VM HA?  Maybe not for all VMs...
  We know we can have that using vSphere, Azure, or CloudStack...
"


2) Where is the best location to provide VM HA?

Suppose that we do feel the need to support VM HA; then the questions
that follow would be 'where' and 'how'.

Considering that a VM is not merely a bundle of compute processes but
actually a virtual execution environment that consumes resources like
storage and network bandwidth besides processor cycles, Nova may NOT be
the ideal location to deal with this cross-cutting concern.

High availability involves redundant resource provisioning, effective
failure detection and appropriate fail-over policies, including fencing.
Imposing all these requirements on Nova is impractical.  We may need to 
consider whether VM HA, if ever implemented/supported, should be part of 
the orchestration service, aka Heat.


3) Can/should we do the VM HA orchestration in Heat?

My perception is that it can be done in Heat, based on my limited
understanding of how Heat works.  It may impose some requirements on
other projects (e.g. nova, cinder, neutron ...) as well, though Heat
should be the orchestrator.

What do we need then?

  - A resource type for VM groups/clusters, for the redundant
    provisioning.  VMs in the group can be identical instances, managed 
    by a Pacemaker setup among the VMs, just like a WatchRule in Heat can 
    be controlled by Ceilometer.  

    Another way to do this is to have the VMs monitored via heartbeat
    messages sent by Nova (if possible/needed), or by some services
    injected into the VMs (consider what cfn-hup and cfn-signal do today).

    Either way, the VM group/cluster can decide how to react to a VM
    online/offline signal.  It may choose to a) restart the VM in-place; b)
    remote-restart (aka evacuate) the VM somewhere else; c) live/cold
    migrate the VM to other nodes.

    The policies can be outsourced to other plugins, considering global
    load-balancing or power management requirements.  But that is an
    advanced feature that warrants another blueprint.

  - Some fencing support from nova, cinder, neutron to shoot the bad VMs
    in the head so a VM that cannot be reached is guaranteed to be cleanly
    killed.

  - VM failure detectors that can reliably tell whether a VM has failed.  
    Sometimes a VM that failed the expected performance goal should be
    treated as failed as well, if we really want to be strict on this.

    A failure detector can reside inside Nova, as has been done for the
    'service groups' there.  It can also reside inside a VM, as a service
    installed there, sending out heartbeat messages (before the battery
    runs out, :))

  - A generic signaling mechanism that allows secure message delivery
    back to Heat, indicating whether a VM is alive or dead.
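
To make the last two items a bit more concrete, here is a minimal
sketch of a heartbeat-based failure detector that only accepts signed
messages.  Nothing here is an existing Heat or Nova API: the names
(sign_heartbeat, verify_heartbeat, FailureDetector), the shared SECRET,
the message layout and the 30-second timeout are all assumptions made
up for illustration, and HMAC-SHA256 is just one possible way to
authenticate the signal.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; a real setup would provision one per VM.
SECRET = b"per-vm-secret"


def sign_heartbeat(vm_id, timestamp, secret=SECRET):
    """Build a heartbeat message carrying an HMAC-SHA256 signature."""
    payload = json.dumps({"vm_id": vm_id, "ts": timestamp}, sort_keys=True)
    sig = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}


def verify_heartbeat(message, secret=SECRET):
    """Return the decoded payload if the signature is valid, else None."""
    expected = hmac.new(
        secret, message["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, message["sig"]):
        return None
    return json.loads(message["payload"])


class FailureDetector:
    """Treat a VM as failed when its last valid heartbeat is too old."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.last_seen = {}  # vm_id -> timestamp of newest valid heartbeat

    def receive(self, message):
        """Record a heartbeat; return False if the signature is invalid."""
        data = verify_heartbeat(message)
        if data is None:
            return False
        previous = self.last_seen.get(data["vm_id"], data["ts"])
        self.last_seen[data["vm_id"]] = max(previous, data["ts"])
        return True

    def failed_vms(self, now):
        """List the VMs whose newest heartbeat is older than the timeout."""
        return sorted(
            vm for vm, ts in self.last_seen.items() if now - ts > self.timeout
        )
```

The detector only knows about VMs that have reported at least once, so
a VM that never comes up would have to be caught some other way (for
example by the group resource registering its members up front).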

My current understanding is that we may be able to avoid a complicated
task-flow here.
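
As a rough illustration of why a task-flow may not be needed, the
reaction of a VM group to an online/offline signal could be a single
policy lookup.  The sketch below is hypothetical: the Action names and
the choose_action function are made up here, and the mapping from
states to the three choices listed above (in-place restart, evacuate,
migrate) is only one assumed policy, not something Heat provides.

```python
from enum import Enum


class Action(Enum):
    """Recovery choices named in the list above, plus a no-op."""

    RESTART_IN_PLACE = "restart"
    EVACUATE = "evacuate"
    MIGRATE = "migrate"
    NONE = "none"


def choose_action(vm_alive, host_alive, host_draining=False):
    """Pick a recovery action from the reported VM and host state.

    Assumed policy: a dead host forces an evacuation, a dead VM on a
    healthy host is restarted in place, and a healthy VM on a host
    being drained (e.g. for power management) is migrated away.
    """
    if not host_alive:
        return Action.EVACUATE
    if not vm_alive:
        return Action.RESTART_IN_PLACE
    if host_draining:
        return Action.MIGRATE
    return Action.NONE
```

A pluggable policy, as mentioned above, would then amount to swapping
out this one function.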

Regards,
  - Qiming


> >>For the most part we've been trying to encourage projects that want to
> >>control VMs to add such functionality to the Orchestration program, aka
> >>"Heat".
> >Yes, exactly.
> >
> >-jay
> >
> Hey folks,
> 
> Just as a note for HA for VMs, our current heat-core thinking is our
> HARestarter resource functionality is a workflow (Restarter is a
> verb, rather than a Noun - Heat orchestrates Nouns) and would be
> better suited to a workflow service like Mistral.  Clearly we don't
> know how to get from where we are today to the proper separation of
> concerns as pointed out by Zane Bitter in recent threads on the ml
> but just throwing this out there so folks are aware.
> 
> Regards
> -steve
> 



