[openstack-dev] [heat][nova] VM restarting on host failure in convergence

Clint Byrum clint at fewbar.com
Wed Sep 17 19:04:14 UTC 2014


Excerpts from Jastrzebski, Michal's message of 2014-09-17 06:03:06 -0700:
> All,
> 
> Currently OpenStack does not have a built-in HA mechanism for tenant
> instances which could restore virtual machines in case of a host
> failure. OpenStack assumes every app is designed for failure and can
> handle instance failure and will self-remediate, but that is rarely
> the case for the very large Enterprise application ecosystem.
> Many existing enterprise applications are stateful, and assume that
> the physical infrastructure is always on.
> 

There is a fundamental debate that OpenStack's vendors need to work out
here. Existing applications are well served by existing virtualization
platforms. Turning OpenStack into a work-alike to oVirt is not the end
goal here. It's a happy accident that traditional apps can sometimes be
bent onto the cloud without much modification.

What clouds do is give development teams a _limited_
infrastructure that lets IT do what they're good at (keep the
infrastructure up) and lets development teams do what they're good at (run
their app). By putting HA into the _app_, and not the _infrastructure_,
the dev teams get agility and scalability. No more waiting weeks for
allocating specialized servers with hardware fencing setups and fibre
channel controllers to house a shared disk system so the super reliable
virtualization can hide HA from the user.

Spin up VMs. Spin up volumes. Run some replication between regions,
and be resilient.

So, as long as it is understood that whatever is being proposed should
be an application-centric feature, and not an infrastructure-centric
feature, this argument remains interesting in the "cloud" context.
Otherwise, it is just an invitation for OpenStack to open up direct
competition with behemoths like vCenter.

> Even the OpenStack controller services themselves do not gracefully
> handle failure.
> 

Which ones?

> When these applications were virtualized, they were virtualized on
> platforms that enabled very high SLAs for each virtual machine,
> allowing the application to not be rewritten as the IT team moved them
> from physical to virtual. Now while these apps cannot benefit from
> methods like automatic scaleout, the application owners will greatly
> benefit from the self-service capabilities they will receive as they
> utilize the OpenStack control plane.
> 

These apps were virtualized for IT's benefit. But the application authors
and users are now stuck in high-cost virtualization. The cloud is best
utilized when IT can control that cost and shift the burden of uptime
to the users by offering them more overall capacity and flexibility with
the caveat that the individual resources will not be as reliable.

So what I'm most interested in is helping authors change their apps to
be resilient on their own, not in putting more burden on IT.

> I'd like to suggest expanding the Heat convergence mechanism to enable
> self-remediation of virtual machines and other Heat resources.
> 

Convergence is still nascent. I don't know if I'd pile on to what might
take another 12-18 months to get done anyway. We're only now figuring
out how to get started, when we thought we might already be a third of
the way through. Just something to consider.


