Open Stack

Thu Oct 16 08:52:41 UTC 2014

On Thu, Oct 16, 2014 at 9:25 AM, Jastrzebski, Michal
<michal.jastrzebski at intel.com> wrote:
> In my opinion flavor defining is a bit hacky. Sure, it will provide us
> functionality fairly quickly, but also will strip us from flexibility Heat
> would give. Healing can be done in several ways, simple destroy -> create
> (basic convergence workflow so far), evacuate with or without
> shared storage, even rebuild vm, probably few more when we put more thoughts
> to it.

But then you'd also need to monitor the availability of *individual*
guests, and down the rabbit hole you go.

So suppose you're monitoring a guest with a simple ping. And it stops
responding to that ping.

(1) Has it died?
(2) Is it just too busy to respond to the ping?
(3) Has its guest network stack died?
(4) Has its host vif died?
(5) Has the L2 agent on the compute host died?
(6) Has its host network stack died?
(7) Has the compute host died?
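
To make the ambiguity concrete, here is a minimal sketch of how little a
failed ping narrows things down. The function name, the two extra probes
(a ping to the compute host and an L2 agent heartbeat) and the elimination
rules are illustrative assumptions, not anything Nova or Neutron provides.

```python
from typing import List, Optional

# The seven candidate causes from the list above, keyed by their number.
CAUSES = {
    1: "guest has died",
    2: "guest is too busy to respond",
    3: "guest network stack has died",
    4: "host vif has died",
    5: "L2 agent on the compute host has died",
    6: "host network stack has died",
    7: "compute host has died",
}


def candidate_causes(guest_ping_ok: bool,
                     host_ping_ok: Optional[bool] = None,
                     l2_agent_alive: Optional[bool] = None) -> List[int]:
    """Return the cause numbers that the observations cannot rule out.

    None means "not probed". Both extra probes are hypothetical.
    """
    if guest_ping_ok:
        return []
    candidates = set(CAUSES)
    if host_ping_ok:
        # A host that answers pings has a working network stack and is up.
        candidates -= {6, 7}
    if l2_agent_alive:
        # A heartbeat cannot come from a dead agent or a dead host.
        candidates -= {5, 7}
    return sorted(candidates)


if __name__ == "__main__":
    for number in candidate_causes(False, host_ping_ok=True,
                                   l2_agent_alive=True):
        print(number, CAUSES[number])
```

Even with both extra probes succeeding, causes (1) through (4) remain
indistinguishable from outside the guest.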

Suppose further it's using shared storage (running off an RBD volume
or using an iSCSI volume, or whatever). Now you have almost as many
recovery options as possible causes for the failure, and some of those
recovery options will potentially destroy your guest's data.

No matter how you twist and turn the problem, you need strongly
consistent distributed VM state plus fencing. In other words, you need
a full-blown HA stack.
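
As a rough illustration of why fencing has to come first, the sketch below
refuses to restart a guest elsewhere until the suspect host is confirmed
fenced. The `fence` and `evacuate` callables and the `FencingFailed`
exception are hypothetical placeholders for whatever the HA stack provides;
this is not an existing Nova API.

```python
from typing import Callable


class FencingFailed(Exception):
    """Raised when the suspect host could not be confirmed fenced."""


def recover_guest(guest_id: str,
                  suspect_host: str,
                  fence: Callable[[str], bool],
                  evacuate: Callable[[str], str]) -> str:
    """Restart a guest on another host, but only after fencing succeeds.

    fence(host) must return True only once the host is known to be cut
    off (for example, powered off). evacuate(guest_id) returns the name
    of the new host. Both callables are assumptions for this sketch.
    """
    if not fence(suspect_host):
        # Without confirmed fencing, the old instance may still be writing
        # to shared storage; starting a second copy risks corrupting data.
        raise FencingFailed(f"could not fence {suspect_host}")
    return evacuate(guest_id)
```

If the host is merely unreachable rather than dead, skipping the fencing
step could leave two copies of the guest writing to the same RBD or iSCSI
volume.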

> I'd rather use nova for low level task and maybe low level monitoring (imho
> nova should do that using servicegroup). But I'd use something more
> configurable for actual task triggering like heat. That would give us
> framework rather than mechanism. Later we might want to apply HA on network or
> volume, then we'll have mechanism ready just monitoring hook and healing
> will need to be implemented.
>
> We can use scheduler hints to place resource on host HA-compatible
> (whichever health action we'd like to use), this will be a bit more complicated, but
> also will give us more flexibility.

I apologize in advance for my bluntness, but this all sounds to me
like you're vastly underrating the problem of reliable guest state
detection and recovery. :)

> I agree that we all should meet in Paris and discuss that so we can join our
> forces. This is one of bigger gaps to be filled imho.

Pretty much every user I've worked with in the last 2 years agrees.
Granted, my view may be skewed as HA is typically what customers
approach us for in the first place, but yes, this definitely needs a
globally understood and supported solution.

Cheers,
Florian


[openstack-dev] [Nova] Automatic evacuate
