[openstack-dev] [heat][nova] VM restarting on host failure in convergence
Jastrzebski, Michal
michal.jastrzebski at intel.com
Fri Sep 19 09:16:10 UTC 2014
> > In short, what we'll need from nova is a 100% reliable
> > host-health monitor and an equally reliable rebuild/evacuate mechanism
> > with fencing and scheduler. In heat we need a scalable and reliable
> > event listener and an engine to decide which action to perform in a
> > given situation.
>
> Unfortunately, I don't think Nova can provide this alone. Nova only
> knows about whether or not the nova-compute daemon is currently
> communicating with the rest of the system. Even if the nova-compute
> daemon drops out, the compute node may still be running all instances
> just fine. We certainly don't want to impact those running workloads
> unless absolutely necessary.
But, on the other hand, if the host is really down, nova might want to
know that, if only to change the instances' status to ERROR or similar.
I don't think a situation where an instance is down due to host failure
and nova doesn't know about it is good for anyone.
> I understand that you're suggesting that we enhance Nova to be able to
> provide that level of knowledge and control. I actually don't think
> Nova should have this knowledge of its underlying infrastructure.
>
> I would put the host monitoring infrastructure (to determine if a host
> is down) and fencing capability as out of scope for Nova and as a part
> of the supporting infrastructure. Assuming those pieces can properly
> detect that a host is down and fence it, then all that's needed from
> Nova is the evacuate capability, which is already there. There may be
> some enhancements that could be done to it, but surely it's quite close.
Why do you think nova shouldn't have information about the underlying
infra? Since the service group is plugin-based, we could develop a new
plugin to make nova's information more reliable without any impact on
current code. I'm a bit concerned about the dependency injection we'd
have to do. I'd love to be in a situation where people would have some
level (maybe not the best they can get) of SLA in heat out of the box,
without a bigger investment in infrastructure configuration.
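
To make the plugin idea a bit more concrete, here is a minimal sketch of
what I have in mind. None of these class names come from nova; they are
hypothetical and only illustrate a pluggable liveness check that falls
back to an out-of-band probe when the compute daemon's heartbeat goes
stale, so a dead daemon on a healthy host is not treated as a dead host.

```python
import time
from abc import ABC, abstractmethod


class HostHealthDriver(ABC):
    """Hypothetical plugin interface: answers 'is this host alive?'."""

    @abstractmethod
    def is_up(self, host):
        ...


class HeartbeatDriver(HostHealthDriver):
    """Considers a host up if it reported within `down_after` seconds."""

    def __init__(self, down_after=60.0, clock=time.monotonic):
        self.down_after = down_after
        self.clock = clock
        self.last_seen = {}

    def report(self, host):
        # Called whenever the compute daemon on `host` checks in.
        self.last_seen[host] = self.clock()

    def is_up(self, host):
        seen = self.last_seen.get(host)
        return seen is not None and self.clock() - seen <= self.down_after


class ProbeBackedDriver(HostHealthDriver):
    """Wraps a heartbeat driver and asks a second probe before giving up.

    `probe` is any callable taking a host name and returning True/False
    (an assumption: for example a ping or an IPMI power-state check).
    """

    def __init__(self, heartbeat, probe):
        self.heartbeat = heartbeat
        self.probe = probe

    def is_up(self, host):
        # Only consult the probe when the heartbeat is stale or missing.
        return self.heartbeat.is_up(host) or self.probe(host)
```

The existing heartbeat behaviour stays untouched; the extra check is
opt-in by choosing the wrapping driver.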
> There's also the part where a notification needs to go out saying that
> the instance has failed. Some thing (which could be Heat in the case of
> this proposal) can react to that, either directly or via ceilometer, for
> example. There is an API today to hard reset the state of an instance
> to ERROR. After a host is fenced, you could use this API to mark all
> instances on that host as dead. I'm not sure if there's an easy way to
> do that for all instances on a host today. That's likely an enhancement
> we could make to python-novaclient, similar to the "evacuate all
> instances on a host" enhancement that was done in novaclient.
Why wouldn't nova itself do that? I mean, in my opinion nova should know
the real status of its instances at all times.
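
For comparison, the client-side approach (marking every instance on a
fenced host as ERROR through the API) could look roughly like the sketch
below. It assumes `nova` is an authenticated python-novaclient client, or
any object exposing the same `servers.list` and `servers.reset_state`
methods; the function name is made up for illustration, and the host is
assumed to be fenced already.

```python
def mark_host_instances_error(nova, host):
    """Reset every instance on `host` to ERROR and return their IDs.

    Assumption: `host` has already been fenced by external tooling, so
    none of these instances can still be running.
    """
    # all_tenants lets an admin see instances from every project.
    servers = nova.servers.list(search_opts={"host": host, "all_tenants": 1})
    marked = []
    for server in servers:
        nova.servers.reset_state(server, state="error")
        marked.append(server.id)
    return marked
```

This works, but something outside nova has to call it after every
failure, which is exactly the part I would rather see nova handle.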
Thanks,
Michał