[openstack-dev] [Nova] Automatic evacuate

Florian Haas florian at hastexo.com
Thu Oct 16 09:27:04 UTC 2014


On Thu, Oct 16, 2014 at 11:01 AM, Thomas Herve
<thomas.herve at enovance.com> wrote:
>
>> >> This still doesn't do away with the requirement to reliably detect
>> >> node failure, and to fence misbehaving nodes. Detecting that a node
>> >> has failed, and fencing it if unsure, is a prerequisite for any
>> >> recovery action. So you need Corosync/Pacemaker anyway.
>> >
>> > Obviously, yes.  My post covered all of that directly ... the tagging
>> > bit was just additional input into the recovery operation.
>>
>> This is essentially why I am saying using the Pacemaker stack is the
>> smarter approach than hacking something into Ceilometer and Heat. You
>> already need Pacemaker for service availability (and all major vendors
>> have adopted it for that purpose), so a highly available cloud that
>> does *not* use Pacemaker at all won't be a vendor supported option for
>> some time. So people will already be running Pacemaker — then why not
>> use it for what it's good at?
>
> I may be missing something, but Pacemaker will only provide monitoring of your compute node, right? I think the advantage you would get by using something like Heat is having an instance agent and providing monitoring of your client service, instead of just knowing the status of your hypervisor. Hosts can fail, but there is another array of failures that you can't handle with deployment-wide monitoring alone.

You *are* missing something, indeed. :) Pacemaker would be a perfectly
fine tool for monitoring the status of your guests on the hosts as
well. So arguably, nova-compute could, down the road, hook into pcsd
(https://github.com/feist/pcs/tree/master/pcs -- all in Python) to
inject VM monitoring into the Pacemaker configuration. This would, of
course, need to be specific to the hypervisor, so it would be a job
for the nova driver rather than something implemented at the generic
nova-compute level.
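
To make that concrete, here's a rough sketch of what such an
injection could look like. This is untested, and the hook name and
arguments are entirely made up, but the pcs subcommands and the
ocf:heartbeat:VirtualDomain resource agent are real:

    # Hypothetical nova driver hook: register a freshly spawned guest
    # with Pacemaker so it gets monitored like any other resource.
    # Function name and arguments are made up for illustration.
    import subprocess

    def register_vm_monitor(instance_uuid, domain_xml_path, host):
        resource = 'vm-%s' % instance_uuid
        # Create a VirtualDomain resource with a 10-second monitor op.
        subprocess.check_call([
            'pcs', 'resource', 'create', resource,
            'ocf:heartbeat:VirtualDomain',
            'config=%s' % domain_xml_path,
            'op', 'monitor', 'interval=10s',
        ])
        # Pin the guest to the node Nova scheduled it on; placement
        # stays Nova's job, Pacemaker just watches the domain.
        subprocess.check_call([
            'pcs', 'constraint', 'location', resource,
            'prefers', host,
        ])

In a real implementation you'd probably want Pacemaker to only
monitor the guest rather than manage its lifecycle, so that Nova and
Pacemaker don't fight over who starts and stops the domain.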

But my hunch is that that sort of thing would be for the L release;
for Kilo the low-hanging fruit would be to defend against host failure
(meaning, compute node failure, unrecoverable nova-compute service
failure, etc.).
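
The recovery half of that is easy enough to prototype today: once
Pacemaker has fenced the failed node, a notification hook can walk
the instances that were running on it and ask Nova to evacuate them.
A minimal, untested sketch using python-novaclient (credentials,
error handling and retries omitted, and it assumes shared storage):

    # Minimal sketch of a post-fencing recovery hook: evacuate all
    # instances away from a compute node Pacemaker has just fenced.
    from novaclient import client

    def evacuate_host(failed_host):
        nova = client.Client('2', 'admin', 'secret', 'admin',
                             'http://keystone:5000/v2.0')
        servers = nova.servers.list(
            search_opts={'host': failed_host, 'all_tenants': 1})
        for server in servers:
            # host=None lets the scheduler pick the target node.
            nova.servers.evacuate(server, host=None,
                                  on_shared_storage=True)

The hard part, as discussed upthread, isn't issuing the evacuate
calls; it's being certain the node is really dead (fenced) before
you do.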

Cheers,
Florian
