[openstack-dev] [Nova] Automatic evacuate
Russell Bryant
rbryant at redhat.com
Thu Oct 16 12:07:24 UTC 2014
On 10/16/2014 05:01 AM, Thomas Herve wrote:
>
>>>> This still doesn't do away with the requirement to reliably detect
>>>> node failure, and to fence misbehaving nodes. Detecting that a node
>>>> has failed, and fencing it if unsure, is a prerequisite for any
>>>> recovery action. So you need Corosync/Pacemaker anyway.
>>>
>>> Obviously, yes. My post covered all of that directly ... the tagging
>>> bit was just additional input into the recovery operation.
>>
>> This is essentially why I am saying using the Pacemaker stack is the
>> smarter approach than hacking something into Ceilometer and Heat. You
>> already need Pacemaker for service availability (and all major vendors
>> have adopted it for that purpose), so a highly available cloud that
>> does *not* use Pacemaker at all won't be a vendor supported option for
>> some time. So people will already be running Pacemaker — then why not
>> use it for what it's good at?
>
> I may be missing something, but Pacemaker will only provide
> monitoring of your compute node, right? I think the advantage you
> would get by using something like Heat is having an instance agent
> and provide monitoring of your client service, instead of just
> knowing the status of your hypervisor. Hosts can fail, but there is
> another array of failures that you can't handle with the global
> deployment monitoring.

I think that's an important problem, too.

The thread started out talking about evacuate, which is used in the case
of a host failure. I wrote up a more detailed proposal for using an
external tool (Pacemaker) to handle automatic evacuation of failed hosts.
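
To make the ordering discussed earlier in the thread concrete (confirm the
failure, fence the host, and only then recover), here is a minimal Python
sketch. It is not the Pacemaker integration itself: `fence`,
`list_instances`, and `evacuate` are hypothetical callables standing in for
whatever the external tool and the Nova API client would actually provide.

```python
from typing import Callable, Iterable, List


def recover_failed_host(
    host: str,
    fence: Callable[[str], bool],
    list_instances: Callable[[str], Iterable[str]],
    evacuate: Callable[[str], None],
) -> List[str]:
    """Fence `host`, then evacuate its instances.

    All three callables are hypothetical placeholders for this sketch:
      fence(host)          -> True once the host is confirmed powered off
      list_instances(host) -> IDs of instances that were running on the host
      evacuate(instance)   -> request that the instance be rebuilt elsewhere
    Returns the IDs of the instances for which evacuation was requested.
    """
    if not fence(host):
        # Without confirmed fencing the old host might still be running the
        # guests, so evacuating could leave two copies of the same instance.
        return []

    evacuated = []
    for instance_id in list_instances(host):
        evacuate(instance_id)
        evacuated.append(instance_id)
    return evacuated
```

The point of the sketch is only the guard at the top: no recovery action is
taken unless fencing has succeeded.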

For a guest OS failure, we have some basic watchdog support. From my
blog post:

"It’s worth noting that the libvirt/KVM driver in OpenStack does contain
one feature related to guest operating system failure. The
libvirt-watchdog blueprint was implemented in the Icehouse release of
Nova. This feature allows you to set the hw_watchdog_action property on
either the image or flavor. Valid values include poweroff, reset,
pause, and none. When this is enabled, libvirt will enable the i6300esb
watchdog device for the guest and will perform the requested action if
the watchdog is triggered. This may be a helpful component of your
strategy for recovery from guest failures."
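
As a rough illustration of the behavior described in that quote, the Python
sketch below picks a watchdog action from flavor and image properties,
checks it against the four values listed above, and builds the libvirt
`<watchdog>` device element. This is not Nova's code: the function names are
made up for the example, and the choice to let the image property override
the flavor is an assumption.

```python
import xml.etree.ElementTree as ET

# The four values named in the quoted blog post.
VALID_ACTIONS = {"poweroff", "reset", "pause", "none"}


def resolve_watchdog_action(flavor_extra_specs, image_properties):
    """Return the requested watchdog action, or None if none is set.

    Assumption for this sketch: an image-level hw_watchdog_action
    overrides a flavor-level one.
    """
    action = image_properties.get("hw_watchdog_action")
    if action is None:
        action = flavor_extra_specs.get("hw_watchdog_action")
    if action is None:
        return None
    if action not in VALID_ACTIONS:
        raise ValueError("invalid hw_watchdog_action: %r" % (action,))
    return action


def watchdog_device_xml(action):
    """Build the libvirt domain XML element for an i6300esb watchdog."""
    elem = ET.Element("watchdog", {"model": "i6300esb", "action": action})
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    chosen = resolve_watchdog_action({"hw_watchdog_action": "reset"}, {})
    if chosen is not None:
        print(watchdog_device_xml(chosen))
```

Note that the watchdog only fires if something inside the guest (a watchdog
daemon, for example) is regularly resetting the device and then stops.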

HA in the case of application failures can be handled in several ways,
depending on the application. It's really a separate problem space,
though, IMO.

--
Russell Bryant