[openstack-dev] blueprint proposal nova-compute fencing for HA ?

Alex Glikson GLIKSON at il.ibm.com
Mon Apr 22 12:58:13 UTC 2013

I think this is a good idea.
We already have a framework in Nova to detect and report the failure 
(service group monitoring APIs, with DB and ZK backends already 
implemented), as well as APIs to list instances on a host and to evacuate 
individual instances (soon with destination selected by the scheduler). 
Indeed, the missing pieces now are the end-to-end orchestration (which is 
probably not going to happen within Nova, at least at the moment), and the 
mechanism(s) to isolate the failed host (e.g., to protect against false 
failure detection events) -- which could potentially happen in several 
places, as you mentioned. It might be the case that whatever can be done 
within Nova is already there -- the corresponding nova-compute will be 
considered down. So, maybe now the question is which additional components 
might be used (as you mentioned -- bare-metal, quantum, cinder, etc). Once 
the individual measures are clear (and implemented), the orchestration 
logic (wherever that would be) can use them.
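To illustrate the service group monitoring idea, here is a minimal sketch of the kind of heartbeat check behind "the corresponding nova-compute will be considered down": a service counts as up only while its last heartbeat is newer than a configured threshold. The names and the 60-second threshold are illustrative, not Nova's actual implementation.

```python
import datetime

# Illustrative heartbeat check: a nova-compute service is considered
# down once its last reported heartbeat is older than a configured
# threshold. Threshold and function names are made up for this sketch.
SERVICE_DOWN_TIME = datetime.timedelta(seconds=60)

def is_up(last_heartbeat, now=None):
    """Return True if the service heartbeat is recent enough."""
    now = now or datetime.datetime.utcnow()
    return (now - last_heartbeat) <= SERVICE_DOWN_TIME

now = datetime.datetime(2013, 4, 22, 12, 0, 0)
fresh = now - datetime.timedelta(seconds=30)   # heartbeat 30s ago
stale = now - datetime.timedelta(seconds=120)  # heartbeat 2min ago
print(is_up(fresh, now))  # True
print(is_up(stale, now))  # False
```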


From:   Leen Besselink <ubuntu at consolejunkie.net>
To:     OpenStack Development Mailing List 
<openstack-dev at lists.openstack.org>, 
Date:   22/04/2013 03:18 PM
Subject:        [openstack-dev] blueprint proposal nova-compute fencing 
for HA ?


As I was not at the summit and the technical videos of the Summit 
are not yet online, I am not aware of what was discussed there.

But I would like to submit a blueprint.

My idea is:

It is a step toward supporting VM high availability.

This part is about handling compute node failure.

My proposal would be to create a framework/API/plugin/agent or whatever is 
needed for fencing off a nova-compute node.

So when something detects that a nova-compute node isn't functional 
anymore, it can fence off that nova-compute node.

After which it can call 'evacuate' to start the instance(s) that were 
previously running on the failed compute node on other compute node(s).
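The detect-fence-evacuate flow above could be sketched roughly as follows. Every function here is a hypothetical placeholder, not a real Nova API; the point is only the ordering: isolate the node first, then restart its instances elsewhere.

```python
# Hedged sketch of the end-to-end flow: detect a dead nova-compute
# node, fence it, then evacuate its instances. All functions are
# hypothetical placeholders standing in for real Nova calls.

def is_compute_up(host):
    # Placeholder: would query Nova's service group API.
    return host != "compute-2"

def fence(host):
    # Placeholder: would power the node off, cut its network access,
    # or block it at the storage servers.
    return "fenced %s" % host

def evacuate(host):
    # Placeholder: would list instances on the host and call the
    # per-instance evacuate API for each of them.
    return ["rebuilt instance-1", "rebuilt instance-2"]

def handle_failure(host):
    if is_compute_up(host):
        return None  # nothing to do
    fence(host)            # isolate FIRST, guarding against a
    return evacuate(host)  # false detection / split-brain, THEN restart

print(handle_failure("compute-1"))  # None: host is healthy
print(handle_failure("compute-2"))  # instances rebuilt elsewhere
```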

The implementation of the code that handles the fencing could be 
implemented in different ways for different environments:

- The IPMI code that handles baremetal provisioning could, for example, be 
used to power off or reboot the node.

- The Quantum networking code could be used to "disconnect" the 
instance(s) of the failed compute node (or the whole compute node) from 
their respective networks. If you are using overlays, you could configure 
other machines not to accept tunnel traffic from the failed compute node 
for the networks of the instance(s).

- You could also have a firewall agent configure the shared storage 
servers (or a firewall in between) to not accept traffic from the failed 
compute node.
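Two of the mechanisms listed above could be sketched as simple command construction. ipmitool and iptables are real tools with these flags, but the BMC address, credentials and the surrounding agent are made up; nothing is executed here (a real driver might hand the list to subprocess.check_call()).

```python
# Hedged sketches of two fencing mechanisms, as command construction
# only. Addresses and credentials are illustrative.

def ipmi_power_off_cmd(bmc_host, user, password):
    # Power the failed node off via its BMC, as the baremetal
    # IPMI code might do.
    return ["ipmitool", "-I", "lanplus",
            "-H", bmc_host, "-U", user, "-P", password,
            "chassis", "power", "off"]

def block_compute_node_cmd(compute_ip):
    # Storage-side fencing: drop all traffic from the failed compute
    # node at the shared storage server's firewall.
    return ["iptables", "-I", "INPUT", "-s", compute_ip, "-j", "DROP"]

print(" ".join(ipmi_power_off_cmd("10.0.0.42", "admin", "secret")))
print(" ".join(block_compute_node_cmd("192.168.10.7")))
```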

I am sure other people have other ideas.

My request would be to have an API and general framework which can call 
the different implementations that are configured for that environment.
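The requested general framework might look something like this: a common fencing entry point that dispatches to whichever driver implementations are configured for the environment. The registry, driver names and return strings are purely illustrative.

```python
# Rough sketch of a pluggable fencing framework: one common API that
# runs the fencing drivers configured for this environment. Driver
# names and behaviour are made up for illustration.

FENCING_DRIVERS = {}

def register_driver(name):
    def wrap(fn):
        FENCING_DRIVERS[name] = fn
        return fn
    return wrap

@register_driver("ipmi")
def ipmi_fence(host):
    return "powered off %s via IPMI" % host

@register_driver("network")
def network_fence(host):
    return "disconnected %s from tenant networks" % host

def fence_host(host, drivers=("ipmi", "network")):
    """Run every configured fencing driver against the failed host."""
    return [FENCING_DRIVERS[d](host) for d in drivers]

print(fence_host("compute-3"))
```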

Does that make any sense ?

Or maybe this should be handled by creating clusters with, for example, 
Pacemaker, like I assume oVirt might be doing with their proposals:


As I am not yet all that familiar with the structure of OpenStack or how 
it is organized, it could be that I am asking in the wrong place to 
discuss this; if it does not fit architecturally, do let me know where I 
went wrong.

I've looked at the list of existing blueprints and I see evacuate, 
fault-tolerance/HA and other related blueprints as well:




I think it would be a good idea to get an overview of what all of the 
use cases are and then split them up into tasks.

Hope this is helpful.

Have a nice day,

OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org
