[openstack-dev] [nova] automatically evacuate instances on compute failure

Chris Friesen chris.friesen at windriver.com
Tue Oct 8 22:47:52 UTC 2013


On 10/08/2013 03:20 PM, Alex Glikson wrote:
> Seems that this can be broken into 3 incremental pieces. First, would be
> great if the ability to schedule a single 'evacuate' would be finally
> merged
> (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).

Agreed.

> Then, it would make sense to have the logic that evacuates an entire
> host
> (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
> The reasoning behind suggesting that this should not necessarily be in
> Nova is, perhaps, that it *can* be implemented outside Nova using the
> individual 'evacuate' API.

This more or less exists already as the "nova host-evacuate" command.
One major issue, however, is that it requires the caller to specify
whether all the instances are on shared or local storage, so it can't
handle a mix of local and shared storage for the instances.  If any of
them boot off block storage, for instance, you need to move those first
and then do the remaining ones as a group.
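
To illustrate the workaround described above, here is a minimal sketch of
how a caller might split the instances on a failed host into batches before
evacuating them.  The Instance class, the storage labels ("volume",
"shared", "local") and plan_evacuation() are hypothetical names made up for
this example; none of them come from nova or novaclient.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical storage labels; "volume" means the instance boots off
# block storage.
BATCH_ORDER = ("volume", "shared", "local")


@dataclass
class Instance:
    uuid: str
    storage: str  # one of BATCH_ORDER


def plan_evacuation(instances):
    """Group instances by storage type, volume-backed ones first.

    Returns a list of (storage, [uuid, ...]) batches so that each batch
    can be evacuated with a single shared/local storage setting.
    """
    groups = defaultdict(list)
    for inst in instances:
        if inst.storage not in BATCH_ORDER:
            raise ValueError("unknown storage type: %r" % inst.storage)
        groups[inst.storage].append(inst.uuid)
    # Skip storage types that have no instances on this host.
    return [(kind, groups[kind]) for kind in BATCH_ORDER if kind in groups]
```

Each batch then only needs one answer to the "shared or local?" question,
which is the part the caller currently has to work out by hand.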

It would be nice to embed the knowledge of whether or not an instance is 
on shared storage in the instance itself at creation time.  I envision 
specifying this in the config file for the compute manager along with 
the instance storage location, and the compute manager could set the 
field in the instance at creation time.
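
As a rough sketch of that idea, and nothing more: the option name
instances_on_shared_storage, the ComputeConfig and InstanceRecord classes,
and create_instance_record() are all hypothetical and only stand in for the
compute manager's config and the instance record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeConfig:
    """Stand-in for the compute manager's config file (hypothetical)."""
    instances_path: str
    instances_on_shared_storage: bool  # hypothetical option


@dataclass
class InstanceRecord:
    """Stand-in for the stored instance (hypothetical)."""
    uuid: str
    host: str
    on_shared_storage: bool


def create_instance_record(uuid, host, config):
    # Copy the host-level setting into the instance at creation time, so
    # a later evacuate can read it from the instance instead of asking
    # the caller.
    return InstanceRecord(
        uuid=uuid,
        host=host,
        on_shared_storage=config.instances_on_shared_storage,
    )
```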

> Finally, it should be possible to close the
> loop and invoke the evacuation automatically as a result of a failure
> detection (not clear how exactly this would work, though). Hopefully we
> will have at least the first part merged soon (not sure if anyone is
> actively working on a rebase).

My interpretation of the discussion so far is that the nova maintainers 
would prefer this to be driven by an outside orchestration daemon.

Currently the only way a service is recognized to be "down" is if 
someone calls is_up() and it notices that the service hasn't sent an 
update in the last minute.  There's nothing in nova actively scanning 
for compute node failures, which is where the outside daemon comes in.
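
A minimal sketch of what the scanning part of such an outside daemon might
look like, assuming it can obtain the time of each compute service's last
update.  The function names and the heartbeats mapping are made up for
illustration, and the 60-second threshold simply mirrors the "last minute"
mentioned above; this is not nova's actual implementation.

```python
from datetime import datetime, timedelta

# Assumed threshold, matching the "last minute" described above.
DOWN_AFTER = timedelta(seconds=60)


def service_is_up(last_update, now, down_after=DOWN_AFTER):
    """Treat a service as up if it reported within the threshold."""
    return (now - last_update) <= down_after


def find_down_hosts(heartbeats, now, down_after=DOWN_AFTER):
    """Return the sorted hosts whose last update is too old.

    heartbeats maps host name -> datetime of the last update seen.
    """
    return sorted(
        host
        for host, last_update in heartbeats.items()
        if not service_is_up(last_update, now, down_after)
    )
```

The daemon would call find_down_hosts() on a timer and hand the result to
whatever decides to evacuate, rather than waiting for someone to happen to
ask whether a service is up.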

Also, there is some complexity involved in dealing with auto-evacuate: 
What do you do if an evacuate fails?  How do you recover intelligently 
if there is no admin involved?

Chris


