[openstack-dev] [nova] automatically evacuate instances on compute failure

Alex Glikson GLIKSON at il.ibm.com
Wed Oct 9 11:59:46 UTC 2013


> Hypervisor failure detection is also a more or less solved problem in
> Nova [2]. There are other candidates for that task as well, such as
> Ceilometer's hardware agent [3] (still a work in progress, to my
> knowledge).

The problem is that in some cases you want to be *really* sure that the
hypervisor is down before running 'evacuate' (otherwise it could lead to
an application crash). And you want to do it at scale. So polling and
traditional monitoring might not be good enough for a fully automated
service (e.g., you may need to do 'fencing' to ensure that the node will
not suddenly come back with all the VMs still running).
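As a rough sketch of that ordering (fence first, evacuate second; all
host names, credentials and the target host below are illustrative
placeholders, and the shared-storage flag depends on your deployment):

    # Sketch only: power the node off via IPMI before evacuating, so it
    # cannot come back with the VMs still running underneath the rebuilt
    # copies. Assumes ipmitool and python-novaclient are installed.
    import subprocess
    from novaclient.v1_1 import client

    def fence_node(bmc_ip, user, password):
        subprocess.check_call([
            "ipmitool", "-I", "lanplus", "-H", bmc_ip,
            "-U", user, "-P", password, "chassis", "power", "off"])

    def evacuate_host(nova, failed_host, target_host):
        # Rebuild each instance from the failed host on the target host
        # (a blueprint exists to let the scheduler pick the target).
        for server in nova.servers.list(
                search_opts={"host": failed_host, "all_tenants": 1}):
            nova.servers.evacuate(server, target_host,
                                  on_shared_storage=True)

    nova = client.Client("admin", "secret", "admin",
                         "http://keystone:5000/v2.0/")
    fence_node("10.0.0.42", "ipmi-admin", "ipmi-pass")
    evacuate_host(nova, "compute-01", "compute-02")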

Regards,
Alex




From:   Oleg Gelbukh <ogelbukh at mirantis.com>
To:     OpenStack Development Mailing List 
<openstack-dev at lists.openstack.org>, 
Date:   09/10/2013 02:09 PM
Subject:        Re: [openstack-dev] [nova] automatically evacuate 
instances on compute failure



Hello,

We have much interest in this discussion (with a focus on the second
scenario outlined by Tim), and we are working on its design at the
moment. Thanks to everyone for the valuable insights in this thread.

It looks like the external orchestration daemon problem is already
partially solved by Heat with the HARestarter resource [1].
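For reference, a minimal Grizzly-era, CFN-style template pairing an
instance with an HARestarter might look like the sketch below (expressed
as a Python dict; the image and flavor names are placeholders, and the
alarm wiring that would actually trigger the restart is omitted):

    # Sketch of a minimal Heat template using OS::Heat::HARestarter.
    # Placeholder image/flavor; alarm wiring omitted for brevity.
    import json

    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "MyInstance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": "my-image",       # placeholder
                    "InstanceType": "m1.small",  # placeholder
                },
            },
            "MyRestarter": {
                "Type": "OS::Heat::HARestarter",
                "Properties": {"InstanceId": {"Ref": "MyInstance"}},
            },
        },
    }
    print(json.dumps(template, indent=2))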

Hypervisor failure detection is also a more or less solved problem in
Nova [2]. There are other candidates for that task as well, such as
Ceilometer's hardware agent [3] (still a work in progress, to my
knowledge).

[1] 
https://github.com/openstack/heat/blob/stable/grizzly/heat/engine/resources/instance.py#L35
[2] 
http://docs.openstack.org/developer/nova/api/nova.api.openstack.compute.contrib.hypervisors.html#module-nova.api.openstack.compute.contrib.hypervisors
[3] 
https://blueprints.launchpad.net/ceilometer/+spec/monitoring-physical-devices
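
As a rough illustration of the Nova-side detection in [2], an external
watcher could simply poll the service list and react to the 'down'
state (endpoint, credentials and polling interval are placeholders):

    # Naive polling sketch using python-novaclient's os-services support.
    import time
    from novaclient.v1_1 import client

    nova = client.Client("admin", "secret", "admin",
                         "http://keystone:5000/v2.0/")
    while True:
        for svc in nova.services.list(binary="nova-compute"):
            if svc.state == "down":
                print("nova-compute on %s reported down" % svc.host)
                # ...fence and evacuate here...
        time.sleep(10)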
--
Best regards,
Oleg Gelbukh
Mirantis Labs


On Wed, Oct 9, 2013 at 9:26 AM, Tim Bell <Tim.Bell at cern.ch> wrote:
I have proposed a summit design session for Hong Kong
(http://summit.openstack.org/cfp/details/103) to discuss exactly these
points. We have the low-level Nova commands, but we need a service to
automate the process.

I see two scenarios:

- A hardware intervention needs to be scheduled: please rebalance this
workload elsewhere before it fails completely.
- A hypervisor has failed: please recover what you can using shared
storage, and give me a policy on what to do with the other VMs (restart,
leave down until repaired, etc.)

Most OpenStack production sites have some sort of script doing this now.
However, each one implements the migration logic differently, so there
is no agreed best-practice approach.
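For the first scenario, such a script is often little more than the
following sketch (live migration off the doomed host; names and
credentials are placeholders, and block_migration must match your
storage setup):

    # Sketch: drain a host before a scheduled hardware intervention by
    # live-migrating everything off it; the scheduler picks the targets.
    from novaclient.v1_1 import client

    nova = client.Client("admin", "secret", "admin",
                         "http://keystone:5000/v2.0/")

    def drain_host(host):
        for server in nova.servers.list(
                search_opts={"host": host, "all_tenants": 1}):
            # block_migration=False assumes shared storage.
            server.live_migrate(host=None, block_migration=False,
                                disk_over_commit=False)

    drain_host("compute-01")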

Tim

> -----Original Message-----
> From: Chris Friesen [mailto:chris.friesen at windriver.com]
> Sent: 09 October 2013 00:48
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [nova] automatically evacuate instances on 
compute failure
>
> On 10/08/2013 03:20 PM, Alex Glikson wrote:
> > Seems that this can be broken into 3 incremental pieces. First, it
> > would be great if the ability to schedule a single 'evacuate' finally
> > got merged
> > (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
>
> Agreed.
>
> > Then, it would make sense to have the logic that evacuates an entire
> > host
> > (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
> > The reasoning behind suggesting that this should not necessarily be in
> > Nova is, perhaps, that it *can* be implemented outside Nova using the
> > individual 'evacuate' API.
>
> This more-or-less exists already in the "nova host-evacuate" command.
> One major issue with this, however, is that it requires the caller to
> specify whether all the instances are on shared or local storage, so it
> can't handle a mix of local and shared storage for the instances. If
> any of them boot off block storage, for instance, you need to move them
> first and then do the remaining ones as a group.
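A two-phase workaround along those lines might look like the following
sketch (the empty-image test for boot-from-volume and all names are
naive placeholders):

    # Sketch: evacuate boot-from-volume instances first, then the rest
    # as one group with a single shared-storage answer.
    def evacuate_mixed(nova, failed_host, target, shared):
        servers = nova.servers.list(
            search_opts={"host": failed_host, "all_tenants": 1})
        # Boot-from-volume servers report an empty image in the API.
        bfv = [s for s in servers if not s.image]
        rest = [s for s in servers if s.image]
        for s in bfv:
            nova.servers.evacuate(s, target, on_shared_storage=False)
        for s in rest:
            nova.servers.evacuate(s, target, on_shared_storage=shared)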
>
> It would be nice to embed the knowledge of whether or not an instance
> is on shared storage in the instance itself at creation time. I
> envision specifying this in the config file for the compute manager,
> along with the instance storage location, and the compute manager
> could set the field in the instance at creation time.
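A hypothetical nova.conf excerpt for that idea might look like this (the
shared-storage option below does not exist today; it is invented purely
to illustrate the proposal):

    # Hypothetical nova.conf excerpt; instances_on_shared_storage is an
    # invented option illustrating the proposal above.
    [DEFAULT]
    instances_path = /var/lib/nova/instances
    instances_on_shared_storage = true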
>
> > Finally, it should be possible to close the loop and invoke the
> > evacuation automatically as a result of a failure detection (not clear
> > how exactly this would work, though). Hopefully we will have at least
> > the first part merged soon (not sure if anyone is actively working on
> > a rebase).
>
> My interpretation of the discussion so far is that the nova maintainers
> would prefer this to be driven by an outside orchestration daemon.
>
> Currently the only way a service is recognized to be "down" is if
> someone calls is_up() and it notices that the service hasn't sent an
> update in the last minute.  There's nothing in nova actively scanning
> for compute node failures, which is where the outside daemon comes in.
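For reference, that liveness test amounts to roughly the following (a
sketch of the default DB-driven check; 60 seconds is nova's default
service_down_time):

    # Rough sketch of the check: a service is "up" only if its last
    # heartbeat is newer than service_down_time (default 60 seconds).
    from datetime import datetime, timedelta

    SERVICE_DOWN_TIME = 60

    def is_up(service):
        last = service.updated_at or service.created_at
        return (datetime.utcnow() - last
                <= timedelta(seconds=SERVICE_DOWN_TIME))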
>
> Also, there is some complexity involved in dealing with auto-evacuate:
> What do you do if an evacuate fails?  How do you recover intelligently
> if there is no admin involved?
>
> Chris
>
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


