[openstack-dev] [nova] automatically evacuate instances on compute failure

Oleg Gelbukh ogelbukh at mirantis.com
Wed Oct 16 14:04:03 UTC 2013


Tim,

Regarding this discussion, there is now at least a plan in Heat to allow
management of VMs not launched by that service:
https://blueprints.launchpad.net/heat/+spec/adopt-stack

So hopefully, in the future, HARestarter will make it possible to support medium
availability for all types of instances.
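
Purely as an illustration of the direction (nothing below exists yet), adopting
an existing server into a stack might eventually look something like this with
python-heatclient; the endpoint, token, stack name and adopt-data layout are
all hypothetical placeholders:

    import json
    from heatclient.client import Client

    # Placeholder endpoint and token; the adopt_stack_data field is
    # speculative until the adopt-stack blueprint actually lands.
    heat = Client('1', endpoint='http://heat.example.com:8004/v1/TENANT_ID',
                  token='ADMIN_TOKEN')

    adopt_data = {'resources': {'my_vm': {'resource_id': 'EXISTING_SERVER_UUID',
                                          'type': 'OS::Nova::Server'}}}

    # stacks.create() simply POSTs its keyword arguments, so adoption could
    # be expressed as one extra field alongside the usual stack parameters.
    heat.stacks.create(stack_name='adopted-vm',
                       adopt_stack_data=json.dumps(adopt_data))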

--
Best regards,
Oleg Gelbukh
Mirantis Labs


On Wed, Oct 9, 2013 at 3:28 PM, Tim Bell <Tim.Bell at cern.ch> wrote:

> Would the HARestarter approach work for VMs which were not launched by
> Heat ?
>
> We expect to have some applications driven by Heat but lots of others
> would not be (especially the more 'pet'-like traditional workloads).
>
> Tim
>
> From: Oleg Gelbukh [mailto:ogelbukh at mirantis.com]
> Sent: 09 October 2013 13:01
> To: OpenStack Development Mailing List
> Subject: Re: [openstack-dev] [nova] automatically evacuate instances on
> compute failure
>
> Hello,
>
> We have much interest in this discussion (with a focus on the second scenario
> outlined by Tim), and are working on its design at the moment. Thanks to
> everyone for the valuable insights in this thread.
>
> It looks like the external orchestration daemon problem is already partially
> solved by Heat with the HARestarter resource [1].
>
> Hypervisor failure detection is also a more or less solved problem in Nova
> [2]. There are other candidates for that task as well, such as Ceilometer's
> hardware agent [3] (still WIP to my knowledge).
>
> [1]
> https://github.com/openstack/heat/blob/stable/grizzly/heat/engine/resources/instance.py#L35
> [2]
> http://docs.openstack.org/developer/nova/api/nova.api.openstack.compute.contrib.hypervisors.html#module-nova.api.openstack.compute.contrib.hypervisors
> [3]
> https://blueprints.launchpad.net/ceilometer/+spec/monitoring-physical-devices
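>
> (For what it's worth, a minimal polling sketch of the detection side using
> python-novaclient's services API, as an alternative to the hypervisors
> extension in [2]; credentials are placeholders:)
>
>     from novaclient.v1_1 import client
>
>     # Placeholder admin credentials.
>     nova = client.Client('admin', 'PASSWORD', 'admin',
>                          'http://keystone.example.com:5000/v2.0')
>
>     # nova-compute services send periodic heartbeats; a service that has
>     # missed them for longer than service_down_time (60s by default)
>     # reports state 'down'.
>     down_hosts = [svc.host
>                   for svc in nova.services.list(binary='nova-compute')
>                   if svc.state == 'down' and svc.status == 'enabled']
>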
> --
> Best regards,
> Oleg Gelbukh
> Mirantis Labs
>
> On Wed, Oct 9, 2013 at 9:26 AM, Tim Bell <Tim.Bell at cern.ch> wrote:
> I have proposed a summit design session for Hong Kong
> (http://summit.openstack.org/cfp/details/103) to discuss exactly these
> sorts of points. We have the low-level Nova commands but need a service to
> automate the process.
>
> I see two scenarios:
>
> - A hardware intervention needs to be scheduled, please rebalance this
> workload elsewhere before it fails completely
> - A hypervisor has failed, please recover what you can using shared
> storage and give me a policy on what to do with the other VMs (restart,
> leave down until repaired, etc.)
>
> Most OpenStack production sites have some sort of script doing this sort
> of thing now. However, each one implements the migration logic differently,
> so there is no agreed best-practice approach.
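>
> (To make that concrete, here is a stripped-down sketch of what such a script
> typically does for the two scenarios above, using python-novaclient;
> credentials are placeholders, target-host selection is left to the caller,
> and a real script of course needs fencing, retries and a failure policy:)
>
>     from novaclient.v1_1 import client
>
>     # Placeholder admin credentials.
>     nova = client.Client('admin', 'PASSWORD', 'admin',
>                          'http://keystone.example.com:5000/v2.0')
>
>     def drain_host(host, target_host):
>         """Scenario 1: planned intervention -- live-migrate VMs off a
>         still-healthy hypervisor before it is taken down."""
>         for server in nova.servers.list(search_opts={'host': host,
>                                                      'all_tenants': 1}):
>             nova.servers.live_migrate(server, target_host,
>                                       block_migration=False,
>                                       disk_over_commit=False)
>
>     def recover_host(failed_host, target_host):
>         """Scenario 2: dead hypervisor -- rebuild its VMs elsewhere from
>         shared instance storage (the failed host must be fenced first)."""
>         for server in nova.servers.list(search_opts={'host': failed_host,
>                                                      'all_tenants': 1}):
>             nova.servers.evacuate(server, target_host,
>                                   on_shared_storage=True)
>
> Having the scheduler pick the target host instead is exactly the gap the
> blueprints mentioned further down this thread are meant to close.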
>
> Tim
>
> > -----Original Message-----
> > From: Chris Friesen [mailto:chris.friesen at windriver.com]
> > Sent: 09 October 2013 00:48
> > To: openstack-dev at lists.openstack.org
> > Subject: Re: [openstack-dev] [nova] automatically evacuate instances on
> > compute failure
> >
> > On 10/08/2013 03:20 PM, Alex Glikson wrote:
> > > Seems that this can be broken into 3 incremental pieces. First, it would
> > > be great if the ability to schedule a single 'evacuate' were finally
> > > merged
> > > (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
> >
> > Agreed.
> >
> > > Then, it would make sense to have the logic that evacuates an entire
> > > host
> > > (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
> > > The reasoning behind suggesting that this should not necessarily be in
> > > Nova is, perhaps, that it *can* be implemented outside Nova using the
> > > individual 'evacuate' API.
> >
> > This actually more-or-less exists already in the existing "nova
> > host-evacuate" command.  One major issue with this, however, is that it
> > requires the caller to specify whether all the instances are on shared
> > or local storage, and so it can't handle a mix of local and shared
> > storage for the instances.  If any of them boot off block storage, for
> > instance, you need to move them first and then do the remaining ones as
> > a group.
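> >
> > (A rough sketch of how an external script can work around that today:
> > split the failed host's instances by boot type and evacuate each group
> > with its own flag. The empty-image heuristic for volume-backed servers,
> > the host names and the shared /var/lib/nova/instances assumption are
> > just that -- assumptions:)
> >
> >     from novaclient.v1_1 import client
> >
> >     # Placeholder credentials and host names.
> >     nova = client.Client('admin', 'PASSWORD', 'admin',
> >                          'http://keystone.example.com:5000/v2.0')
> >     failed_host, target_host = 'compute-12', 'compute-07'
> >
> >     servers = nova.servers.list(search_opts={'host': failed_host,
> >                                              'all_tenants': 1})
> >
> >     # Heuristic: volume-backed instances report an empty image reference.
> >     volume_backed = [s for s in servers if not s.image]
> >     image_backed = [s for s in servers if s.image]
> >
> >     for s in volume_backed:
> >         # Root disk lives in Cinder; handle these first (the right flag
> >         # depends on the deployment).
> >         nova.servers.evacuate(s, target_host, on_shared_storage=False)
> >     for s in image_backed:
> >         # Assumes /var/lib/nova/instances is on shared storage.
> >         nova.servers.evacuate(s, target_host, on_shared_storage=True)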
> >
> > It would be nice to embed the knowledge of whether or not an instance is
> > on shared storage in the instance itself at creation time.  I envision
> > specifying this in the config file for the compute manager along with the
> > instance storage location, and the compute manager could set the field in
> > the instance at creation time.
> >
> > > Finally, it should be possible to close the loop and invoke the
> > > evacuation automatically as a result of a failure detection (not clear
> > > how exactly this would work, though). Hopefully we will have at least
> > > the first part merged soon (not sure if anyone is actively working on
> > > a rebase).
> >
> > My interpretation of the discussion so far is that the nova maintainers
> > would prefer this to be driven by an outside orchestration daemon.
> >
> > Currently the only way a service is recognized to be "down" is if someone
> > calls is_up() and it notices that the service hasn't sent an update in
> > the last minute.  There's nothing in nova actively scanning for compute
> > node failures, which is where the outside daemon comes in.
> >
> > Also, there is some complexity involved in dealing with auto-evacuate:
> > What do you do if an evacuate fails?  How do you recover intelligently
> > if there is no admin involved?
> >
> > Chris
> >