[openstack-dev] [nova][libvirt] RFC: ensuring live migration ends

Daniel P. Berrange berrange at redhat.com
Mon Feb 2 10:51:29 UTC 2015


On Sat, Jan 31, 2015 at 03:55:23AM +0100, Vladik Romanovsky wrote:
> 
> 
> ----- Original Message -----
> > From: "Daniel P. Berrange" <berrange at redhat.com>
> > To: openstack-dev at lists.openstack.org, openstack-operators at lists.openstack.org
> > Sent: Friday, 30 January, 2015 11:47:16 AM
> > Subject: [openstack-dev] [nova][libvirt] RFC: ensuring live migration ends
> > 
> > In working on a recent Nova migration bug
> > 
> >   https://bugs.launchpad.net/nova/+bug/1414065
> > 
> > I had cause to refactor the way the nova libvirt driver monitors live
> > migration completion/failure/progress. This refactor has opened the
> > door for doing more intelligent active management of the live migration
> > process.
> > 
> > As it stands today, we launch live migration, with a possible bandwidth
> > limit applied and just pray that it succeeds eventually. It might take
> > until the end of the universe and we'll happily wait that long. This is
> > pretty dumb really and I think we really ought to do better. The problem
> > is that I'm not really sure what "better" should mean, except for ensuring
> > it doesn't run forever.
> > 
> > As a demo, I pushed a quick proof of concept showing how we could easily
> > just abort live migration after say 10 minutes
> > 
> >   https://review.openstack.org/#/c/151665/
> > 
> > There are a number of possible things to consider though...
> > 
> > First how to detect when live migration isn't going to succeed.
> > 
> >  - Could do a crude timeout, eg allow 10 minutes to succeed or else.
> > 
> >  - Look at data transfer stats (memory transferred, memory remaining to
> >    transfer, disk transferred, disk remaining to transfer) to determine
> >    if it is making forward progress.
> 
> I think this is a better option. We could define a timeout for the progress
> and cancel if there is no progress. IIRC there were similar debates about it
> in oVirt; we could do something similar:
> https://github.com/oVirt/vdsm/blob/master/vdsm/virt/migration.py#L430

That looks like quite a good implementation to follow. They are monitoring
progress and if they see progress stalling, then they wait a configurable
time before aborting. That should avoid prematurely aborting migrations
that are actually working, while avoiding migrations getting stuck forever.
They also have a global timeout which is based on the number of GB of RAM
the guest has, which is also a good idea compared to a one-size-fits-all
timeout.
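
As a rough illustration of that combination (a stall timer plus a global
timeout scaled by guest RAM), here is a minimal Python sketch. It is not the
vdsm code and not a proposed Nova patch; the class name, parameter names and
default values are all made up for the example, and the caller is assumed to
poll the hypervisor for the amount of data remaining.

```python
import time


class MigrationWatchdog:
    """Decide when a live migration should be aborted (illustrative only)."""

    def __init__(self, guest_ram_gb, progress_timeout=150.0,
                 max_seconds_per_gb=60.0, clock=time.monotonic):
        # All tunables here are hypothetical names with arbitrary defaults.
        self.progress_timeout = progress_timeout
        # Global budget grows with guest RAM instead of being one fixed value.
        self.max_total = max_seconds_per_gb * guest_ram_gb
        self.clock = clock
        self.start = self.last_progress = clock()
        self.lowest_remaining = None

    def should_abort(self, data_remaining):
        """Return a reason string if the migration should be aborted, else None."""
        now = self.clock()
        # Only a new low-water mark of remaining data counts as progress.
        if self.lowest_remaining is None or data_remaining < self.lowest_remaining:
            self.lowest_remaining = data_remaining
            self.last_progress = now
        if now - self.start > self.max_total:
            return "global timeout"
        if now - self.last_progress > self.progress_timeout:
            return "stalled"
        return None
```

The caller would invoke should_abort() on each poll with the latest "data
remaining" figure and cancel the migration job if it returns a reason.
Treating only a new low-water mark as progress means a guest that dirties
memory as fast as it is copied still counts as stalled.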

> > Fourth there's a question of whether we should give the tenant user or
> > cloud admin further APIs for influencing migration
> > 
> >  - Add an explicit API for cancelling migration ?
> > 
> >  - Add APIs for setting tunables like downtime, bandwidth on the fly ?
> > 
> >  - Or drive some of the tunables like downtime, bandwidth, or policies
> >    like cancel vs paused from flavour or image metadata properties ?
> > 
> >  - Allow operations like evacuate to specify a live migration policy
> >    eg switch to non-live migration after 5 minutes ?
> > 
> IMHO, an explicit API for cancelling migration is very much needed.
> I remember cases when migrations took about 8 or hours, leaving the
> admins helpless :)

The oVirt heuristics should avoid that stuck scenario, but I do think
we need an API anyway.

> Also, I very much like the idea of having tunables and policy to set
> in the flavours and image properties.
> That would allow administrators to set these as a "template" in the flavour
> and also let users update/override or "request" these options, since they
> should (hopefully) know best what is running in their guests.

We do need to make sure the administrators can always force migration
to succeed regardless of what the user might have configured, so they
can be assured of emergency evacuation if needed.
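
One way to picture that layering, purely as a sketch: defaults, then the
flavour "template", then image overrides from the user, with an admin flag
that wins over everything. The "live_migration:" property names below are
hypothetical (not existing Nova keys), and the choice of "pause" as the
forced action is an assumption based on a paused guest no longer dirtying
memory, so the copy can finish.

```python
# Hypothetical policy keys and defaults, for illustration only.
DEFAULT_POLICY = {"on_stall": "cancel", "max_downtime_ms": 500}
PREFIX = "live_migration:"


def _extract(props):
    """Pick out known policy keys from a flavour/image property dict."""
    found = {}
    for key, value in props.items():
        if key.startswith(PREFIX) and key[len(PREFIX):] in DEFAULT_POLICY:
            found[key[len(PREFIX):]] = value
    return found


def effective_policy(flavor_extra_specs, image_properties, admin_force=False):
    """Merge defaults < flavour < image, then apply the admin override last."""
    policy = dict(DEFAULT_POLICY)
    policy.update(_extract(flavor_extra_specs))
    policy.update(_extract(image_properties))
    if admin_force:
        # Admin override: ignore any user preference to cancel on stall.
        policy["on_stall"] = "pause"
    return policy
```

With this ordering a user can ask for "cancel" via image properties during
normal operation, while an emergency evacuation run with admin_force=True
still ends up with a policy that lets the migration complete.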

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


