[openstack-dev] [nova][libvirt] RFC: ensuring live migration ends

Daniel P. Berrange berrange at redhat.com
Mon Feb 2 10:56:56 UTC 2015


On Mon, Feb 02, 2015 at 08:24:20AM +1300, Robert Collins wrote:
> On 31 January 2015 at 05:47, Daniel P. Berrange <berrange at redhat.com> wrote:
> > In working on a recent Nova migration bug
> >
> >   https://bugs.launchpad.net/nova/+bug/1414065
> >
> > I had cause to refactor the way the nova libvirt driver monitors live
> > migration completion/failure/progress. This refactor has opened the
> > door for doing more intelligent active management of the live migration
> > process.
> ...
> > What kind of things would be the biggest win from Operators' or tenants'
> > POV ?
> 
> Awesome. Couple thoughts from my perspective. Firstly, there's a bunch
> of situation dependent tuning. One thing Crowbar does really nicely is
> that you specify the host layout in broad abstract terms - e.g. 'first
> 10G network link' and so on : some of your settings above like whether
> to compress pages are going to be heavily dependent on the bandwidth
> available (I doubt that compression is a win on a 100G link for
> instance, and would be suspect at 10G even). So it would be nice if
> there was a single dial or two to set and Nova would auto-calculate
> good defaults from that (with appropriate overrides being available).

I wonder how such an idea would fit into Nova, since it doesn't really
have that kind of knowledge about the network deployment characteristics.
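
If the operator did hand Nova a single number for the migration link
speed, deriving defaults from it might look roughly like the sketch
below. Everything in it is an assumption made for illustration: the
names (MigrationDefaults, defaults_for_link), the 10 Gbit/s cut-off for
page compression and the 50% bandwidth share are hypothetical and do not
correspond to existing Nova options or measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationDefaults:
    """Hypothetical bundle of derived live-migration settings."""
    compress_pages: bool
    max_bandwidth_mbytes: int  # cap on migration traffic, in MB/s


def defaults_for_link(link_gbits, compress_below_gbits=10.0, share=0.5):
    """Derive defaults from one operator-supplied link speed (Gbit/s).

    compress_below_gbits and share are illustrative tunables, not
    values taken from Nova or libvirt.
    """
    if link_gbits <= 0:
        raise ValueError("link speed must be positive")
    # 1 Gbit/s is 125 MB/s; reserve only a share of it for migration.
    max_bandwidth = int(link_gbits * 125 * share)
    # Compression costs CPU, so only enable it on slower links.
    compress = link_gbits < compress_below_gbits
    return MigrationDefaults(compress_pages=compress,
                             max_bandwidth_mbytes=max_bandwidth)


if __name__ == "__main__":
    for speed in (1, 10, 100):
        print(speed, defaults_for_link(speed))
```

Explicit per-setting overrides would still need to take precedence over
whatever such a function returns.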

> Operationally avoiding trouble is better than being able to fix it, so
> I quite like the idea of defaulting the auto-converge option on, or
> perhaps making it controllable via flavours, so that operators can
> offer (and identify!) those particularly performance sensitive
> workloads rather than having to guess which instances are special and
> which aren't.

I'll investigate the auto-converge further to find out what the
potential downsides of it are. If we can unconditionally enable
it, it would be simpler than adding yet more tunables.

> Being able to cancel the migration would be good. Relatedly being able
> to restart nova-compute while a migration is going on would be good
> (or put differently, a migration happening shouldn't prevent a deploy
> of Nova code: interlocks like that make continuous deployment much
> harder).
> 
> If we can't already, I'd like as a user to be able to see that the
> migration is happening (allows diagnosis of transient issues during
> the migration). Some ops folk may want to hide that of course.
> 
> I'm not sure that automatically rolling back after N minutes makes
> sense : if the impact on the cluster is significant then 1 minute vs
> 10 doesn't intrinsically matter: what matters more is preventing too
> many concurrent migrations, so that would be another feature that I
> don't think we have yet: don't allow more than some N inbound and M
> outbound live migrations to a compute host at any time, to prevent IO
> storms. We may want to log with NOTIFICATION migrations that are still
> progressing but appear to be having trouble completing. And of course
> an admin API to query all migrations in progress to allow API driven
> health checks by monitoring tools - which gives the power to manage
> things to admins without us having to write a probably-too-simple
> config interface.

Interesting, the point about concurrent migrations hadn't occurred to
me before, but it does of course make sense since migration is
primarily network bandwidth limited, though disk bandwidth is relevant
too if doing block migration.
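
A minimal sketch of the "no more than N inbound and M outbound" idea,
using one bounded semaphore per direction on a compute host. The class
and exception names (MigrationLimiter, MigrationLimitReached) are
hypothetical and nothing here is existing Nova code; a real
implementation would also have to cope with nova-compute restarts, which
an in-process counter like this does not survive.

```python
import threading
from contextlib import contextmanager


class MigrationLimitReached(Exception):
    """Raised when no migration slot is free and the caller won't wait."""


class MigrationLimiter:
    """Caps concurrent inbound and outbound live migrations on a host."""

    def __init__(self, max_inbound, max_outbound):
        self._slots = {
            "inbound": threading.BoundedSemaphore(max_inbound),
            "outbound": threading.BoundedSemaphore(max_outbound),
        }

    @contextmanager
    def slot(self, direction, blocking=True):
        """Hold one slot for `direction` for the duration of the block."""
        sem = self._slots[direction]
        if not sem.acquire(blocking=blocking):
            raise MigrationLimitReached(direction)
        try:
            yield
        finally:
            # Free the slot whether the migration succeeded or failed.
            sem.release()


if __name__ == "__main__":
    limiter = MigrationLimiter(max_inbound=1, max_outbound=2)
    with limiter.slot("outbound"):
        print("migrating out while holding an outbound slot")
```

Whether a migration request that finds no free slot should queue
(blocking=True) or be rejected straight away (blocking=False) is a
policy question the sketch leaves open.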

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


