[openstack-dev] [nova][libvirt] RFC: ensuring live migration ends

Robert Collins robertc at robertcollins.net
Sun Feb 1 19:24:20 UTC 2015


On 31 January 2015 at 05:47, Daniel P. Berrange <berrange at redhat.com> wrote:
> In working on a recent Nova migration bug
>
>   https://bugs.launchpad.net/nova/+bug/1414065
>
> I had cause to refactor the way the nova libvirt driver monitors live
> migration completion/failure/progress. This refactor has opened the
> door for doing more intelligent active management of the live migration
> process.
...
> What kind of things would be the biggest win from Operators' or tenants'
> POV ?

Awesome. A couple of thoughts from my perspective. Firstly, there's a
bunch of situation-dependent tuning. One thing Crowbar does really
nicely is that you specify the host layout in broad abstract terms,
e.g. 'first 10G network link' and so on. Some of your settings above,
like whether to compress pages, are going to be heavily dependent on
the bandwidth available (I doubt that compression is a win on a 100G
link for instance, and would be suspect at 10G even). So it would be
nice if there was a single dial or two to set and Nova would
auto-calculate good defaults from that (with appropriate overrides
being available).
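
A rough sketch of that single-dial idea, in Python. Everything here is
hypothetical: the names, the 10G cut-off for compression and the 50%
bandwidth reserve are illustrative assumptions, not existing Nova
options or measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationDefaults:
    """Hypothetical bundle of derived live-migration settings."""
    compress_pages: bool
    max_bandwidth_mbps: int


def defaults_for_link(link_gbps, reserve_fraction=0.5):
    """Derive migration tuning from one operator-supplied dial.

    link_gbps is the nominal speed of the migration network link.
    reserve_fraction is the share of that link migrations may use
    (an assumed default; operators would override it).
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < reserve_fraction <= 1:
        raise ValueError("reserve_fraction must be in (0, 1]")
    # Assumption: compression only pays off below 10G, since on faster
    # links the CPU cost likely outweighs the bytes saved.
    compress = link_gbps < 10
    # Cap migration traffic at a fraction of the link, in Mbit/s.
    max_bw = int(link_gbps * 1000 * reserve_fraction)
    return MigrationDefaults(compress_pages=compress,
                             max_bandwidth_mbps=max_bw)
```

An explicit per-setting override would simply replace the derived
value, so the dial only supplies defaults.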

Operationally, avoiding trouble is better than being able to fix it,
so I quite like the idea of defaulting the auto-converge option on, or
perhaps making it controllable via flavours, so that operators can
offer (and identify!) those particularly performance-sensitive
workloads rather than having to guess which instances are special and
which aren't.
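
For illustration only, the flavour-controlled variant could look
something like the sketch below. The extra spec key
'live_migration:auto_converge' is made up for this example and is not
an existing Nova extra spec; the only thing taken from Nova is that
flavour extra specs are string key/value pairs.

```python
# Hypothetical extra spec key, invented for this sketch.
AUTO_CONVERGE_KEY = "live_migration:auto_converge"

_TRUE = {"true", "1", "yes", "on"}
_FALSE = {"false", "0", "no", "off"}


def auto_converge_enabled(extra_specs, default=True):
    """Decide whether auto-converge applies to an instance.

    extra_specs is the flavour's extra specs as a dict of strings.
    The default is on; a flavour for performance-sensitive workloads
    would opt out explicitly.
    """
    raw = extra_specs.get(AUTO_CONVERGE_KEY)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    # Refuse to guess on a malformed value rather than silently
    # throttling (or not throttling) a guest.
    raise ValueError("invalid value for %s: %r" % (AUTO_CONVERGE_KEY, raw))
```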

Being able to cancel the migration would be good. Relatedly, being
able to restart nova-compute while a migration is going on would be
good (or put differently, a migration happening shouldn't prevent a
deploy of Nova code: interlocks like that make continuous deployment
much harder).

If we can't already, I'd like as a user to be able to see that the
migration is happening (allows diagnosis of transient issues during
the migration). Some ops folk may want to hide that of course.

I'm not sure that automatically rolling back after N minutes makes
sense: if the impact on the cluster is significant then 1 minute vs
10 doesn't intrinsically matter. What matters more is preventing too
many concurrent migrations, so that would be another feature that I
don't think we have yet: don't allow more than some N inbound and M
outbound live migrations on a compute host at any time, to prevent IO
storms. We may want to log with NOTIFICATION migrations that are still
progressing but appear to be having trouble completing. And of course
an admin API to query all migrations in progress would allow
API-driven health checks by monitoring tools, which gives admins the
power to manage things without us having to write a
probably-too-simple config interface.
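
A minimal sketch of the per-host N inbound / M outbound limit, again
in Python. The class and method names are hypothetical and this is not
how Nova does it; it only shows the bookkeeping, plus a snapshot that
a query API could expose to monitoring tools.

```python
import threading


class MigrationLimiter:
    """Hypothetical per-host cap on concurrent live migrations."""

    def __init__(self, max_inbound, max_outbound):
        self._limits = {"inbound": max_inbound, "outbound": max_outbound}
        self._active = {"inbound": 0, "outbound": 0}
        self._lock = threading.Lock()

    def try_acquire(self, direction):
        """Claim a slot; return False if the host is at its limit."""
        with self._lock:
            if self._active[direction] >= self._limits[direction]:
                return False
            self._active[direction] += 1
            return True

    def release(self, direction):
        """Give a slot back when a migration finishes or is cancelled."""
        with self._lock:
            if self._active[direction] == 0:
                raise RuntimeError("no %s migration to release" % direction)
            self._active[direction] -= 1

    def in_progress(self):
        """Return a copy of the current counts, e.g. for a health check."""
        with self._lock:
            return dict(self._active)
```

A caller that gets False from try_acquire would queue or reject the
migration rather than start it, which is what keeps the IO storm from
forming in the first place.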

HTH,
Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


