[Openstack-operators] [nova][libvirt] RFC: ensuring live migration ends
Daniel P. Berrange
berrange at redhat.com
Fri Jan 30 16:47:16 UTC 2015
In working on a recent Nova migration bug
I had cause to refactor the way the nova libvirt driver monitors live
migration completion/failure/progress. This refactor has opened the
door for doing more intelligent active management of the live migration.
As it stands today, we launch live migration, with a possible bandwidth
limit applied and just pray that it succeeds eventually. It might take
until the end of the universe and we'll happily wait that long. This is
pretty dumb really and I think we really ought to do better. The problem
is that I'm not really sure what "better" should mean, except for ensuring
it doesn't run forever.
As a demo, I pushed a quick proof of concept showing how we could easily
just abort live migration after say 10 minutes.
There are a number of possible things to consider though...
First, how to detect when live migration isn't going to succeed.
- Could do a crude timeout, eg allow 10 minutes to succeed or else.
- Look at data transfer stats (memory transferred, memory remaining to
transfer, disk transferred, disk remaining to transfer) to determine
if it is making forward progress.
- Leave it up to the admin / user to decide if it has gone on long enough
The first is easy, while the second is harder but probably more reliable
and useful for users.
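
As a rough illustration of the second option, forward progress could be
judged by remembering the lowest "data remaining" figure seen so far and
how long ago it last improved, with the crude timeout kept as a backstop.
This is only a sketch: the class name, the stall window and the idea of
feeding it one number per poll are assumptions made for illustration, not
what the nova libvirt driver actually does.

```python
class ProgressWatcher:
    """Track whether a migration is still making forward progress.

    Hypothetical helper: call update() each time the job stats are
    polled, passing a timestamp in seconds and the bytes remaining.
    """

    def __init__(self, stall_window=150.0, overall_timeout=600.0):
        self.stall_window = stall_window        # seconds without a new low
        self.overall_timeout = overall_timeout  # crude cap, eg 10 minutes
        self.start = None
        self.best_remaining = None
        self.best_time = None

    def update(self, now, remaining):
        """Record a sample and return 'ok', 'stalled' or 'timeout'."""
        if self.start is None:
            self.start = now
        # A new low water mark counts as forward progress.
        if self.best_remaining is None or remaining < self.best_remaining:
            self.best_remaining = remaining
            self.best_time = now
        if now - self.start >= self.overall_timeout:
            return "timeout"
        if now - self.best_time >= self.stall_window:
            return "stalled"
        return "ok"
```

Using a low water mark rather than comparing consecutive samples avoids
flagging a stall just because the guest dirtied a burst of pages between
two polls.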
Second is a question of what to do when it looks to be failing
- Cancel the migration - leave it running on source. Not good if the
admin is trying to evacuate a host.
- Pause the VM - make it complete as non-live migration. Not good if
the guest workload doesn't like being paused
- Increase the bandwidth permitted. There is a built-in rate limit in
QEMU overridable via nova.conf. Could argue that the admin should just
set their desired limit in nova.conf and be done with it, but perhaps
there's a case for increasing it in special circumstances. eg emergency
evacuate of host it is better to waste bandwidth & complete the job,
but for non-urgent scenarios better to limit bandwidth & accept failure ?
- Increase the maximum downtime permitted. This is the small time window
when the guest switches from source to dest. Too small and it'll never
switch, too large and it'll suffer unacceptable interruption.
We could do some of these things automatically based on some policy
or leave them up to the cloud admin/tenant user via new APIs
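
To make the "automatically based on some policy" idea concrete, here is a
minimal sketch of one possible policy that picks between the four actions
listed above. Everything in it is hypothetical: the Action names, the
parameters and especially the ordering (least disruptive knob first, and
preferring a pause over a cancel during evacuation) are just one arguable
choice, not a proposal for the actual behaviour.

```python
from enum import Enum


class Action(Enum):
    CANCEL = "cancel"
    PAUSE = "pause"
    RAISE_BANDWIDTH = "raise-bandwidth"
    RAISE_DOWNTIME = "raise-downtime"


def choose_action(evacuating, pause_ok, bandwidth, bandwidth_cap,
                  downtime_ms, downtime_cap_ms):
    """Pick what to do with a migration that looks to be failing.

    Hypothetical policy: the caps are whatever the admin is willing to
    tolerate, the current values are what the migration runs with now.
    """
    # Try the tunables first, since they leave the guest running.
    if bandwidth < bandwidth_cap:
        return Action.RAISE_BANDWIDTH
    if downtime_ms < downtime_cap_ms:
        return Action.RAISE_DOWNTIME
    # Out of tunables. When evacuating a host, leaving the guest on the
    # source is the worse outcome, so pause even if the guest dislikes it.
    if pause_ok or evacuating:
        return Action.PAUSE
    return Action.CANCEL
```

With the libvirt python bindings the four actions would presumably map to
the domain's abortJob(), suspend(), migrateSetMaxSpeed() and
migrateSetMaxDowntime() calls, though how nova would wire that up is left
open here.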
Third, there's a question of other QEMU features we could make use of to
stop problems in the first place
- Auto-converge flag - if you set this QEMU throttles back the CPUs
so the guest cannot dirty ram pages as quickly. This is nicer than
pausing CPUs altogether, but could still be an issue for guests
which have strong performance requirements
- Page compression flag - if you set this QEMU does compression of
pages to reduce data that has to be sent. This is basically trading
off network bandwidth vs CPU burn. Probably a win unless you are
already highly overcommitted on CPU on the host
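
A back-of-envelope model shows why both flags can help. Migration only
converges when data is sent faster than the guest dirties it, and the
switch to the destination can happen once what remains could be sent
within the permitted downtime. The function below is a deliberately
simplified sketch assuming constant rates; the name and parameters are
made up for illustration and it is not QEMU's actual algorithm.

```python
def estimate_seconds_to_switch(remaining_mb, bandwidth_mb_s, dirty_mb_s,
                               max_downtime_s):
    """Estimate seconds until switchover, or None if it never converges.

    Assumes (unrealistically) constant bandwidth and dirty rate, both in
    MB/s, with the outstanding data given in MB.
    """
    # Data that could be sent while the guest is stopped for the switch.
    threshold_mb = bandwidth_mb_s * max_downtime_s
    if remaining_mb <= threshold_mb:
        return 0.0
    # Outstanding data only shrinks if we send faster than it is dirtied.
    net_rate = bandwidth_mb_s - dirty_mb_s
    if net_rate <= 0:
        return None
    return (remaining_mb - threshold_mb) / net_rate
```

In this model auto-converge corresponds to lowering dirty_mb_s, while
page compression corresponds to raising the effective bandwidth_mb_s at
the cost of host CPU. For example, a guest dirtying 120 MB/s over a
100 MB/s link never converges, but throttled to 60 MB/s it does.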
Fourth there's a question of whether we should give the tenant user or
cloud admin further APIs for influencing migration
- Add an explicit API for cancelling migration ?
- Add APIs for setting tunables like downtime, bandwidth on the fly ?
- Or drive some of the tunables like downtime, bandwidth, or policies
like cancel vs paused from flavour or image metadata properties ?
- Allow operations like evacuate to specify a live migration policy
eg switch to non-live migration after 5 minutes ?
The current code is so crude and there's a hell of a lot of options we
can take. I'm just not sure which is the best direction for us to go.
What kind of things would be the biggest win from operators' or tenants'
point of view ?
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|