[openstack-dev] [nova] post-copy live migration
Paul Carlton
paul.carlton2 at hpe.com
Tue Apr 5 16:03:33 UTC 2016
On 05/04/16 16:33, Daniel P. Berrange wrote:
> On Tue, Apr 05, 2016 at 05:17:41PM +0200, Luis Tomas wrote:
>> Hi,
>>
>> We are working on the possibility of including post-copy live migration into
>> Nova (https://review.openstack.org/#/c/301509/)
>>
>> At libvirt level, post-copy live migration works as follows:
>> - Start live migration with a post-copy enabler flag
>> (VIR_MIGRATE_POSTCOPY). Note this does not mean the migration is performed
>> in post-copy mode, just that you can switch it to post-copy at any given
>> time.
>> - Change the migration from pre-copy to post-copy mode.
>>
>> However, we are not sure what's the most convenient way of providing this
>> functionality at Nova level.
>> The current spec proposes an optional flag on the live migration
>> API to include the VIR_MIGRATE_POSTCOPY flag when starting the live
>> migration. Then we propose a second API to actually switch the migration
>> from pre-copy to post-copy mode, similarly to how it is done in libvirt. This
>> is also similar to how the new "force-migrate" option works to ensure
>> migration completion. In fact, this method could be an extension of
>> force-migrate: switch to post-copy if the migration was started with
>> the VIR_MIGRATE_POSTCOPY libvirt flag, or pause it otherwise.
>>
>> The downside of this approach is that we expose too specific a mechanism
>> through the API. To alleviate this, we could remove the "switch" API and
>> automate the switch based on data transferred, available bandwidth or
>> other related metrics. However, we will still need the extension to the
>> live-migration API to include the proper libvirt postcopy flag.
> No, we absolutely don't want to expose that in the API as a concept, as it
> is a private technical implementation detail of the KVM migration code.
>
>> The other solution is to start all the migrations with the
>> VIR_MIGRATE_POSTCOPY mode, and therefore no new APIs would be needed. The
>> system could automatically detect the migration is taking too long (or is
>> dirtying memory faster than the sending rate), and automatically switch to
>> post-copy.
> Yes this is what we should be doing as default behaviour with new enough
> QEMU IMHO.
>
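As a rough illustration of the automatic switch discussed here (a sketch only, not taken from the spec, Nova or libvirt): the decision could be a small pure function over periodic progress samples. Every name below (MigrationSample, should_switch_to_postcopy, postcopy_allowed, max_precopy_s) is hypothetical, and the 300 second timeout is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass
class MigrationSample:
    """One progress sample for a running migration (hypothetical fields)."""
    elapsed_s: float        # seconds since the migration started
    dirty_rate_bps: float   # bytes/s of guest memory being dirtied
    send_rate_bps: float    # bytes/s actually sent to the destination


def should_switch_to_postcopy(sample, postcopy_allowed=True, max_precopy_s=300.0):
    """Return True when pre-copy looks unlikely to converge.

    postcopy_allowed is a hypothetical host-level opt-out switch;
    max_precopy_s is an arbitrary illustrative timeout.
    """
    if not postcopy_allowed:
        return False
    # The migration has been running longer than we are willing to wait.
    too_slow = sample.elapsed_s > max_precopy_s
    # The guest dirties memory at least as fast as we can send it.
    not_converging = sample.dirty_rate_bps >= sample.send_rate_bps
    return too_slow or not_converging
```

The point of keeping this as a pure function is that the heuristic can be tuned and unit-tested without a hypervisor; where the samples come from is left open.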
>> The downside of this is that including the VIR_MIGRATE_POSTCOPY flag has an
>> overhead, and it will not be desirable to include it for all migrations,
>> especially if they can be nicely migrated in pre-copy mode. In addition, if
>> the migration fails after the switch, the VM will be lost. Therefore,
>> admins may want to ensure that post-copy is not used for some specific VMs.
> We shouldn't be trying to run before we can walk. Even if post-copy
> hurts some guests, it'll still be a net win overall because it will
> give a guarantee that migration can complete without needing to stop
> guest CPUs entirely. All we need to start with is a nova.conf setting
> to let the admin turn off use of post-copy for the host for cases where
> we want to prioritize performance over the ability to migrate successfully.
>
> Any plan wrt changing migration behaviour on a per-VM basis needs to
> consider a much broader set of features than just post-copy. For example,
> compression, autoconverge and max-downtime settings all have an overhead
> or impact on the guest too. We don't want to end up exposing API flags to
> turn any of these on/off individually. So any solution to this will have
> to look at a combination of usage context and some kind of SLA marker on
> the guest. eg if the migration is in the context of host-evacuate which
> absolutely must always complete in finite time, we should always use
> post-copy. If the migration is in the context of load-balancing workloads
> across hosts, then some aspect of guest SLA must inform whether Nova chooses
> to use post-copy, or compression or auto-converge, etc.
>
> Regards,
> Daniel
We talked about the SLA issue at the mid cycle. I seem to recall saying
I'd propose a spec for Newton so I should probably get to that.
The idea discussed then was to define instances as Cattle, Pets and
Pandas where cattle are expendable, Pets are less so and Pandas are high
value instances.
I also believe we need to know how important the migration is. For
example if the operator is trying to empty a node because they are
concerned it is likely to fail then they would set the migration as a high
importance task. On the other hand if they are moving instances as
part of a monthly maintenance task they may be more relaxed about the
outcome. If the migration is part of a de-fragmentation exercise the
operator might be fine with some instances not being able to be moved.
So my suggestion is that we add a flag to the live-migration operation
to allow the operator to specify high, medium or low importance. When
the migration is in progress the compute manager can use this setting
in conjunction with the instance SLA to determine how aggressive it
should be in trying to get the migration completed.
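To make the idea a bit more concrete, here is a minimal sketch of how the two inputs might be combined. It is only an illustration: the names (InstanceSLA, Importance, choose_tactic), the tactic strings and the scoring rule are all made up here, and the real mapping would be for the spec to define.

```python
from enum import IntEnum


class InstanceSLA(IntEnum):
    """Hypothetical instance classes, ordered by how valuable the guest is."""
    CATTLE = 0
    PET = 1
    PANDA = 2


class Importance(IntEnum):
    """Hypothetical operator-supplied importance of the migration."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def choose_tactic(sla, importance):
    """Pick how aggressively to push a migration (illustrative rule only).

    The score rises with migration importance and falls with instance
    value, so expendable guests on urgent migrations get the most
    aggressive treatment.
    """
    score = int(importance) - int(sla)
    if score >= 1:
        return "post-copy"        # guarantee completion, accept the risk
    if score == 0:
        return "auto-converge"    # throttle the guest, but keep it safe
    return "pre-copy-only"        # never hurt the guest; may not complete


print(choose_tactic(InstanceSLA.CATTLE, Importance.HIGH))
print(choose_tactic(InstanceSLA.PANDA, Importance.LOW))
```

With this made-up rule, a high importance migration of a cattle instance may go to post-copy, while a low importance migration of a panda stays on plain pre-copy and is allowed to fail.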
Paul Carlton
Software Engineer
Cloud Services
Hewlett Packard
BUK03:T242
Longdown Avenue
Stoke Gifford
Bristol BS34 8QZ
Mobile: +44 (0)7768 994283
Email: mailto:paul.carlton2 at hpe.com