[openstack-dev] [nova] post-copy live migration
luis at cs.umu.se
Wed Apr 6 06:39:02 UTC 2016
On 04/05/2016 05:33 PM, Daniel P. Berrange wrote:
> On Tue, Apr 05, 2016 at 05:17:41PM +0200, Luis Tomas wrote:
>> We are working on the possibility of including post-copy live migration into
>> Nova (https://review.openstack.org/#/c/301509/)
>> At the libvirt level, post-copy live migration works as follows:
>> - Start live migration with a post-copy enabler flag
>> (VIR_MIGRATE_POSTCOPY). Note this does not mean the migration is performed
>> in post-copy mode, just that you can switch it to post-copy at any given
>> time.
>> - Change the migration from pre-copy to post-copy mode.
>> However, we are not sure what's the most convenient way of providing this
>> functionality at Nova level.
>> The current spec proposes to include an optional flag in the live migration
>> API to include the VIR_MIGRATE_POSTCOPY flag when starting the live
>> migration. Then we propose a second API to actually switch the migration
>> from pre-copy to post-copy mode, similarly to how it is done in libvirt. This
>> is also similar to how the new "force-migrate" option works to ensure
>> migration completion. In fact, this method could be an extension of
>> force-migrate, switching to post-copy if the migration was started with
>> the VIR_MIGRATE_POSTCOPY libvirt flag, or pausing the VM otherwise.
>> The downside of this approach is that we expose too specific a mechanism
>> through the API. To alleviate this, we could remove the "switch" API and
>> automate the switch based on data transferred, available bandwidth or
>> other related metrics. However, we would still need the extension to the
>> live-migration API to include the proper libvirt post-copy flag.
> No, we absolutely don't want to expose that in the API as a concept, as it
> is a private technical implementation detail of the KVM migration code.
I see the point and agree on trying not to expose this as an API,
especially the switch. In fact, as part of the ORBIT EU FP7 project, we
implemented post-copy for OpenStack Juno, where the switch to post-copy was
automatically triggered after the first iteration of memory copying.
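As a rough illustration of such an automatic trigger (a sketch only, not the
ORBIT or Nova code), the decision could look like the function below. The
MigrationStats fields and the require_non_convergence switch are hypothetical
names made up for this example; a real driver would fill them from whatever
job statistics the hypervisor reports.

```python
from dataclasses import dataclass


@dataclass
class MigrationStats:
    """Hypothetical snapshot of a running migration (illustrative names)."""
    completed_iterations: int  # full pre-copy passes over guest RAM so far
    dirty_rate_bps: float      # bytes/s of guest memory being dirtied
    transfer_rate_bps: float   # bytes/s actually sent to the destination
    postcopy_enabled: bool     # started with VIR_MIGRATE_POSTCOPY


def should_switch_to_postcopy(stats: MigrationStats,
                              require_non_convergence: bool = False) -> bool:
    """Decide whether to switch a pre-copy migration to post-copy.

    By default this switches as soon as one full pass has completed.
    With require_non_convergence=True (an assumption, not an existing
    option) it also requires memory to be dirtied at least as fast as
    it is being sent.
    """
    if not stats.postcopy_enabled:
        # Without the enabler flag the switch is not possible at all.
        return False
    if stats.completed_iterations < 1:
        # Wait for the first pass so most pages are already on the target.
        return False
    if require_non_convergence:
        return stats.dirty_rate_bps >= stats.transfer_rate_bps
    return True
```

The stricter variant keeps guests that converge on their own in pre-copy
mode, so they never take on the post-copy failure risk.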
On the other hand, I still see the point of including a flag to decide
the type of migration on a per-VM basis. Note that, even though what we have
available right now is the QEMU/libvirt implementation of post-copy,
post-copy in itself is a live migration type (where the migration process
is driven by the destination VM instead of the source VM), regardless of
how it is implemented underneath. This is unlike compression, autoconvergence
and max-downtime, which are extra settings for a given type of migration.
>> The other solution is to start all the migrations with the
>> VIR_MIGRATE_POSTCOPY flag, and therefore no new APIs would be needed. The
>> system could automatically detect that the migration is taking too long (or is
>> dirtying memory faster than the sending rate), and automatically switch to
>> post-copy.
> Yes this is what we should be doing as default behaviour with new enough
> QEMU IMHO.
>> The downside of this is that including the VIR_MIGRATE_POSTCOPY flag has an
>> overhead, and it will not be desirable to include it for all migrations,
>> especially if they can be nicely migrated with pre-copy mode. In addition, if
>> the migration fails after the switch, the VM will be lost. Therefore,
>> admins may want to ensure that post-copy is not used for some specific VMs.
> We shouldn't be trying to run before we can walk. Even if post-copy
> hurts some guests, it'll still be a net win overall because it will
> give a guarantee that migration can complete without needing to stop
> guest CPUs entirely. All we need to start with is a nova.conf setting
> to let the admin turn off use of post-copy for the host for cases where
> we want to prioritize performance over the ability to migrate successfully.
My concern here is that it is not only about performance, but also about
reliability, as post-copy migrations cannot be recovered in case of a failure
during the migration process.
> Any plan wrt changing migration behaviour on a per-VM basis needs to
> consider a much broader set of features than just post-copy. For example,
> compression, autoconverge and max-downtime settings all have an overhead
> or impact on the guest too. We don't want to end up exposing API flags to
> turn any of these on/off individually. So any solution to this will have
> to look at a combination of usage context and some kind of SLA marker on
> the guest. eg if the migration is in the context of host-evacuate which
> absolutely must always complete in finite time, we should always use
> post-copy. If the migration is in the context of load-balancing workloads
> across hosts, then some aspect of guest SLA must inform whether Nova chooses
> to use post-copy, or compression or auto-converge, etc.
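To make sure I understand the idea, here is a minimal sketch of such a
context-plus-SLA policy. Everything in it is an assumption for illustration:
the Context and GuestSLA values, the feature names and the mapping between
them are hypothetical and do not correspond to any existing Nova API. Only
the host-evacuate rule comes directly from the description above.

```python
from enum import Enum
from typing import FrozenSet


class Context(Enum):
    HOST_EVACUATE = "host-evacuate"  # must complete in finite time
    LOAD_BALANCE = "load-balance"    # optional, can be retried later


class GuestSLA(Enum):
    """Hypothetical SLA marker attached to a guest."""
    PERFORMANCE = "performance"  # avoid slowing the guest down
    RELIABILITY = "reliability"  # never risk losing the guest
    BEST_EFFORT = "best-effort"  # no special requirements


def choose_features(context: Context, sla: GuestSLA,
                    host_allows_postcopy: bool = True) -> FrozenSet[str]:
    """Pick migration features from usage context and guest SLA.

    host_allows_postcopy stands in for the per-host admin setting
    discussed above. The fallbacks are illustrative guesses.
    """
    if context is Context.HOST_EVACUATE:
        # Evacuation must finish, so prefer post-copy when permitted.
        if host_allows_postcopy:
            return frozenset({"post-copy"})
        return frozenset({"auto-converge"})
    if sla is GuestSLA.PERFORMANCE:
        # Plain pre-copy: no feature that adds overhead to the guest.
        return frozenset()
    if sla is GuestSLA.RELIABILITY:
        # Avoid post-copy, since a failure after the switch loses the VM.
        return frozenset({"auto-converge"})
    if host_allows_postcopy:
        return frozenset({"post-copy"})
    return frozenset({"compression"})
```

With something like this, neither the user nor the admin would pick
individual features; they would only state the context and the SLA.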
Thanks for the valuable input and discussion!
Dr. Luis Tomás
Department of Computing Science
luis at cs.umu.se