[nova][ops] Trying to get per-instance live migration timeout action spec unstuck

Dan Smith dms at danplanet.com
Thu Jan 3 23:45:25 UTC 2019


Matt Riedemann <mriedemos at gmail.com> writes:

> On 1/3/2019 4:37 PM, Dan Smith wrote:
>> Because in nova[0] we currently only switch to post-copy after we decide
>> we're not making progress, right?
>
> If you're referring to the "live_migration_progress_timeout" option,
> that has been deprecated and was replaced in Stein with the
> live_migration_timeout_action option, which was a prerequisite for
> the per-instance timeout + action spec.
>
> In Stein, we only switch to post-copy if we hit
> live_migration_completion_timeout and
> live_migration_timeout_action=force_complete and
> live_migration_permit_post_copy=True (and libvirt/qemu are new enough
> for post-copy), otherwise we pause the guest.
>
> So I don't think the stalled progress stuff has applied for a while
> (OSIC found problems with it in Ocata and disabled/deprecated it).
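For illustration only, here is a minimal Python sketch of the Stein behavior described in the quoted text: what happens once live_migration_completion_timeout is reached. It is not nova's actual code. The function name `timeout_action_outcome`, its parameters, and the returned strings are hypothetical. `post_copy_supported` stands in for the "libvirt/qemu are new enough" check, and the optional `per_instance_action` models the proposed per-instance API override from the spec, which did not exist yet.

```python
from typing import Optional

# Hypothetical sketch, not nova code: outcome once
# live_migration_completion_timeout has been reached.
VALID_ACTIONS = ("abort", "force_complete")


def timeout_action_outcome(
    timeout_action: str,
    permit_post_copy: bool,
    post_copy_supported: bool,
    per_instance_action: Optional[str] = None,
) -> str:
    """Return "abort", "post_copy", or "pause".

    timeout_action mirrors live_migration_timeout_action,
    permit_post_copy mirrors live_migration_permit_post_copy,
    post_copy_supported stands in for the libvirt/QEMU version check,
    per_instance_action models the proposed (hypothetical) API override.
    """
    # A per-instance action, if given, overrides the config option.
    action = per_instance_action or timeout_action
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown timeout action: {action!r}")
    if action == "abort":
        # An explicit abort wins, even when post-copy is permitted.
        return "abort"
    # force_complete: switch to post-copy if it is allowed and available...
    if permit_post_copy and post_copy_supported:
        return "post_copy"
    # ...otherwise pause the guest so the pre-copy transfer can finish.
    return "pause"
```

For example, `timeout_action_outcome("force_complete", True, True)` yields `"post_copy"`, while passing `per_instance_action="abort"` to the same call yields `"abort"`, matching the expectation that a per-instance abort should take precedence over switching to post-copy.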

Yeah, I'm trying to point out something _other_ than current nova
behavior.

>> If we later allow a configuration where
>> post-copy is the default from the start (as I believe is the actual
>> current recommendation from the virt people[1]), and someone triggers a
>> migration with a short timeout and abort action, we'll not be able to
>> actually do the abort.
>
> Sorry, but I don't understand this: how does "post-copy from the start"
> apply? If I specify a short timeout and abort action in the API, and
> the timeout is reached before the migration is complete, it should
> abort, just like if I abort it via the API. As noted above, post-copy
> should only be triggered once we reach the timeout, and if you
> override that action to abort (per instance, in the API), it should
> abort rather than switch to post-copy.

You can't abort a post-copy migration once it has started. If we were to
add an "always do post-copy" mode to Nova, per the recommendation from
the post I linked, then we would start a migration in post-copy mode,
which would make it uncancelable. That means not only could you not
cancel it, but we would also have to refuse to start the migration if
the user requested an abort action via this new proposed API, with any
timeout value.

Anyway, my point here is just that libvirt already (but not nova/libvirt
yet) has a live migration mode where we would not be able to honor a
request of "abort after N seconds". If config specified that, we could
warn or fail on startup, but via the API all we'd be able to do is
refuse to start the migration. I'm just trying to highlight that
baking "force/abort after N seconds" into our API is not only
libvirt-specific at the moment, but even specific to libvirt pre-copy.

--Dan



More information about the openstack-discuss mailing list