[nova][ops] Trying to get per-instance live migration timeout action spec unstuck
Matt Riedemann
mriedemos at gmail.com
Wed Dec 19 02:04:00 UTC 2018
I'm looking at this (previously approved [1]) spec again [2] and trying
to sort out what needs to happen to reach agreement on this feature.
Note the dependent blueprint is now complete in Stein [3].
The idea is pretty simple: provide new parameters to the live migration
API to (1) override [libvirt]/live_migration_completion_timeout [4]
and/or (2) provide a timeout action in case the provided (or configured)
timeout is reached which would override
[libvirt]/live_migration_timeout_action [5].
The use case is also pretty simple: you can have a default timeout and
action (abort) configured but there could be cases where you need to
override that on a per-instance basis to move a set of VMs off a host
for maintenance, so you want to tell nova to force complete (post-copy
or pause) in case of a timeout.
The abort and force-complete actions are the same as in the API ([6] and
[7] respectively).
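
As a rough illustration of the idea above, here is a minimal sketch of a request body builder for the existing os-migrateLive server action with the two optional overrides added. The parameter names "timeout" and "timeout_action" are hypothetical placeholders, not names taken from the spec; the allowed action values mirror the choices of [libvirt]/live_migration_timeout_action.

```python
# Sketch only: "timeout" and "timeout_action" are HYPOTHETICAL parameter
# names used for illustration; the spec under review defines the real ones.
ALLOWED_ACTIONS = ("abort", "force_complete")


def build_live_migrate_body(host=None, block_migration="auto",
                            timeout=None, timeout_action=None):
    """Build an os-migrateLive body with optional per-instance overrides.

    Parameters left as None are omitted, so the configured
    [libvirt] defaults would still apply on the compute host.
    """
    if timeout is not None and timeout < 0:
        raise ValueError("timeout must be >= 0")
    if timeout_action is not None and timeout_action not in ALLOWED_ACTIONS:
        raise ValueError("timeout_action must be one of %s"
                         % (ALLOWED_ACTIONS,))

    params = {"host": host, "block_migration": block_migration}
    if timeout is not None:
        params["timeout"] = timeout                # hypothetical field
    if timeout_action is not None:
        params["timeout_action"] = timeout_action  # hypothetical field
    return {"os-migrateLive": params}


# Maintenance use case: force completion if the migration takes too long.
body = build_live_migrate_body(timeout=300, timeout_action="force_complete")
```

Omitting both overrides yields a body identical to what clients send today, which is the backward-compatible behavior one would expect from such a change.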
There are two main sticking points against this in the review:
1. This can already be done client-side using existing APIs (as noted):
monitor the live migration and, if it runs past whatever you consider a
reasonable timeout at the time, abort or force-complete it yourself.
2. The libvirt driver is the only one that currently supports abort and
force-complete.
For #1, while that is valid as a workaround, it is less than ideal since
it would mean having to orchestrate that into any tooling that needs
that kind of workaround, be that OSC, openstacksdk, python-novaclient,
gophercloud, etc. I think it would be relatively simple to pass those
parameters through with the live migration request down to nova-compute
and have the parameters override the config options, and then it's
natively supported in the API.
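
For reference, the client-side orchestration described in #1 amounts to roughly the loop below, which every tool would need to carry. This is a sketch under assumptions: the three callables are hypothetical wrappers a client would write around the existing APIs (showing the migration, abort [6], and force-complete [7]), and the set of terminal migration statuses is assumed rather than exhaustive.

```python
import time

# Assumed terminal migration statuses; adjust for the deployment.
TERMINAL_STATUSES = ("completed", "error", "failed", "cancelled")


def watch_live_migration(get_status, abort, force_complete, timeout,
                         action="abort", poll_interval=5.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll a live migration and act on it if it exceeds `timeout` seconds.

    get_status, abort and force_complete are caller-supplied callables
    (hypothetical wrappers around the show-migration, abort [6] and
    force-complete [7] APIs). Returns the terminal status if the migration
    finished on its own, otherwise the name of the action that was taken.
    """
    if action not in ("abort", "force_complete"):
        raise ValueError("unknown action: %r" % (action,))

    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            # Timed out client-side: apply the requested action once.
            if action == "abort":
                abort()
            else:
                force_complete()
            return action
        sleep(poll_interval)
```

The clock and sleep arguments are injectable only so the loop can be exercised without real waiting.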
For #2, while that is also true, I don't think it is a great reason *not*
to support per-instance timeouts/actions in the API when we already have
existing APIs that do the same thing and have the same backend compute
driver limitations. To ease this, I think we can sort out two things:
a) Can other virt drivers that support live migration (xenapi, hyperv,
vmware in tree, and powervm out of tree) also support abort and
force-complete actions? John Garbutt at least thought it should be
possible for xenapi at the Stein PTG. I don't know about the others -
driver maintainers please speak up here. The next challenge would be
getting driver maintainers to actually add that feature parity, but that
need not be a priority for Stein as long as we know it's possible to add
the support eventually.
b) There are pre-live migration checks that happen on the source compute
before we initiate the actual guest transfer. If a user (admin)
specified these new parameters and the driver does not support them, we
could fail the live migration early. This wouldn't change the instance
status but the migration would fail and an instance action event would
be recorded to explain why it didn't work, and then the admin can retry
without those parameters. This would shield us from exposing something
in the API that could give a false sense of functionality when the
backend doesn't support it.
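
The early failure described in (b) could look something like the sketch below. It is not nova code: the capability keys and the exception class are hypothetical names chosen for illustration, and the rule that a timeout override needs at least one supported action is an assumption.

```python
class LiveMigrationPreCheckError(Exception):
    """Hypothetical error raised before the guest transfer starts."""


# HYPOTHETICAL capability keys a virt driver could advertise.
ACTION_CAPABILITIES = {
    "abort": "supports_live_migration_abort",
    "force_complete": "supports_live_migration_force_complete",
}


def check_timeout_overrides(capabilities, timeout=None, timeout_action=None):
    """Fail early if per-instance overrides are requested but unsupported.

    `capabilities` is a dict of driver capability flags. When neither
    override is requested the check is a no-op, so existing requests are
    unaffected.
    """
    if timeout is None and timeout_action is None:
        return

    if timeout_action is not None:
        required = ACTION_CAPABILITIES.get(timeout_action)
        if required is None:
            raise ValueError("unknown timeout action: %r" % (timeout_action,))
        if not capabilities.get(required, False):
            raise LiveMigrationPreCheckError(
                "driver does not support timeout action %r" % timeout_action)
    elif not any(capabilities.get(cap, False)
                 for cap in ACTION_CAPABILITIES.values()):
        # Assumption: a timeout override is meaningless if the driver
        # cannot act on a timeout at all.
        raise LiveMigrationPreCheckError(
            "driver does not support per-instance timeouts")
```

A failure raised at this point would be what gets recorded in the instance action event, and the admin could then retry without the overrides.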
Given all of this, are these reasonable compromises to continue trying
to drive this feature forward, and more importantly, are other operators
looking to see this functionality added to nova? Huawei public cloud
operators want it because they routinely do live migrations as part of
maintenance activities and want to be able to control these values
per-instance. I assume there are other deployments that would like the
same.
If this is something you'd like to see move forward, please speak up
soon since the nova spec freeze for Stein is January 10.
[1]
https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/live-migration-per-instance-timeout.html
[2] https://review.openstack.org/#/c/600613/
[3]
https://blueprints.launchpad.net/nova/+spec/live-migration-force-after-timeout
[4]
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_completion_timeout
[5]
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_timeout_action
[6]
https://developer.openstack.org/api-ref/compute/?expanded=#delete-abort-migration
[7]
https://developer.openstack.org/api-ref/compute/?expanded=#force-migration-complete-action-force-complete-action
--
Thanks,
Matt