1. This can already be done client-side using existing APIs (as noted): monitor the live migration and take action when it exceeds whatever you consider a reasonable timeout at the time.
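For illustration, a rough sketch of that client-side loop using python-novaclient at microversion 2.25 (the keystoneauth session setup is elided as `sess`, and the timeout/poll values are whatever you pick, not recommendations):

    import time

    from novaclient import client

    nova = client.Client('2.25', session=sess)  # 2.24+ exposes abort

    TIMEOUT = 600  # whatever you consider "reasonable" today
    POLL = 10

    def live_migrate_with_timeout(server_id, host=None):
        nova.servers.live_migrate(server_id, host, block_migration='auto')
        deadline = time.monotonic() + TIMEOUT
        while time.monotonic() < deadline:
            # GET /servers/{id}/migrations lists in-progress migrations
            if not nova.server_migrations.list(server_id):
                return  # no longer running; check server status separately
            time.sleep(POLL)
        # Timed out: abort anything still in flight
        # (DELETE /servers/{id}/migrations/{migration_id})
        for migration in nova.server_migrations.list(server_id):
            nova.server_migrations.live_migration_abort(server_id,
                                                        migration.id)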
There's another thing to point out here, which is that this is also already doable server-side by adjusting the (rightly libvirt-specific) config tunables on a compute node that is being evacuated. Those could be made hot-reloadable, meaning they could be changed without restarting the compute service when the evac process begins. It doesn't let you control it per-instance, granted, but there *is* a server-side solution to this based on existing stuff.
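Concretely, these are the sort of [libvirt] tunables meant here (values illustrative; note the completion timeout is interpreted per GiB of guest RAM + disk to transfer):

    [libvirt]
    # Abort the migration if the data transfer hasn't finished in this
    # many seconds per GiB of guest RAM + disk; 0 disables the timeout.
    live_migration_completion_timeout = 800
    # Let QEMU throttle the guest CPU so a busy guest can converge.
    live_migration_permit_auto_converge = true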
2. The libvirt driver is the only one that currently supports abort and force-complete.
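For reference, the existing calls being talked about:

    # Abort an in-progress live migration (microversion >= 2.24):
    DELETE /servers/{server_id}/migrations/{migration_id}

    # Force an in-progress live migration to complete (microversion >= 2.22):
    POST /servers/{server_id}/migrations/{migration_id}/action
    {"force_complete": null}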
For #1, while valid as a workaround, it's less than ideal since it would mean orchestrating that logic into any tooling that needs it, be that OSC, openstacksdk, python-novaclient, gophercloud, etc. I think it would be relatively simple to pass these parameters through with the live migration request down to nova-compute, have them override the config options, and then it's natively supported in the API.
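In other words, something like the following request body. To be clear, the "completion_timeout" and "timeout_action" fields are purely illustrative names for the proposal being discussed, not an existing API:

    POST /servers/{server_id}/action

    {
        "os-migrateLive": {
            "host": null,
            "block_migration": "auto",
            "completion_timeout": 300,
            "timeout_action": "force_complete"
        }
    }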
For #2, while also true, I don't think it's a great reason *not* to support per-instance timeouts/actions in the API when we already have existing APIs that do the same thing and have the same backend compute driver limitations. To ease this, I think we can sort out two things:
a) Can other virt drivers that support live migration (xenapi, hyperv, vmware in tree, and powervm out of tree) also support abort and force-complete actions? John Garbutt at least thought it should be possible for xenapi at the Stein PTG. I don't know about the others - driver maintainers please speak up here. The next challenge would be getting driver maintainers to actually add that feature parity, but that need not be a priority for Stein as long as we know it's possible to add the support eventually.
I think that we asked Eric and he said that powervm would/could not support such a thing because they hand the process off to the hypervisor and don't pay attention to what happens after that (and/or can't cancel it). I know John said he thought it would be doable for xenapi, but even if it is, I'm not expecting it to happen. I'd definitely like to hear from the others.
b) There are pre-live-migration checks that happen on the source compute before we initiate the actual guest transfer. If a user (admin) specified these new parameters and the driver does not support them, we could fail the live migration early. This wouldn't change the instance status, but the migration would fail and an instance action event would be recorded to explain why it didn't work; the admin could then retry without those parameters. This would shield us from exposing something in the API that could give a false sense of functionality when the backend doesn't support it.
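As a sketch (not actual nova code), that early failure could slot into the source-side pre-checks something like this; the 'supports_per_instance_live_migration_timeout' capability flag is hypothetical, though drivers do advertise similar booleans in their capabilities dict today:

    from nova import exception

    def _check_timeout_params_supported(driver, completion_timeout,
                                        timeout_action):
        # Hypothetical capability flag a driver would opt in to.
        supported = driver.capabilities.get(
            'supports_per_instance_live_migration_timeout', False)
        if ((completion_timeout is not None or timeout_action is not None)
                and not supported):
            raise exception.MigrationPreCheckError(
                reason='This compute driver does not support per-instance '
                       'live migration timeouts/actions.')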
This is better than nothing, granted. What I'm concerned about is not that $driver never supports these, but rather that $driver shows up later and wants *different* parameters. Or even that libvirt/kvm migration changes in such a way that these no longer make sense even for it. We already have an example of this in-tree today, where the recently-added libvirt post-copy mode makes the 'abort' option invalid.
Given all of this, are these reasonable compromises to continue trying to drive this feature forward, and more importantly, are other operators looking to see this functionality added to nova? Huawei public cloud operators want it because they routinely do live migrations as part of maintenance activities and want to be able to control these values per-instance. I assume there are other deployments that would like the same.
I don't need to hold this up if everyone else is on board, but I don't really want to +2 it. I'll commit to not -1ing it if it specifically confirms support before starting, instead of kicking off a migration that won't honor the requested limits. --Dan