[nova][ops] Trying to get per-instance live migration timeout action spec unstuck
I'm looking at this (previously approved [1]) spec again [2] and trying to sort out what needs to happen to reach agreement on this feature. Note the dependent blueprint is now complete in Stein [3].

The idea is pretty simple: provide new parameters to the live migration API to (1) override [libvirt]/live_migration_completion_timeout [4] and/or (2) provide a timeout action in case the provided (or configured) timeout is reached, which would override [libvirt]/live_migration_timeout_action [5].

The use case is also pretty simple: you can have a default timeout and action (abort) configured, but there could be cases where you need to override that on a per-instance basis to move a set of VMs off a host for maintenance, so you want to tell nova to force complete (post-copy or pause) in case of a timeout. The abort and force-complete actions are the same as in the API ([6] and [7] respectively).

There are two main sticking points against this in the review:

1. This can already be done using existing APIs (as noted) client-side if monitoring the live migration and it times out for whatever you consider a reasonable timeout at the time.

2. The libvirt driver is the only one that currently supports abort and force-complete.

For #1, while valid as a workaround, it is less than ideal since it would mean having to orchestrate that into any tooling that needs that kind of workaround, be that OSC, openstacksdk, python-novaclient, gophercloud, etc. I think it would be relatively simple to pass those parameters through with the live migration request down to nova-compute and have the parameters override the config options, and then it's natively supported in the API.

For #2, while also true, I think it is not a great reason *not* to support per-instance timeouts/actions in the API when we already have existing APIs that do the same thing and have the same backend compute driver limitations. To ease this, I think we can sort out two things:

a) Can other virt drivers that support live migration (xenapi, hyperv, vmware in tree, and powervm out of tree) also support abort and force-complete actions? John Garbutt at least thought it should be possible for xenapi at the Stein PTG. I don't know about the others - driver maintainers please speak up here. The next challenge would be getting driver maintainers to actually add that feature parity, but that need not be a priority for Stein as long as we know it's possible to add the support eventually.

b) There are pre-live migration checks that happen on the source compute before we initiate the actual guest transfer. If a user (admin) specified these new parameters and the driver does not support them, we could fail the live migration early. This wouldn't change the instance status but the migration would fail and an instance action event would be recorded to explain why it didn't work, and then the admin can retry without those parameters. This would shield us from exposing something in the API that could give a false sense of functionality when the backend doesn't support it.

Given all of this, are these reasonable compromises to continue trying to drive this feature forward, and more importantly, are other operators looking to see this functionality added to nova? Huawei public cloud operators want it because they routinely are doing live migrations as part of maintenance activities and want to be able to control these values per-instance. I assume there are other deployments that would like the same.
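To make that concrete, here is a rough sketch of what the request body could look like. The new parameter names ("timeout" and "timeout_action") are hypothetical placeholders, not necessarily what the spec settles on; the existing os-migrateLive fields are real:

    # Hypothetical os-migrateLive action body with the proposed
    # per-instance overrides; "timeout" and "timeout_action" are
    # placeholder names, not final spec parameters.
    import json

    body = {
        "os-migrateLive": {
            "host": None,               # let the scheduler pick the target
            "block_migration": "auto",
            # proposed per-instance overrides:
            "timeout": 600,             # would override [libvirt]/live_migration_completion_timeout
            "timeout_action": "force_complete",  # or "abort"
        }
    }
    print(json.dumps(body, indent=2))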
If this is something you'd like to see move forward, please speak up soon since the nova spec freeze for Stein is January 10.

[1] https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/live-mi...
[2] https://review.openstack.org/#/c/600613/
[3] https://blueprints.launchpad.net/nova/+spec/live-migration-force-after-timeo...
[4] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
[5] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
[6] https://developer.openstack.org/api-ref/compute/?expanded=#delete-abort-migr...
[7] https://developer.openstack.org/api-ref/compute/?expanded=#force-migration-c...

--
Thanks, Matt
On 12/18/2018 8:04 PM, Matt Riedemann wrote:
There are two main sticking points against this in the review:
1. This can already be done using existing APIs (as noted) client-side if monitoring the live migration and it times out for whatever you consider a reasonable timeout at the time.
2. The libvirt driver is the only one that currently supports abort and force-complete.
For #1, while valid as a workaround, it is less than ideal since it would mean having to orchestrate that into any tooling that needs that kind of workaround, be that OSC, openstacksdk, python-novaclient, gophercloud, etc. I think it would be relatively simple to pass those parameters through with the live migration request down to nova-compute and have the parameters override the config options, and then it's natively supported in the API.
I agree that it would be cleaner to support it in one place rather than needing to add timeout handling to all the various clients.
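To sketch what that pass-through could look like server-side (a rough illustration only: the function and parameter names are invented, and it assumes it runs inside nova where the [libvirt] options are registered):

    # Per-request values win; otherwise fall back to the existing
    # [libvirt] config options. Illustrative names, not nova's code.
    from oslo_config import cfg

    CONF = cfg.CONF  # assumes nova has registered the libvirt opts

    def effective_timeout_and_action(req_timeout=None, req_action=None):
        timeout = (req_timeout if req_timeout is not None
                   else CONF.libvirt.live_migration_completion_timeout)
        action = (req_action if req_action is not None
                  else CONF.libvirt.live_migration_timeout_action)
        return timeout, action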
For #2, while also true, I think it is not a great reason *not* to support per-instance timeouts/actions in the API when we already have existing APIs that do the same thing and have the same backend compute driver limitations. To ease this, I think we can sort out two things:
<snip>
b) There are pre-live migration checks that happen on the source compute before we initiate the actual guest transfer. If a user (admin) specified these new parameters and the driver does not support them, we could fail the live migration early. This wouldn't change the instance status but the migration would fail and an instance action event would be recorded to explain why it didn't work, and then the admin can retry without those parameters. This would shield us from exposing something in the API that could give a false sense of functionality when the backend doesn't support it.
I think this would be a reasonable way to handle it.
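For concreteness, a hypothetical sketch of such a pre-check (the capability flag name is invented; MigrationPreCheckError is nova's existing exception for source-side pre-check failures):

    # Refuse early when per-instance timeout/action parameters were
    # requested but the driver can't honor them. The instance stays
    # ACTIVE; the migration record goes to error and an instance
    # action event records the reason.
    from nova import exception

    def check_timeout_params_supported(driver, timeout=None, action=None):
        if timeout is None and action is None:
            return  # nothing requested, nothing to validate
        # hypothetical capability flag, not a real driver attribute
        if not getattr(driver, 'supports_migration_timeout_action', False):
            raise exception.MigrationPreCheckError(
                reason='Driver does not support per-instance live '
                       'migration timeout/action overrides')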
Given all of this, are these reasonable compromises to continue trying to drive this feature forward, and more importantly, are other operators looking to see this functionality added to nova? Huawei public cloud operators want it because they routinely are doing live migrations as part of maintenance activities and want to be able to control these values per-instance. I assume there are other deployments that would like the same.
We added nova extensions to the existing Wind River Titanium Cloud product to allow more control over the handling of live migrations because they're frequently used by our operators and have caused issues in the past. The new StarlingX project is more aligned with upstream, so it'd be good to have some sort of per-migration options available.

Chris
Given all of this, are these reasonable compromises to continue trying to drive this feature forward, and more importantly, are other operators looking to see this functionality added to nova? Huawei public cloud operators want it because they routinely are doing live migrations as part of maintenance activities and want to be able to control these values per-instance. I assume there are other deployments that would like the same.
I would say that any Telco would be very happy to have this in Nova.

Br, Tomi
1. This can already be done using existing APIs (as noted) client-side if monitoring the live migration and it times out for whatever you consider a reasonable timeout at the time.
There's another thing to point out here, which is that this is also already doable by adjusting (rightly libvirt-specific) config tunables on a compute node that is being evacuated. Those could be hot-reloadable, meaning they could be changed without restarting the compute service when the evac process begins. It doesn't let you control it per-instance, granted, but there *is* a server-side solution to this based on existing stuff.
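Roughly, that server-side workaround could look like the following (assuming the options are flagged mutable; the config and pidfile paths are deployment-specific guesses):

    # Tighten the tunable in nova.conf on the host being drained, then
    # SIGHUP nova-compute so oslo.config re-reads mutable options
    # without restarting the service.
    import configparser
    import os
    import signal

    CONF_PATH = '/etc/nova/nova.conf'  # assumed location

    cp = configparser.ConfigParser(interpolation=None)
    cp.read(CONF_PATH)
    if not cp.has_section('libvirt'):
        cp.add_section('libvirt')
    # e.g. shorten the timeout for the maintenance window
    cp.set('libvirt', 'live_migration_completion_timeout', '90')
    with open(CONF_PATH, 'w') as f:
        cp.write(f)  # note: configparser drops comments on rewrite

    pid = int(open('/var/run/nova/nova-compute.pid').read())  # assumed pidfile
    os.kill(pid, signal.SIGHUP)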
2. The libvirt driver is the only one that currently supports abort and force-complete.
For #1, while valid as a workaround, it is less than ideal since it would mean having to orchestrate that into any tooling that needs that kind of workaround, be that OSC, openstacksdk, python-novaclient, gophercloud, etc. I think it would be relatively simple to pass those parameters through with the live migration request down to nova-compute and have the parameters override the config options, and then it's natively supported in the API.
For #2, while also true, I think it is not a great reason *not* to support per-instance timeouts/actions in the API when we already have existing APIs that do the same thing and have the same backend compute driver limitations. To ease this, I think we can sort out two things:
a) Can other virt drivers that support live migration (xenapi, hyperv, vmware in tree, and powervm out of tree) also support abort and force-complete actions? John Garbutt at least thought it should be possible for xenapi at the Stein PTG. I don't know about the others - driver maintainers please speak up here. The next challenge would be getting driver maintainers to actually add that feature parity, but that need not be a priority for Stein as long as we know it's possible to add the support eventually.
I think that we asked Eric and he said that powervm would/could not support such a thing because they hand the process off to the hypervisor and don't pay attention to what happens after that (and/or can't cancel it). I know John said he thought it would be doable for xenapi, but even if it is, I'm not expecting it will happen. I'd definitely like to hear from the others.
b) There are pre-live migration checks that happen on the source compute before we initiate the actual guest transfer. If a user (admin) specified these new parameters and the driver does not support them, we could fail the live migration early. This wouldn't change the instance status but the migration would fail and an instance action event would be recorded to explain why it didn't work, and then the admin can retry without those parameters. This would shield us from exposing something in the API that could give a false sense of functionality when the backend doesn't support it.
This is better than nothing, granted. What I'm concerned about is not that $driver never supports these, but rather that $driver shows up later and wants *different* parameters. Or even that libvirt/kvm migration changes in such a way that these no longer make sense even for it. We already have an example of this in-tree today, where the recently-added libvirt post-copy mode makes the 'abort' option invalid.
Given all of this, are these reasonable compromises to continue trying to drive this feature forward, and more importantly, are other operators looking to see this functionality added to nova? Huawei public cloud operators want it because they routinely are doing live migrations as part of maintenance activities and want to be able to control these values per-instance. I assume there are other deployments that would like the same.
I don't need to hold this up if everyone else is on board, but I don't really want to +2 it. I'll commit to not -1ing it if it specifically confirms support before starting a migration that won't honor the requested limits.

--Dan
On 1/3/2019 3:57 PM, Dan Smith wrote:
Or even that libvirt/kvm migration changes in such a way that these no longer make sense even for it. We already have an example of this in-tree today, where the recently-added libvirt post-copy mode makes the 'abort' option invalid.
I'm not following you here. As far as I understand, post-copy in the libvirt driver is triggered on the force complete action, and only if (1) it's available and (2) nova is configured to allow it; otherwise the force complete action for the libvirt driver pauses the VM. The abort operation aborts the job in libvirt [1], which I believe triggers a rollback [2].

[1] https://github.com/openstack/nova/blob/8ef3d253a086e4f8575f5221d4515cda421ab...
[2] https://github.com/openstack/nova/blob/8ef3d253a086e4f8575f5221d4515cda421ab...

--
Thanks, Matt
Or even that libvirt/kvm migration changes in such a way that these no longer make sense even for it. We already have an example of this in-tree today, where the recently-added libvirt post-copy mode makes the 'abort' option invalid.
I'm not following you here. As far as I understand, post-copy in the libvirt driver is triggered on the force complete action, and only if (1) it's available and (2) nova is configured to allow it; otherwise the force complete action for the libvirt driver pauses the VM. The abort operation aborts the job in libvirt [1], which I believe triggers a rollback [2].
[1] https://github.com/openstack/nova/blob/8ef3d253a086e4f8575f5221d4515cda421ab... [2] https://github.com/openstack/nova/blob/8ef3d253a086e4f8575f5221d4515cda421ab...
Because in nova[0] we currently only switch to post-copy after we decide we're not making progress right? If we later allow a configuration where post-copy is the default from the start (as I believe is the actual current recommendation from the virt people[1]), and someone triggers a migration with a short timeout and abort action, we'll not be able to actually do the abort.

I'm guessing we'd just need to refuse a request where abort is specified with any timeout if post-copy will be used from the beginning. Since the API user can't know how the virt driver is configured, we just have to refuse to do the migration and hope they'll understand :)

0: Sorry, I shouldn't have said "in tree" because I meant "in the libvirt world"
1: look for "in summary" here: https://www.berrange.com/posts/2016/05/12/analysis-of-techniques-for-ensurin...

--Dan
On 1/3/2019 4:37 PM, Dan Smith wrote:
Because in nova[0] we currently only switch to post-copy after we decide we're not making progress right?
If you're referring to the "live_migration_progress_timeout" option, that has been deprecated and replaced in Stein with the live_migration_timeout_action option, which was a prerequisite for the per-instance timeout + action spec. In Stein, we only switch to post-copy if we hit live_migration_completion_timeout and live_migration_timeout_action=force_complete and live_migration_permit_post_copy=True (and libvirt/qemu are new enough for post-copy); otherwise we pause the guest. So I don't think the stalled progress stuff has applied for a while (OSIC found problems with it in Ocata and disabled/deprecated it).
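Paraphrased as pseudologic (a restatement of that behavior, not nova's literal code):

    # What happens when live_migration_completion_timeout fires in
    # Stein, per the description above. Illustrative only.
    def on_completion_timeout(conf, post_copy_available):
        if conf.live_migration_timeout_action == 'abort':
            return 'abort the migration job'
        # action == 'force_complete':
        if conf.live_migration_permit_post_copy and post_copy_available:
            return 'switch the running migration to post-copy'
        return 'pause the guest so the memory copy can converge'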
If we later allow a configuration where post-copy is the default from the start (as I believe is the actual current recommendation from the virt people[1]), and someone triggers a migration with a short timeout and abort action, we'll not be able to actually do the abort.
Sorry, but I don't understand this: how does "post-copy from the start" apply? If I specify a short timeout and abort action in the API, and the timeout is reached before the migration is complete, it should abort, just like if I abort it via the API. As noted above, post-copy should only be triggered once we reach the timeout, and if you override that action to abort (per instance, in the API), it should abort rather than switch to post-copy.

--
Thanks, Matt
Matt Riedemann <mriedemos@gmail.com> writes:
On 1/3/2019 4:37 PM, Dan Smith wrote:
Because in nova[0] we currently only switch to post-copy after we decide we're not making progress right?
If you're referring to the "live_migration_progress_timeout" option, that has been deprecated and replaced in Stein with the live_migration_timeout_action option, which was a prerequisite for the per-instance timeout + action spec.
In Stein, we only switch to post-copy if we hit live_migration_completion_timeout and live_migration_timeout_action=force_complete and live_migration_permit_post_copy=True (and libvirt/qemu are new enough for post-copy); otherwise we pause the guest.
So I don't think the stalled progress stuff has applied for a while (OSIC found problems with it in Ocata and disabled/deprecated it).
Yeah, I'm trying to point out something _other_ than current nova behavior.
If we later allow a configuration where post-copy is the default from the start (as I believe is the actual current recommendation from the virt people[1]), and someone triggers a migration with a short timeout and abort action, we'll not be able to actually do the abort.
Sorry, but I don't understand this: how does "post-copy from the start" apply? If I specify a short timeout and abort action in the API, and the timeout is reached before the migration is complete, it should abort, just like if I abort it via the API. As noted above, post-copy should only be triggered once we reach the timeout, and if you override that action to abort (per instance, in the API), it should abort rather than switch to post-copy.
You can't abort a post-copy migration once it has started. If we were to add an "always do post-copy" mode to Nova, per the recommendation from the post I linked, then we would start a migration in post-copy mode, which would make it un-cancel-able. That means not only could you not cancel it, but we would have to refuse to start the migration if the user requested an abort action via this new proposed API with any timeout value.

Anyway, my point here is just that libvirt already (but not nova/libvirt yet) has a live migration mode where we would not be able to honor a request of "abort after N seconds". If config specified that, we could warn or fail on startup, but via the API all we'd be able to do is refuse to start the migration. I'm just trying to highlight that baking "force/abort after N seconds" into our API is not only just libvirt-specific at the moment, but even libvirt-pre-copy specific.

--Dan
On 1/3/2019 5:45 PM, Dan Smith wrote:
You can't abort a post-copy migration once it has started. If we were to add an "always do post-copy" mode to Nova, per the recommendation from the post I linked, then we would start a migration in post-copy mode, which would make it un-cancel-able. That means not only could you not cancel it, but we would have to refuse to start the migration if the user requested an abort action via this new proposed API with any timeout value.
Anyway, my point here is just that libvirt already (but not nova/libvirt yet) has a live migration mode where we would not be able to honor a request of "abort after N seconds". If config specified that, we could warn or fail on startup, but via the API all we'd be able to do is refuse to start the migration. I'm just trying to highlight that baking "force/abort after N seconds" into our API is not only just libvirt-specific at the moment, but even libvirt-pre-copy specific.
OK, sorry, I'm following you now. I didn't make the connection that you were talking about something we could do in the future (in nova) to initiate the live migration in post-copy mode. Yeah I agree in that case if the user said abort we'd just have to reject it and say you can't do that based on how the source host is configured.

--
Thanks, Matt
On Thu, 3 Jan 2019 18:02:16 -0600, Matt Riedemann <mriedemos@gmail.com> wrote:
On 1/3/2019 5:45 PM, Dan Smith wrote:
You can't abort a post-copy migration once it has started. If we were to add an "always do post-copy" mode to Nova, per the recommendation from the post I linked, then we would start a migration in post-copy mode, which would make it un-cancel-able. That means not only could you not cancel it, but we would have to refuse to start the migration if the user requested an abort action via this new proposed API with any timeout value.
Anyway, my point here is just that libvirt already (but not nova/libvirt yet) has a live migration mode where we would not be able to honor a request of "abort after N seconds". If config specified that, we could warn or fail on startup, but via the API all we'd be able to do is refuse to start the migration. I'm just trying to highlight that baking "force/abort after N seconds" into our API is not only just libvirt-specific at the moment, but even libvirt-pre-copy specific.
OK, sorry, I'm following you now. I didn't make the connection that you were talking about something we could do in the future (in nova) to initiate the live migration in post-copy mode. Yeah I agree in that case if the user said abort we'd just have to reject it and say you can't do that based on how the source host is configured.
This seems like a reasonable way to handle the future case of a live migration initiated in post-copy mode.

Overall, I'm in support of the idea of adding finer-grained control over live migrations, given that multiple operators have expressed how useful it would be and it seems like a relatively simple change. It also sounds like we have answers for the concerns about bad UX: check pre-live-migration whether the driver supports the new parameters and fail fast in that case. And in the future, if live migrations can be initiated in post-copy mode, fail fast with instance action info similarly.

-melanie
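To illustrate that fail-fast handling, a hedged sketch: there is no "post-copy from the start" mode in nova today, so the config flag below is invented, while MigrationPreCheckError is nova's real pre-check exception.

    # Hypothetical refusal for a future "post-copy from the start"
    # mode: an abort action can't be honored once post-copy begins,
    # so fail the migration before it starts.
    from nova import exception

    def validate_timeout_action(conf, requested_action):
        # invented flag standing in for a future config option
        starts_in_post_copy = getattr(
            conf, 'live_migration_post_copy_from_start', False)
        if starts_in_post_copy and requested_action == 'abort':
            raise exception.MigrationPreCheckError(
                reason='Source host starts live migrations in post-copy '
                       'mode, which cannot be aborted once underway')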