[nova][cinder] An API to replace volume-update (aka swap-volume)
Consider this a straw-man proposal for the upcoming cross-project discussion at PTG. I'm not (currently, at any rate) proposing to do this work myself, which is also why I haven't put this in a spec.

Volume-update is a problematic interface. There are likely more problems than these, but the ones which come to mind are:

* It shares state between cinder and nova while running. A failure of either leaves an inconsistent state which is not easily recoverable.
* It is fragile: it requires that the instance is running. Stopping an instance while it is in progress is undefined at best. It will fail if an instance is not running.
* It is mis-used by end-users to copy data between different volumes, which results in a strange, unsupportable instance state.
* It is slow.
* It results in data corruption for multiattached volumes.

I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.

* volume-migration-start(src_attachment_id)

After returning, nova will not write any data to src until the operation completes or is cancelled. Nova will return an error if it is not able to suspend writes to src.

Start is a synchronous operation. Nova will not return until it can guarantee that there will be no further writes to src.

* volume-migration-cancel(src_attachment_id)

Nova will resume writing to src. A subsequent complete call will return an error.

Cancel is an asynchronous operation. Nova assumes it can resume writing to src immediately, but does not guarantee when it will finish the cancel operation. This has no impact on cinder, but nova instance operations may continue to be restricted until the cancel completes.

* volume-migration-complete(src_attachment_id, dst_attachment_id)

Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.

Nova will not return until it is safe for cinder to remove src, but the complete operation is not guaranteed to have finished at this time. This has no impact on cinder, but nova instance operations may continue to be restricted until it completes asynchronously.

The implementation on the cinder side would be:

volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)

Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself. Using driver-specific optimisations this has the potential to be very fast, but even the fallback implementation of a full copy between 2 arrays should be substantially faster than a qemu block job on the compute host.

It is entirely up to cinder to ensure this operation is safe wrt multiattach. I expect 'safe' here to mean it will refuse to do the operation.

There may also be details to be worked out wrt who creates the dst attachment. If at all possible, I don't want to expose to nova the quirk that under cinder's covers a new volume is created which temporarily has the wrong volume id. I'd prefer that, from nova's POV, the operation is just a managed switch between 2 attachments to apparently the same volume.

2 possible implementations on the Nova side:

1. Hypervisor does not support live storage migration

Summary: instance must be shut down.

volume-migration-start() returns success iff the instance is shut down. Nova will set a task state such that the instance cannot be started until cancel or complete.

volume-migration-complete() reconfigures the specific volume attachment and unsets the task state so the instance can be started.

2. libvirt driver

Summary: libvirt allows writes to continue during the copy by sending them to a local snapshot, which it commits to dst on complete().

volume-migration-start() sets a task state such that the instance cannot be started or stopped until cancel or completion. If the instance is not running, it does nothing else. If the instance is running, it creates a local writable qcow2 overlay backed by the read-only src volume, and swaps the instance's volume to the local qcow2.

volume-migration-cancel() does nothing if the instance is not running except unset the task state. If the instance is running, it returns success to cinder immediately, then commits the qcow2 to the src volume and unsets the task state.

volume-migration-complete(), if the instance is not running, updates the instance to use dst, unsets the task state and returns immediately. If the instance is running, it rebases the local qcow2 from src to dst (cinder guarantees us that these are identical), then returns success to cinder because src can now be released. Nova starts a job to commit the qcow2 to dst. On completion it updates the instance to use dst and unsets the task state.

This API does not handle the case where the hypervisor is only capable of live storage migration where it manages the entire copy itself. If we had a requirement to support this we might change volume-migration-start() to return something indicating that nova will do the copy, or perhaps indicating that cinder should try again with a different entry point. However, I believe this approach will always be slower than a cinder-native copy, and I can't see any way around it requiring a callback to cinder on completion of the copy. I'd very much like some feedback from non-libvirt hypervisor folks around whether they're interested in this functionality at all, and if so what their capabilities are.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.
Overall I really like this proposal from a high level. Some additional data points below that folks need to keep in mind as we think through how this could work.
Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.
You actually mean volume-migration-complete, right?
The implementation on the cinder side would be:
volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)
Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself.
This could be problematic.

This works great if the migration is from one volume to another volume on the same backend. In that case it could be very fast (for some backends almost instantaneous).

If the migration needs to take place between two different backends, then we could have a problem. In that case, both the source and destination volumes would need to be mounted on some host to perform basically a dd from one to the other. The Nova host already has access to the volume, so doing it there avoids the need to set up any other access to the volumes. Otherwise we need to mount both volumes to the c-vol host to do the copy, putting Cinder in the data path.

We have some existing gotchas with things like image copy for this, which is why we've considered creating some kind of c-data data mover service that could be scaled out to multiple nodes and support different protocols. The biggest issue today is ensuring the Cinder node supports the storage transport protocol needed to access the volume - fibre channel HBAs or network connectivity for iSCSI.

Not saying this is a showstopper. I'm just making sure folks are aware of it so we don't start off under any assumptions that having Cinder take care of the migration is going to be a no-brainer.
On Tue, 23 Apr 2019 at 14:55, Sean McGinnis <sean.mcginnis@gmx.com> wrote:
I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.
Overall I really like this proposal from a high level. Some additional data points below that folks need to keep in mind as we think through how this could work.
Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.
You actually mean volume-migration-complete, right?
No. src goes read-only after start(), then cinder copies it to dst. So at the time cinder calls complete(), dst is going to be a copy of src from the time start() was called. cinder doesn't know (or need to know) anything about the overlay nova created locally. All cinder cares about is that nothing writes to src after it calls start(), and it must copy this data to dst.
The implementation on the cinder side would be:
volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)
Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself.
This could be problematic.
This works great if the migration is from one volume to another volume on the same backend. In that case it could be very fast (for some backends almost instantaneous).
If the migration needs to take place between two different backends, then we could have a problem. In that case, both the source and destination volumes would need to be mounted on some host to perform basically a dd from one to the other.
The Nova host already has access to the volume, so doing it there avoids the need to set up any other access to the volumes. Otherwise we need to mount both volumes to the c-vol host to do the copy, putting Cinder in the data path.
We have some existing gotchas with things like image copy for this, which is why we've considered creating some kind of c-data data mover service that could be scaled out to multiple nodes and support different protocols. The biggest issue today is ensuring the Cinder node supports the storage transport protocol needed to access the volume - fibre channel HBAs or network connectivity for iSCSI.
Right, I was assuming there would be a fallback, and it would basically be as slow and ugly as if nova did it. The difference is that if nova does it, it's always going to be slow as we can never take advantage of backend features, and it's always going to be ugly because storage isn't our thing, and we have no plans for an n-data data mover service ;)
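For illustration, that fallback data path is roughly the following sketch: both volumes attached to some host (in cinder's generic migration that would be the c-vol host, via os-brick) and a block-for-block copy between them. The device paths here are placeholders, not real connector output.

import subprocess

def fallback_copy(src_device="/dev/mapper/src-vol",
                  dst_device="/dev/mapper/dst-vol",
                  block_size="4M"):
    # Effectively a dd from one attached block device to the other. The host
    # doing the copy needs iSCSI/FC connectivity to both backends, which is
    # exactly the gotcha described above.
    subprocess.run(
        ["dd", f"if={src_device}", f"of={dst_device}", f"bs={block_size}",
         "oflag=direct", "conv=fsync"],
        check=True,
    )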
Not saying this is a showstopper. I'm just making sure folks are aware of it so we don't start off under any assumptions that having Cinder take care of the migration is going to be a no brainer.
Understood.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
There's something I didn't mention below, which is which interface to use to make the rpc calls described. We currently use the public-facing rest api, which is a problem when users call it in ways we didn't intend. This isn't really an end-user feature, though, so if there were some other private rpc mechanism that would be good to explore.

Matt Riedemann mentioned os-server-external-events in IRC, which we use for volume extend. I haven't looked into this yet, but my first thought is that it sounds like an asynchronous notification mechanism. If so we unfortunately wouldn't be able to use it for the proposal below, as we would need something synchronous.

For example, in the proposal below cinder needs to call start() *and wait for nova to return success* before it is safe to start the data copy. This isn't the case for volume extend, where cinder simply notifies nova on completion and no further coordination is required. However, if there is any way to use os-server-external-events, or any other non-public rpc mechanism, it would be good to discuss that.

Matt

On Tue, 23 Apr 2019 at 14:08, Matthew Booth <mbooth@redhat.com> wrote:
Consider this a straw-man proposal for the upcoming cross-project discussion at PTG. I'm not (currently, at any rate) proposing to do this work myself, which is also why I haven't put this in a spec.
Volume-update is a problematic interface. There are likely more problems than these, but the ones which come to mind are:
* It shares state between cinder and nova while running. A failure of either leaves an inconsistent state which is not easily recoverable.
* It is fragile: it requires that the instance is running. Stopping an instance while it is in progress is undefined at best. It will fail if an instance is not running.
* It is mis-used by end-users to copy data between different volumes, which results in a strange, unsupportable instance state.
* It is slow.
* It results in data corruption for multiattached volumes.
I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.
* volume-migration-start(src_attachment_id)
After returning, nova will not write any data to src until the operation completes or is cancelled. Nova will return an error if it is not able to suspend writes to src.
Start is a synchronous operation. Nova will not return until it can guarantee that there will be no further writes to src.
* volume-migration-cancel(src_attachment_id)
Nova will resume writing to src. A subsequent complete call will return an error.
Cancel is an asynchronous operation. Nova assumes it can resume writing to src immediately, but does not guarantee when it will finish the cancel operation. This has no impact on cinder, but nova instance operations may continue to be restricted until the cancel completes.
* volume-migration-complete(src_attachment_id, dst_attachment_id)
Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.
Nova will not return until it is safe for cinder to remove src, but the complete operation is not guaranteed to have finished at this time. This has no impact on cinder, but nova instance operations may continue to be restricted until it completes asynchronously.
The implementation on the cinder side would be:
volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)
Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself. Using driver-specific optimisations this has the potential to be very fast, but even the fallback implementation of a full copy between 2 arrays should be substantially faster than a qemu block job on the compute host.
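As a rough sketch of how that cinder-driven flow could hang together (the method names here are invented for illustration, not an existing interface), the key point is that the copy only starts after start() has returned success, and a failed copy is unwound with cancel():

def migrate_volume(nova, backend, src_attachment_id, dst_attachment_id):
    # Synchronous: nova does not return until writes to src are frozen.
    nova.volume_migration_start(src_attachment_id)
    try:
        # Cinder copies the data itself, ideally via a backend-optimised path.
        backend.copy(src_attachment_id, dst_attachment_id)
    except Exception:
        # Nova resumes writing to src; the migration is abandoned.
        nova.volume_migration_cancel(src_attachment_id)
        raise
    # dst is now a copy of src as of start(); nova switches over to dst and
    # returns as soon as it is safe for cinder to remove src.
    nova.volume_migration_complete(src_attachment_id, dst_attachment_id)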
It is entirely up to cinder to ensure this operation is safe wrt multiattach. I expect 'safe' here to mean it will refuse to do the operation.
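For example, the refusal could be as simple as something like this on the cinder side (hypothetical names, just to illustrate the intent):

def check_migration_safe(volume):
    # If the volume is multiattach and has more than one active attachment,
    # refuse to migrate rather than risk writers on other attachments
    # corrupting the copy.
    attachments = [a for a in volume.attachments
                   if a.attach_status == 'attached']
    if volume.multiattach and len(attachments) > 1:
        raise Exception("Refusing to migrate multiattach volume %s with %d "
                        "active attachments" % (volume.id, len(attachments)))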
There may also be details to be worked out wrt who creates the dst attachment. If at all possible, I don't want to expose to nova the quirk that under cinder's covers a new volume is created which temporarily has the wrong volume id. I'd prefer that, from nova's POV, the operation is just a managed switch between 2 attachments to apparently the same volume.
2 possible implementations on the Nova side:
1. Hypervisor does not support live storage migration

Summary: instance must be shut down.
volume-migration-start() returns success iff the instance is shut down. Nova will set a task state such that the instance cannot be started until cancel or complete.
volume-migration-complete() reconfigures the specific volume attachment and unsets the task state so the instance can be started.
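A minimal sketch of this non-live implementation, with hypothetical nova-side names and task states, just to make the contract concrete:

def volume_migration_start(instance, src_attachment_id):
    if instance.power_state != 'shutdown':
        # Nova cannot guarantee no further writes to src.
        raise Exception("Instance must be shut down for volume migration")
    # Hypothetical task state; blocks starting the instance until
    # cancel() or complete() is called.
    instance.task_state = 'volume_migrating'
    instance.save()

def volume_migration_complete(instance, src_attachment_id, dst_attachment_id):
    # Point the stopped instance's block device mapping at dst instead of src.
    # swap_block_device_mapping is a hypothetical helper.
    swap_block_device_mapping(instance, src_attachment_id, dst_attachment_id)
    instance.task_state = None
    instance.save()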
2. libvirt driver

Summary: libvirt allows writes to continue during the copy by sending them to a local snapshot, which it commits to dst on complete().
volume-migration-start() sets a task state such that the instance cannot be started or stopped until cancel or completion. If the instance is not running, it does nothing else. If the instance is running, it creates a local writable qcow2 overlay backed by the read-only src volume, and swaps the instance's volume to the local qcow2.
volume-migration-cancel() does nothing if the instance is not running except unset the task state. If the instance is running, it returns success to cinder immediately, then commits the qcow2 to the src volume and unsets the task state.
volume-migration-complete(), if the instance is not running, updates the instance to use dst, unsets the task state and returns immediately. If the instance is running, it rebases the local qcow2 from src to dst (cinder guarantees us that these are identical), then returns success to cinder because src can now be released. Nova starts a job to commit the qcow2 to dst. On completion it updates the instance to use dst and unsets the task state.
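To make the image manipulation concrete, the offline-equivalent qemu-img steps would look roughly like the sketch below. Paths are placeholders, and for a running instance the equivalent steps would be driven through libvirt block jobs rather than qemu-img directly; treat this as an illustration of the overlay/rebase/commit idea, not the implementation.

import subprocess

SRC = "/dev/disk/by-path/src-volume"   # placeholder attached src volume
DST = "/dev/disk/by-path/dst-volume"   # placeholder attached dst volume
OVERLAY = "/var/lib/nova/instances/UUID/migration-overlay.qcow2"  # placeholder

def start():
    # Create a local writable qcow2 overlay backed by the (now frozen) src
    # volume; guest writes land in the overlay, src stays untouched for cinder.
    subprocess.run(["qemu-img", "create", "-f", "qcow2",
                    "-b", SRC, "-F", "raw", OVERLAY], check=True)
    # ...then swap the instance's disk to OVERLAY and return to cinder.

def complete():
    # Cinder guarantees dst is a copy of src as of start(), so an unsafe (-u)
    # rebase that only rewrites the backing-file pointer is sufficient.
    subprocess.run(["qemu-img", "rebase", "-u",
                    "-b", DST, "-F", "raw", OVERLAY], check=True)
    # src is no longer referenced, so cinder can be told it is safe to remove
    # it. Asynchronously fold the overlay's writes down into dst, then point
    # the instance directly at dst and delete the overlay.
    subprocess.run(["qemu-img", "commit", OVERLAY], check=True)

def cancel():
    # Fold the overlay's writes back into src and drop the overlay.
    subprocess.run(["qemu-img", "commit", OVERLAY], check=True)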
This API does not handle the case where the hypervisor is only capable of live storage migration where it manages the entire copy itself. If we had a requirement to support this we might change volume-migration-start() to return something indicating that nova will do the copy, or perhaps indicating that cinder should try again with a different entry point. However, I believe this approach will always be slower than a cinder-native copy, and I can't see any way around it requiring a callback to cinder on completion of the copy. I'd very much like some feedback from non-libvirt hypervisor folks around whether they're interested in this functionality at all, and if so what their capabilities are.
Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)