There's something I didn't mention below: which interface to use to make the rpc calls described. We currently use the public-facing REST API, which is a problem when users call it in ways we didn't intend. This isn't really an end-user feature, though, so if there were some other private rpc mechanism it would be good to explore.

Matt Riedemann mentioned os-server-external-events in IRC, which we use for volume extend. I haven't looked into this yet, but my first thought is that it sounds like an asynchronous notification mechanism. If so we unfortunately wouldn't be able to use it for the proposal below, as we would need something synchronous. For example, in the proposal below cinder needs to call start() *and wait for nova to return success* before it is safe to start the data copy. This isn't the case for volume extend, where cinder simply notifies nova on completion and no further coordination is required.

However, if there is any way to use os-server-external-events, or any other non-public rpc mechanism, it would be good to discuss that.

Matt

On Tue, 23 Apr 2019 at 14:08, Matthew Booth <mbooth@redhat.com> wrote:
Consider this a straw-man proposal for the upcoming cross-project discussion at PTG. I'm not (currently, at any rate) proposing to do this work myself, which is also why I haven't put this in a spec.
Volume-update is a problematic interface. There are likely more problems than these, but the ones which come to mind are:
* It shares state between cinder and nova while running. A failure of either leaves an inconsistent state which is not easily recoverable.
* It is fragile: it requires that the instance is running. Stopping an instance while it is in progress is undefined at best, and it will fail if the instance is not running.
* It is mis-used by end-users to copy data between different volumes, which results in a strange, unsupportable instance state.
* It is slow.
* It results in data corruption for multi-attached volumes.
I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.
* volume-migration-start(src_attachment_id)
After returning, nova will not write any data to src until completion or cancellation. Nova will return an error if it is not able to suspend writes to src.
Start is a synchronous operation. Nova will not return until it can guarantee that there will be no further writes to src.
* volume-migration-cancel(src_attachment_id)
Nova will resume writing to src. A subsequent complete call will return an error.
Cancel is an asynchronous operation: nova may resume writing to src immediately, but does not guarantee when the cancel operation itself will finish. This has no impact on cinder, but nova instance operations may continue to be restricted until the cancel completes.
* volume-migration-complete(src_attachment_id, dst_attachment_id)
Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.
Nova will not return until it is safe for cinder to remove src, but the complete operation is not guaranteed to have finished at this time. This has no impact on cinder, but nova instance operations may continue to be restricted until it completes asynchronously.
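To make the contract concrete, here's a rough sketch of what those three calls might look like as an interface on the nova side. The names and signatures are illustrative only, not a concrete implementation proposal:

    # Illustrative sketch only: names and signatures are hypothetical, not
    # an existing or proposed nova interface.
    import abc


    class VolumeMigrationDriver(abc.ABC):

        @abc.abstractmethod
        def volume_migration_start(self, src_attachment_id):
            """Stop writing to src.

            Synchronous: must not return until no further writes to src are
            possible. Raises if writes to src cannot be suspended.
            """

        @abc.abstractmethod
        def volume_migration_cancel(self, src_attachment_id):
            """Resume writing to src.

            Asynchronous: may return before cleanup has finished. A
            subsequent complete() call must fail.
            """

        @abc.abstractmethod
        def volume_migration_complete(self, src_attachment_id,
                                      dst_attachment_id):
            """Switch the instance from src to dst.

            dst is a copy of src as it was at start(). Returns as soon as
            it is safe for cinder to remove src; cleanup may continue
            asynchronously.
            """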
The implementation on the cinder side would be:
  volume-migration-start(src_attachment_id)
  copy(src, dst)
  volume-migration-complete(src_attachment_id, dst_attachment_id)
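Or, as a hedged sketch of the same flow in Python, where nova_api and copy_data are hypothetical stand-ins for whatever client and copy mechanism cinder would actually use:

    # Hedged sketch only: 'nova_api' and 'copy_data' are hypothetical
    # placeholders, not existing cinder or nova code.
    def migrate_volume(nova_api, copy_data, src_attachment_id,
                       dst_attachment_id):
        # Synchronous: when this returns, nova guarantees no further writes
        # to src, so it is now safe to start copying.
        nova_api.volume_migration_start(src_attachment_id)
        try:
            copy_data()
        except Exception:
            # The copy failed: tell nova to resume writing to src and bail.
            nova_api.volume_migration_cancel(src_attachment_id)
            raise
        # dst is now a copy of src as it was at start(); nova detaches src,
        # attaches dst in its place, and src can be removed once this
        # returns.
        nova_api.volume_migration_complete(src_attachment_id,
                                           dst_attachment_id)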
Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself. Using driver-specific optimisations this has the potential to be very fast, but even the fallback implementation of a full copy between 2 arrays should be substantially faster than a qemu block job on the compute host.
It is entirely up to cinder to ensure this operation is safe wrt multiattach. I expect 'safe' here to mean it will refuse to do the operation.
There may also be details to be worked out wrt who creates the dst attachment. If at all possible, I don't want to expose to nova the quirk that under cinder's covers a new volume is created which temporarily has the wrong volume id. I'd prefer that, from nova's POV, the operation is just a managed switch between 2 attachments to apparently the same volume.
2 possible implementations on the Nova side:
1. Hypervisor does not support live storage migration

Summary: the instance must be shut down.
volume-migration-start() returns success iff the instance is shut down. Nova will set a task state such that the instance cannot be started until cancel or complete.
volume-migration-complete() reconfigures the specific volume attachment and unsets the task state so the instance can be started.
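A minimal sketch of what that might look like, assuming a hypothetical VOLUME_MIGRATING task state and hypothetical instance helpers (none of this is existing nova code):

    # Hedged sketch of implementation 1. VOLUME_MIGRATING, 'power_state' and
    # swap_attachment() are illustrative stand-ins, not existing nova code.
    VOLUME_MIGRATING = 'volume_migrating'


    def volume_migration_start(instance, src_attachment_id):
        if instance.power_state != 'shutdown':
            # We can't guarantee no further writes to src while running.
            raise RuntimeError('instance must be shut down')
        # The instance is off, so there are already no writes to src; just
        # block starting it until cancel or complete.
        instance.task_state = VOLUME_MIGRATING
        instance.save()


    def volume_migration_complete(instance, src_attachment_id,
                                  dst_attachment_id):
        # Point the instance's block device mapping at dst instead of src,
        # then allow it to be started again.
        instance.swap_attachment(src_attachment_id, dst_attachment_id)
        instance.task_state = None
        instance.save()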
2. libvirt driver

Summary: libvirt allows writes to continue during the copy by sending them to a local snapshot, which it commits to dst on complete().
volume-migration-start() sets a task state such that the instance cannot be started or stopped until cancel or completion.

If the instance is not running, it does nothing else.

If the instance is running, it creates a local writeable qcow2 overlay backed by the read-only src volume, and swaps the instance's volume to the local qcow2.
volume-migration-cancel() does nothing except unset the task state if the instance is not running. If the instance is running, it returns success to cinder immediately, then commits the qcow2 back to the src volume and unsets the task state.
volume-migration-complete(): if the instance is not running, it updates the instance to use dst, unsets the task state and returns immediately. If the instance is running, it rebases the local qcow2 from src to dst (cinder guarantees us that these are identical), then returns success to cinder because src can now be released. Nova starts a job to commit the qcow2 into dst. On completion it updates the instance to use dst and unsets the task state.
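For the running-instance case, the libvirt mechanics might look roughly like the sketch below: an external disk-only snapshot to divert writes into a local overlay on start(), and an active block commit with a pivot on complete(). The disk name, paths, and the step which switches the overlay's backing file from src to dst are all hand-waved, so treat this as an illustration of the idea rather than a workable implementation:

    # Very rough sketch of the libvirt mechanics for a running instance.
    # Disk target, paths and error handling are illustrative only.
    import time

    import libvirt

    OVERLAY = '/var/lib/nova/instances/overlay.qcow2'

    SNAPSHOT_XML = """
    <domainsnapshot>
      <disks>
        <disk name='vdb' snapshot='external'>
          <source file='%s'/>
        </disk>
      </disks>
    </domainsnapshot>
    """ % OVERLAY


    def start(dom):
        # Divert writes into a local qcow2 overlay; the src volume becomes
        # a read-only backing file, so cinder can safely copy it.
        dom.snapshotCreateXML(
            SNAPSHOT_XML,
            libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
            libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA)


    def complete(dom, dst_path):
        # Assumes the overlay's backing has already been switched from src
        # to dst (safe only because cinder guarantees they are identical);
        # how that switch happens is hand-waved here. Commit the overlay
        # down into dst, then pivot the domain onto dst.
        dom.blockCommit('vdb', dst_path, OVERLAY, 0,
                        libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
        # Real code would wait for the BLOCK_JOB_READY event rather than
        # polling job progress like this.
        while True:
            info = dom.blockJobInfo('vdb', 0)
            if info and info['cur'] == info['end']:
                break
            time.sleep(0.5)
        dom.blockJobAbort('vdb', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)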
This API does not handle the case where the hypervisor is only capable of live storage migration where it manages the entire copy itself. If we had a requirement to support this we might change volume-migration-start() to return something indicating that nova will do the copy, or perhaps indicating that cinder should try again with a different entry point. However, I believe this approach will always be slower than a cinder-native copy, and I can't see any way around it requiring a callback to cinder on completion of the copy. I'd very much like some feedback from non-libvirt hypervisor folks around whether they're interested in this functionality at all, and if so what their capabilities are.
Matt

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)