[nova][cinder] An API to replace volume-update (aka swap-volume)
Consider this a straw-man proposal for the upcoming cross-project discussion at PTG. I'm not (currently, at any rate) proposing to do this work myself, which is also why I haven't put this in a spec.

Volume-update is a problematic interface. There are likely more problems than these, but the ones which come to mind are:

* It shares state between cinder and nova while running. A failure of either leaves an inconsistent state which is not easily recoverable.
* It is fragile: it requires that the instance is running. Stopping an instance while it is in progress is undefined at best. It will fail if an instance is not running.
* It is mis-used by end-users to copy data between different volumes, which results in a strange, unsupportable instance state.
* It is slow.
* It results in data corruption for multiattached volumes.

I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.

* volume-migration-start(src_attachment_id)

After returning, nova will not write any data to src until the operation completes or is cancelled. Nova will return an error if it is not able to suspend writes to src.

Start is a synchronous operation. Nova will not return until it can guarantee that there will be no further writes to src.

* volume-migration-cancel(src_attachment_id)

Nova will resume writing to src. A subsequent complete call will return an error.

Cancel is an asynchronous operation. Nova assumes it can resume writing to src immediately, but does not guarantee when it will finish the cancel operation. This has no impact on cinder, but nova instance operations may continue to be restricted until the cancel completes.

* volume-migration-complete(src_attachment_id, dst_attachment_id)

Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.

Nova will not return until it is safe for cinder to remove src, but the complete operation is not guaranteed to have finished at this time. This has no impact on cinder, but nova instance operations may continue to be restricted until it completes asynchronously.

The implementation on the cinder side would be:

volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)

Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself. Using driver-specific optimisations this has the potential to be very fast, but even the fallback implementation of a full copy between 2 arrays should be substantially faster than a qemu block job on the compute host.

It is entirely up to cinder to ensure this operation is safe wrt multiattach. I expect 'safe' here to mean it will refuse to do the operation.

There may also be details to be worked out wrt who creates the dst attachment. If at all possible, I don't want to expose to nova the quirk that under cinder's covers a new volume is created which temporarily has the wrong volume id. I'd prefer that, from nova's POV, the operation is just a managed switch between 2 attachments to apparently the same volume.

2 possible implementations on the Nova side:

1. Hypervisor does not support live storage migration

Summary: instance must be shut down.

volume-migration-start() returns success iff the instance is shut down. Nova will set a task state such that the instance cannot be started until cancel or complete.

volume-migration-complete() reconfigures the specific volume attachment and unsets the task state so the instance can be started.

2. libvirt driver

Summary: libvirt allows writes to continue during the copy by sending them to a local snapshot, which it commits to dst on complete().

volume-migration-start() sets a task state such that the instance cannot be started or stopped until cancel or completion. If the instance is not running, it does nothing else. If the instance is running, it creates a local writable qcow2 overlay backed by the read-only src volume, and swaps the instance's volume to the local qcow2.

volume-migration-cancel() does nothing if the instance is not running except unset the task state. If the instance is running, it returns success to cinder immediately, then commits the qcow2 to the src volume and unsets the task state.

volume-migration-complete(), if the instance is not running, updates the instance to use dst, unsets the task state and returns immediately. If the instance is running, it rebases the local qcow2 from src to dst (cinder guarantees us that these are identical), then returns success to cinder because src can now be released. Nova starts a job to commit the qcow2 to dst. On completion it updates the instance to use dst and unsets the task state.

This API does not handle the case where the hypervisor is only capable of live storage migration where it manages the entire copy itself. If we had a requirement to support this we might change volume-migration-start() to return something indicating that nova will do the copy, or perhaps indicating that cinder should try again with a different entry point. However, I believe this approach will always be slower than a cinder-native copy, and I can't see any way around it requiring a callback to cinder on completion of the copy. I'd very much like some feedback from non-libvirt hypervisor folks around whether they're interested in this functionality at all, and if so what their capabilities are.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.
Overall I really like this proposal from a high level. Some additional data points below that folks need to keep in mind as we think through how this could work.
Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.
You actually mean volume-migration-complete, right?
The implementation on the cinder side would be:
volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)
Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself.
This could be problematic.

This works great if the migration is from one volume to another volume on the same backend. In that case it could be very fast (for some backends almost instantaneous).

If the migration needs to take place between two different backends, then we could have a problem. In that case, both the source and destination volumes would need to be mounted on some host to perform basically a dd from one to the other. The Nova host already has access to the volume, so doing it there avoids the need to set up any other access to the volumes. Otherwise we need to mount both volumes to the c-vol host to do the copy, putting Cinder in the data path.

We have some existing gotchas with things like image copy for this, which is why we've considered creating some kind of c-data data mover service that could be scaled out to multiple nodes and support different protocols. The biggest issue today is ensuring the Cinder node supports the storage transport protocol needed to access the volume - fibre channel HBAs or network connectivity for iSCSI.

Not saying this is a showstopper. I'm just making sure folks are aware of it so we don't start off under any assumptions that having Cinder take care of the migration is going to be a no-brainer.
On Tue, 23 Apr 2019 at 14:55, Sean McGinnis <sean.mcginnis@gmx.com> wrote:
I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.
Overall I really like this proposal from a high level. Some additional data points below that folks need to keep in mind as we think through how this could work.
Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.
You actually mean volume-migration-complete, right?
No. src goes read-only after start(), then cinder copies it to dst. So at the time cinder calls complete(), dst is going to be a copy of src from the time start() was called. cinder doesn't know (or need to know) anything about the overlay nova created locally. All cinder cares about is that nothing writes to src after it calls start(), and it must copy this data to dst.
The implementation on the cinder side would be:
volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)
Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself.
This could be problematic.
This works great if the migration is from one volume to another volume on the same backend. In that case it could be very fast (for some backends almost instantaneous).
If the migration needs to take place between two different backends, then we could have a problem. In that case, both the source and destination volumes would need to be mounted on some host to perform basically a dd from one to the other.
The Nova host already has access to the volume, so doing it there avoids the need to set up any other access to the volumes. Otherwise we need to mount both volumes to the c-vol host to do the copy, putting Cinder in the data path.
We have some existing gotchas with things like image copy for this, which is why we've considered creating some kind of c-data data mover service that could be scaled out to multiple nodes and support different protocols. The biggest issue today is ensuring the Cinder node supports the storage transport protocol needed to access the volume - fibre channel HBAs or network connectivity for iSCSI.
Right, I was assuming there would be a fallback, and it would basically be as slow and ugly as if nova did it. The difference is that if nova does it, it's always going to be slow as we can never take advantage of backend features, and it's always going to be ugly because storage isn't our thing, and we have no plans for an n-data data mover service ;)
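For illustration, that fallback data path is roughly the following sketch: both volumes attached to some host (in cinder's generic migration that would be the c-vol host, via os-brick) and a block-for-block copy between them. The device paths here are placeholders, not real connector output.

import subprocess

def fallback_copy(src_device="/dev/mapper/src-vol",
                  dst_device="/dev/mapper/dst-vol",
                  block_size="4M"):
    # Effectively a dd from one attached block device to the other. The host
    # doing the copy needs iSCSI/FC connectivity to both backends, which is
    # exactly the gotcha described above.
    subprocess.run(
        ["dd", f"if={src_device}", f"of={dst_device}", f"bs={block_size}",
         "oflag=direct", "conv=fsync"],
        check=True,
    )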
Not saying this is a showstopper. I'm just making sure folks are aware of it so we don't start off under any assumptions that having Cinder take care of the migration is going to be a no brainer.
Understood.

Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
There's something I didn't mention below, which is which interface to use to make the rpc calls described. We currently use the public-facing rest api, which is a problem when users call it in ways we didn't intend. This isn't really an end-user feature, though, so if there were some other private rpc mechanism that would be good to explore.

Matt Riedemann mentioned os-server-external-events in IRC, which we use for volume extend. I haven't looked into this yet, but my first thought is that it sounds like an asynchronous notification mechanism. If so we unfortunately wouldn't be able to use it for the proposal below, as we would need something synchronous.

For example, in the proposal below cinder needs to call start() *and wait for nova to return success* before it is safe to start the data copy. This isn't the case for volume extend, where cinder simply notifies nova on completion and no further coordination is required. However, if there is any way to use os-server-external-events, or any other non-public rpc mechanism, it would be good to discuss that.

Matt

On Tue, 23 Apr 2019 at 14:08, Matthew Booth <mbooth@redhat.com> wrote:
Consider this a straw-man proposal for the upcoming cross-project discussion at PTG. I'm not (currently, at any rate) proposing to do this work myself, which is also why I haven't put this in a spec.
Volume-update is a problematic interface. There are likely more problems than these, but the ones which come to mind are:
* It shares state between cinder and nova while running. A failure of either leaves an inconsistent state which is not easily recoverable.
* It is fragile: it requires that the instance is running. Stopping an instance while it is in progress is undefined at best. It will fail if an instance is not running.
* It is mis-used by end-users to copy data between different volumes, which results in a strange, unsupportable instance state.
* It is slow.
* It results in data corruption for multiattached volumes.
I'd like to propose the following api as a replacement. I'll describe the contract up top and put some example implementations below. This new flow for volume migration would be driven entirely by cinder, with no callbacks from nova, so all of the following are new nova apis.
* volume-migration-start(src_attachment_id)
After returning, nova will not write any data to src until the operation completes or is cancelled. Nova will return an error if it is not able to suspend writes to src.
Start is a synchronous operation. Nova will not return until it can guarantee that there will be no further writes to src.
* volume-migration-cancel(src_attachment_id)
Nova will resume writing to src. A subsequent complete call will return an error.
Cancel is an asynchronous operation. Nova assumes it can resume writing to src immediately, but does not guarantee when it will finish the cancel operation. This has no impact on cinder, but nova instance operations may continue to be restricted until the cancel completes.
* volume-migration-complete(src_attachment_id, dst_attachment_id)
Nova expects that dst is a copy of src at the time that volume-migration-start() was called. Nova will detach from src and attach dst in its place. Nova will resume reading and writing to dst immediately.
Nova will not return until it is safe for cinder to remove src, but the complete operation is not guaranteed to have finished at this time. This has no impact on cinder, but nova instance operations may continue to be restricted until it completes asynchronously.
The implementation on the cinder side would be:
volume-migration-start(src_attachment_id)
copy(src, dst)
volume-migration-complete(src, dst)
Cinder doesn't need to be concerned with whether the instance is running or not, and it always does the copy itself. Using driver-specific optimisations this has the potential to be very fast, but even the fallback implementation of a full copy between 2 arrays should be substantially faster than a qemu block job on the compute host.
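As a rough sketch of how that cinder-driven flow could hang together (the method names here are invented for illustration, not an existing interface), the key point is that the copy only starts after start() has returned success, and a failed copy is unwound with cancel():

def migrate_volume(nova, backend, src_attachment_id, dst_attachment_id):
    # Synchronous: nova does not return until writes to src are frozen.
    nova.volume_migration_start(src_attachment_id)
    try:
        # Cinder copies the data itself, ideally via a backend-optimised path.
        backend.copy(src_attachment_id, dst_attachment_id)
    except Exception:
        # Nova resumes writing to src; the migration is abandoned.
        nova.volume_migration_cancel(src_attachment_id)
        raise
    # dst is now a copy of src as of start(); nova switches over to dst and
    # returns as soon as it is safe for cinder to remove src.
    nova.volume_migration_complete(src_attachment_id, dst_attachment_id)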
It is entirely up to cinder to ensure this operation is safe wrt multiattach. I expect 'safe' here to mean it will refuse to do the operation.
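For example, the refusal could be as simple as something like this on the cinder side (hypothetical names, just to illustrate the intent):

def check_migration_safe(volume):
    # If the volume is multiattach and has more than one active attachment,
    # refuse to migrate rather than risk writers on other attachments
    # corrupting the copy.
    attachments = [a for a in volume.attachments
                   if a.attach_status == 'attached']
    if volume.multiattach and len(attachments) > 1:
        raise Exception("Refusing to migrate multiattach volume %s with %d "
                        "active attachments" % (volume.id, len(attachments)))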
There may also be details to be worked out wrt who creates the dst attachment. If at all possible, I don't want to expose to nova the quirk that under cinder's covers a new volume is created which temporarily has the wrong volume id. I'd prefer that, from nova's POV, the operation is just a managed switch between 2 attachments to apparently the same volume.
2 possible implementations on the Nova side:
1. Hypervisor does not support live storage migration

Summary: instance must be shut down.
volume-migration-start() returns success iff the instance is shut down. Nova will set a task state such that the instance cannot be started until cancel or complete.
volume-migration-complete() reconfigures the specific volume attachment and unsets the task state so the instance can be started.
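A minimal sketch of this non-live implementation, with hypothetical nova-side names and task states, just to make the contract concrete:

def volume_migration_start(instance, src_attachment_id):
    if instance.power_state != 'shutdown':
        # Nova cannot guarantee no further writes to src.
        raise Exception("Instance must be shut down for volume migration")
    # Hypothetical task state; blocks starting the instance until
    # cancel() or complete() is called.
    instance.task_state = 'volume_migrating'
    instance.save()

def volume_migration_complete(instance, src_attachment_id, dst_attachment_id):
    # Point the stopped instance's block device mapping at dst instead of src.
    # swap_block_device_mapping is a hypothetical helper.
    swap_block_device_mapping(instance, src_attachment_id, dst_attachment_id)
    instance.task_state = None
    instance.save()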
2. libvirt driver

Summary: libvirt allows writes to continue during the copy by sending them to a local snapshot, which it commits to dst on complete().
volume-migration-start() sets a task state such that the instance cannot be started or stopped until cancel or completion. If the instance is not running, it does nothing else. If the instance is running, it creates a local writable qcow2 overlay backed by the read-only src volume, and swaps the instance's volume to the local qcow2.
volume-migration-cancel() does nothing if the instance is not running except unset the task state. If the instance is running, it returns success to cinder immediately, then commits the qcow2 to the src volume and unsets the task state.
volume-migration-complete(), if the instance is not running, updates the instance to use dst, unsets the task state and returns immediately. If the instance is running, it rebases the local qcow2 from src to dst (cinder guarantees us that these are identical), then returns success to cinder because src can now be released. Nova starts a job to commit the qcow2 to dst. On completion it updates the instance to use dst and unsets the task state.
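To make the image manipulation concrete, the offline-equivalent qemu-img steps would look roughly like the sketch below. Paths are placeholders, and for a running instance the equivalent steps would be driven through libvirt block jobs rather than qemu-img directly; treat this as an illustration of the overlay/rebase/commit idea, not the implementation.

import subprocess

SRC = "/dev/disk/by-path/src-volume"   # placeholder attached src volume
DST = "/dev/disk/by-path/dst-volume"   # placeholder attached dst volume
OVERLAY = "/var/lib/nova/instances/UUID/migration-overlay.qcow2"  # placeholder

def start():
    # Create a local writable qcow2 overlay backed by the (now frozen) src
    # volume; guest writes land in the overlay, src stays untouched for cinder.
    subprocess.run(["qemu-img", "create", "-f", "qcow2",
                    "-b", SRC, "-F", "raw", OVERLAY], check=True)
    # ...then swap the instance's disk to OVERLAY and return to cinder.

def complete():
    # Cinder guarantees dst is a copy of src as of start(), so an unsafe (-u)
    # rebase that only rewrites the backing-file pointer is sufficient.
    subprocess.run(["qemu-img", "rebase", "-u",
                    "-b", DST, "-F", "raw", OVERLAY], check=True)
    # src is no longer referenced, so cinder can be told it is safe to remove
    # it. Asynchronously fold the overlay's writes down into dst, then point
    # the instance directly at dst and delete the overlay.
    subprocess.run(["qemu-img", "commit", OVERLAY], check=True)

def cancel():
    # Fold the overlay's writes back into src and drop the overlay.
    subprocess.run(["qemu-img", "commit", OVERLAY], check=True)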
This API does not handle the case where the hypervisor is only capable of live storage migration where it manages the entire copy itself. If we had a requirement to support this we might change volume-migration-start() to return something indicating that nova will do the copy, or perhaps indicating that cinder should try again with a different entry point. However, I believe this approach will always be slower than a cinder-native copy, and I can't see any way around it requiring a callback to cinder on completion of the copy. I'd very much like some feedback from non-libvirt hypervisor folks around whether they're interested in this functionality at all, and if so what their capabilities are.
Matt
--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)