[nova][cinder] An API to replace volume-update (aka swap-volume)

Matthew Booth mbooth at redhat.com
Tue Apr 23 14:00:40 UTC 2019

There's something I didn't mention below: which interface to use to
make the rpc calls described. We currently use the public-facing
rest api, which is a problem when users call it in ways we didn't
intend. This isn't really an end-user feature, though, so if there
were some other, private rpc mechanism, that would be worth exploring.
Matt Riedemann mentioned os-server-external-events in IRC, which we
use for volume extend. I haven't looked into this yet, but my first
thought is that it sounds like an asynchronous notification
mechanism. If so, we unfortunately wouldn't be able to use it for the
proposal below, as we would need something synchronous. For example,
in the proposal below cinder needs to call start() *and wait for nova
to return success* before it is safe to start the data copy. This
isn't the case for volume extend, where cinder simply notifies nova on
completion and no further coordination is required.

However, if there is any way to use os-server-external-events, or any
other non-public rpc mechanism, it would be good to discuss that.
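To illustrate why an event-style notification is not enough here, a minimal Python sketch follows. `FakeNova`, `send_external_event()`, `migration_start()` and `WritesNotSuspended` are all hypothetical names for illustration, not nova or cinder APIs. The only point is that a queued event gives cinder no success/failure signal before the copy, while a synchronous call does.

```python
import queue


class WritesNotSuspended(Exception):
    """Hypothetical error: nova could not stop writing to src."""


class FakeNova:
    """Toy stand-in for nova; every name here is hypothetical."""

    def __init__(self, can_suspend=True):
        self.events = queue.Queue()
        self.can_suspend = can_suspend
        self.writes_suspended = False

    def send_external_event(self, name, attachment_id):
        # Asynchronous: the event is only queued for later handling, so the
        # caller learns nothing about whether writes to src have stopped.
        self.events.put((name, attachment_id))

    def migration_start(self, attachment_id):
        # Synchronous: return only once writes to src are suspended,
        # otherwise raise so the caller knows it must not copy.
        if not self.can_suspend:
            raise WritesNotSuspended(attachment_id)
        self.writes_suspended = True
        return True


def safe_to_copy(nova, attachment_id):
    """Cinder may start the copy only after a synchronous start succeeded."""
    try:
        return nova.migration_start(attachment_id)
    except WritesNotSuspended:
        return False
```

With this toy model, sending the event leaves `writes_suspended` False, so cinder has nothing to wait on; `safe_to_copy()` returns True only once the synchronous call has succeeded.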


On Tue, 23 Apr 2019 at 14:08, Matthew Booth <mbooth at redhat.com> wrote:
> Consider this a straw-man proposal for the upcoming cross-project
> discussion at PTG. I'm not (currently, at any rate) proposing to do
> this work myself, which is also why I haven't put this in a spec.
> Volume-update is a problematic interface. There are likely many more
> than this, but problems which come to mind are:
> * It shares state between cinder and nova while running. A failure of
> either leaves an inconsistent state which is not easily recoverable.
> * It is fragile: it requires that the instance is running. Stopping
> the instance while the operation is in progress is undefined at best,
> and the operation fails if the instance is not running.
> * It is misused by end users to copy data between different volumes,
> which leaves the instance in a strange, unsupportable state.
> * It is slow.
> * It results in data corruption for multiattached volumes.
> I'd like to propose the following api as a replacement. I'll describe
> the contract up top and put some example implementations below. This
> new flow for volume migration would be driven entirely by cinder, with
> no callbacks from nova, so all of the following are new nova apis.
> * volume-migration-start(src_attachment_id)
> After returning, nova will not write any data to src until the
> migration is completed or cancelled. Nova will return an error if it
> is not able to suspend writes to src.
> Start is a synchronous operation. Nova will not return until it can
> guarantee that there will be no further writes to src.
> * volume-migration-cancel(src_attachment_id)
> Nova will resume writing to src. A subsequent complete call will
> return an error.
> Cancel is an asynchronous operation. Nova assumes it can resume
> writing to src immediately, but does not guarantee when it will finish
> the cancel operation. This has no impact on cinder, but nova instance
> operations may continue to be restricted until the cancel completes.
> * volume-migration-complete(src_attachment_id, dst_attachment_id)
> Nova expects that dst is a copy of src at the time that
> volume-migration-start() was called. Nova will detach from src and
> attach dst in its place. Nova will resume reading and writing to dst
> immediately.
> Nova will not return until it is safe for cinder to remove src, but
> the complete operation is not guaranteed to have finished at this
> time. This has no impact on cinder, but nova instance operations may
> continue to be restricted until it completes asynchronously.
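The contract above can be read as a small per-attachment state machine. The sketch below is one possible reading, using hypothetical names (`AttachmentMigration`, `State`, `MigrationError`). Treating cancel() outside the STARTED state as an error is an assumption here; the proposal doesn't say.

```python
import enum


class MigrationError(Exception):
    """A call that the proposed contract says must return an error."""


class State(enum.Enum):
    IDLE = "idle"            # no migration in progress
    STARTED = "started"      # nova has stopped writing to src
    CANCELLED = "cancelled"  # nova may write to src again
    COMPLETED = "completed"  # nova reads and writes dst


class AttachmentMigration:
    """Hypothetical nova-side tracker for one src attachment."""

    def __init__(self, src_attachment_id, can_suspend_writes=True):
        self.src = src_attachment_id
        self.active = src_attachment_id  # attachment nova does I/O against
        self.can_suspend_writes = can_suspend_writes
        self.state = State.IDLE

    def start(self):
        # Synchronous: once this returns there are no further writes to src.
        if self.state is not State.IDLE:
            raise MigrationError(f"start in state {self.state.value}")
        if not self.can_suspend_writes:
            raise MigrationError("unable to suspend writes to src")
        self.state = State.STARTED

    def cancel(self):
        # Nova may resume writing to src immediately.
        if self.state is not State.STARTED:
            raise MigrationError(f"cancel in state {self.state.value}")
        self.state = State.CANCELLED

    def complete(self, dst_attachment_id):
        # Valid only after start() and before any cancel().
        if self.state is not State.STARTED:
            raise MigrationError(f"complete in state {self.state.value}")
        self.active = dst_attachment_id  # detach src, attach dst in its place
        self.state = State.COMPLETED
```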
> The implementation on the cinder side would be:
> volume-migration-start(src_attachment_id)
> copy(src, dst)
> volume-migration-complete(src_attachment_id, dst_attachment_id)
> Cinder doesn't need to be concerned with whether the instance is
> running or not, and it always does the copy itself. Using
> driver-specific optimisations this has the potential to be very fast,
> but even the fallback implementation of a full copy between 2 arrays
> should be substantially faster than a qemu block job on the compute
> host.
> It is entirely up to cinder to ensure this operation is safe wrt
> multiattach. I expect 'safe' here to mean it will refuse to do the
> operation.
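A slightly fuller sketch of the cinder-side sequence, also covering the multiattach refusal and a cancel path. The proposal defines cancel but the cinder pseudocode doesn't say when it is used; calling it when the copy fails is an assumption. `migrate_volume()`, `MultiattachRefused` and the `nova` client methods are hypothetical, and `volume` is a plain dict here.

```python
class MultiattachRefused(Exception):
    """Raised instead of risking corruption of a multiattached volume."""


def migrate_volume(nova, volume, copy, dst_attachment_id):
    """Hypothetical cinder-side driver for the proposed flow.

    `nova` must provide migration_start/migration_cancel/migration_complete
    taking attachment ids; `copy(src_attachment_id, dst_attachment_id)`
    performs the (possibly driver-optimised) data copy.
    """
    attachments = volume["attachments"]
    # 'Safe' wrt multiattach here means refusing the operation outright.
    if len(attachments) > 1:
        raise MultiattachRefused(volume["id"])
    if not attachments:
        raise ValueError("volume is not attached; nova is not involved")
    src_attachment_id = attachments[0]

    nova.migration_start(src_attachment_id)  # blocks until writes stop
    try:
        copy(src_attachment_id, dst_attachment_id)
    except Exception:
        # Let nova resume writing to src; dst is simply abandoned.
        nova.migration_cancel(src_attachment_id)
        raise
    # dst is now a copy of src as it was when start() returned.
    nova.migration_complete(src_attachment_id, dst_attachment_id)
```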
> There may also be details to be worked out wrt who creates the dst
> attachment. If at all possible, I don't want to expose to nova the
> quirk that under cinder's covers a new volume is created which
> temporarily has the wrong volume id. I'd prefer that, from nova's POV,
> the operation is just a managed switch between 2 attachments to
> apparently the same volume.
> 2 possible implementations on the Nova side:
> 1. Hypervisor does not support live storage migration
> Summary: instance must be shutdown.
> volume-migration-start() returns success iff the instance is shutdown.
> Nova will set a task state such that instance cannot be started until
> cancel or complete.
> volume-migration-complete() reconfigures the specific volume
> attachment and unsets the task state so the instance can be started.
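A sketch of implementation 1, assuming a hypothetical `Instance` record and a hypothetical task state name `volume-migrating`; nova's real instance model and task states are not modelled here.

```python
from dataclasses import dataclass, field
from typing import Optional

MIGRATING = "volume-migrating"  # hypothetical task state name


class MigrationError(Exception):
    """Returned to cinder when the contract cannot be honoured."""


@dataclass
class Instance:
    power_state: str                             # "running" or "shutdown"
    volumes: dict = field(default_factory=dict)  # device -> attachment id
    task_state: Optional[str] = None


def migration_start(instance):
    # Succeed iff the instance is shut down, so nothing can write to src.
    if instance.power_state != "shutdown":
        raise MigrationError("instance must be shut down")
    if instance.task_state is not None:
        raise MigrationError(f"instance busy: {instance.task_state}")
    instance.task_state = MIGRATING  # blocks start until cancel/complete


def instance_start(instance):
    # Hypothetical guard in nova's instance start path.
    if instance.task_state == MIGRATING:
        raise MigrationError("cannot start during a volume migration")
    instance.power_state = "running"


def migration_cancel(instance):
    # Nothing was changed, so just allow the instance to start again.
    instance.task_state = None


def migration_complete(instance, device, dst_attachment_id):
    # Reconfigure only this attachment, then allow the instance to start.
    instance.volumes[device] = dst_attachment_id
    instance.task_state = None
```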
> 2. libvirt driver
> Summary: libvirt allows writes to continue during the copy by sending
> them to a local snapshot, which it commits to dst on complete().
> volume-migration-start() sets task state such that instance cannot be
> started or stopped until cancel or completion.
> If instance is not running, does nothing else.
> If instance is running, creates local writeable qcow2 backed by
> read-only src volume. Swaps volume to local qcow2.
> volume-migration-cancel() does nothing except unset the task state if
> the instance is not running. If the instance is running, it returns
> success to cinder immediately, then commits the qcow2 to the src
> volume and unsets the task state.
> volume-migration-complete() updates instance to dst if it's not
> running, unsets task state and returns immediately. If instance is
> running, rebases local qcow2 from src to dst (cinder guarantees us
> that these are identical), then returns success to cinder because src
> can now be released. Nova starts a job to commit qcow2 to dst. On
> completion it updates instance to use dst and unsets task state.
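The data flow in the libvirt implementation can be modelled with a toy copy-on-write overlay, where images are dicts mapping block number to bytes. This only models why the rebase is safe; it is not libvirt's API. A running guest would presumably need libvirt block jobs, and `rebase_unsafe()` merely mirrors what `qemu-img rebase -u` does (change the backing pointer without copying data). `Overlay` is a hypothetical name.

```python
class Overlay:
    """Toy copy-on-write overlay standing in for the local qcow2."""

    def __init__(self, backing):
        self.backing = backing  # read-only while the overlay is in use
        self.writes = {}        # guest writes made since start()

    def read(self, block):
        # Blocks written since start() come from the overlay, others from backing.
        if block in self.writes:
            return self.writes[block]
        return self.backing.get(block)

    def write(self, block, data):
        self.writes[block] = data  # never touches the backing image

    def rebase_unsafe(self, new_backing):
        # Switch the backing pointer without copying any data. This is only
        # correct because cinder guarantees dst is identical to src; the
        # check is here purely for illustration.
        assert new_backing == self.backing
        self.backing = new_backing

    def commit(self):
        # Merge guest writes into the current backing image and return it.
        self.backing.update(self.writes)
        self.writes = {}
        return self.backing


if __name__ == "__main__":
    src = {0: b"boot", 1: b"old"}
    overlay = Overlay(src)      # start(): guest writes now go to the overlay
    overlay.write(1, b"new")
    dst = dict(src)             # cinder copies the quiescent src
    overlay.rebase_unsafe(dst)  # complete(): src may now be released
    print(overlay.commit())     # commit job writes guest changes into dst
    print(src)                  # src itself is never modified
```

The cancel path is the same model without the rebase: committing the overlay writes the guest's changes back into src.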
> This API does not handle the case where the hypervisor is only capable
> of live storage migration where it manages the entire copy itself. If
> we had a requirement to support this we might change
> volume-migration-start() to return something indicating that nova will
> do the copy, or perhaps indicating that cinder should try again with a
> different entry point. However, I believe this approach will always be
> slower than a cinder-native copy, and I can't see any way around it
> requiring a callback to cinder on completion of the copy. I'd very
> much like some feedback from non-libvirt hypervisor folks around
> whether they're interested in this functionality at all, and if so
> what their capabilities are.
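If the hypervisor-driven case were ever needed, the change to start() might look something like the following. `StartResult` and `drive_migration()` are purely hypothetical, and the callables stand in for cinder and nova operations that don't exist yet.

```python
import enum


class StartResult(enum.Enum):
    """Hypothetical return values for an extended migration-start call."""
    CINDER_COPIES = "cinder-copies"  # writes suspended, cinder copies now
    NOVA_COPIES = "nova-copies"      # hypervisor copies, then calls back
    USE_OTHER_ENTRY_POINT = "retry"  # cinder should use a different API


def drive_migration(result, copy, complete, wait_for_callback, fallback):
    # Dispatch on what nova reported it can do; all callables are hypothetical.
    if result is StartResult.CINDER_COPIES:
        copy()
        complete()
    elif result is StartResult.NOVA_COPIES:
        # This is the case that would require a callback to cinder.
        wait_for_callback()
    elif result is StartResult.USE_OTHER_ENTRY_POINT:
        fallback()
    else:
        raise ValueError(result)
    return result
```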
> Matt
> --
> Matthew Booth
> Red Hat OpenStack Engineer, Compute DFG
> Phone: +442070094448 (UK)

