[Openstack-operators] [nova][cinder][neutron] Cross-cell cold migration

Dan Smith dms at danplanet.com
Wed Aug 29 16:39:51 UTC 2018


> A release upgrade dance involves coordination of multiple moving
> parts. It's about as similar to this scenario as I can imagine. And
> there's a reason release upgrades are not done entirely within Nova;
> clearly an external upgrade tool or script is needed to orchestrate
> the many steps and components involved in the upgrade process.

I'm lost here, and assume we must be confusing terminology or something.

> The similar dance for cross-cell migration is the coordination that
> needs to happen between Nova, Neutron and Cinder. It's called
> orchestration for a reason and is not what Nova is good at (as we've
> repeatedly seen)

Most other operations in Nova meet this criterion. Boot requires
coordination between Nova, Cinder, and Neutron. As do migrate, start,
stop, evacuate. We might decide that (for now) the volume migration
thing is beyond the line we're willing to cross, and that's cool, but I
think that's an arbitrary limit, not something we should assume is
impossible. Moving instances around *is* what Nova is (supposed to be)
good at.

> The thing that makes *this* particular scenario problematic is that
> cells aren't user-visible things. User-visible things could much more
> easily be orchestrated via external actors, as I still firmly believe
> this kind of thing should be done.

I'm having a hard time reconciling these:

1. Cells aren't user-visible, and shouldn't be (your words and mine).
2. Cross-cell migration should be done by an external service (your
   words).
3. External services work best when things are user-visible (your words).

You say the user-invisible-ness makes orchestrating this externally
difficult and I agree, but...is your argument here just that it
shouldn't be done at all?

>> As we discussed in YVR most recently, it also may become an important
>> thing for operators and users where expensive accelerators are committed
>> to instances with part-time usage patterns.
>
> I don't think that's a valid use case in respect to this scenario of
> cross-cell migration.

You're right, it has nothing to do with cross-cell migration at all. I
was pointing to *other* legitimate use cases for shelve.

> Also, I'd love to hear from anyone in the real world who has
> successfully migrated (live or otherwise) an instance that "owns"
> expensive hardware (accelerators, SR-IOV PFs, GPUs or otherwise).

Again, the accelerator case has nothing to do with migrating across
cells, but merely demonstrates another example of where shelve may be
the thing operators actually desire. Maybe I shouldn't have confused the
discussion by bringing it up.

> The patterns that I have seen are one of the following:
>
> * Applications don't move. They are pets that stay on one or more VMs
> or baremetal nodes and they grow roots.
>
> * Applications are designed to *utilize* the expensive hardware. They
> don't "own" the hardware itself.
>
> In this latter case, the application is properly designed and stores
> its persistent data in a volume and doesn't keep state outside of the
> application volume. In these cases, the process of "migrating" an
> instance simply goes away. You just detach the application persistent
> volume, shut down the instance, start up a new one elsewhere (allowing
> the scheduler to select one that meets the resource constraints in the
> flavor/image), attach the volume again and off you go. No messing
> around with shelving, offloading, migrating, or any of that nonsense
> in Nova.

Jay, you know I sympathize with the fully-ephemeral application case,
right? Can we agree that pets are a thing and that migrations are not
going to be leaving Nova's scope any time soon? If so, I think we can
get back to the real discussion, and if not, I think we probably, er,
can't :)

> We should not pretend that what we're discussing here is anything
> other than hacking orchestration workarounds into Nova to handle
> poorly-designed applications that have grown roots on some hardware
> and think they "own" hardware resources in a Nova deployment.

I have no idea how we got to "own hardware resources" here. The point of
this discussion is to make our instance-moving operations work across
cells. We designed cellsv2 to be invisible and baked into the core of
Nova. We intended for it to not fall into the trap laid by cellsv1,
where the presence of multiple cells meant that a bunch of regular
operations don't work like they would otherwise.

If we're going to discuss removing move operations from Nova, we should
do that in another thread. This one is about making existing operations
work :)

> If that's the case, why are we discussing shelve at all? Just stop the
> instance, copy/migrate the volume data (if needed, again it completely
> depends on the deployment, network topology and block storage
> backend), to a new location (new cell, new AZ, new host agg, does it
> really matter?) and start a new instance, attaching the volume after
> the instance starts or supplying the volume in the boot/create
> command.

Because shelve potentially makes it less dependent on the answers to
those questions and Matt suggested it as a first step to being able to
move things around at all. It means that "copy the data" becomes "talk
to glance" which compute nodes can already do. Requiring compute nodes
across cells to talk to each other (which could be in different
buildings, sites, or security domains) is a whole extra layer of
complexity. I do think we'll go there (via resize/migrate) at some point,
but shelve going through glance for data and through a homeless phase in
Nova does simplify a whole set of things.

> The admin only "owns" the instance because we have no ability to
> transfer ownership of the instance and a cell isn't a user-visible
> thing. An external script that accomplishes this kind of orchestrated
> move from one cell to another could easily update the ownership of
> said instance in the DB.

So step 5 was "do surgery on the database"? :)

> My point is that Nova isn't an orchestrator, and building
> functionality into Nova to do this type of cross-cell migration IMHO
> just will lead to even more unmaintainable code paths that few, if
> any, deployers will ever end up using because they will end up doing
> it externally anyway due to the need to integrate with backend
> inventory management systems and other things.

On the contrary, per the original goal of cellsv2, I want to make the
*existing* code paths in Nova work properly when multiple cells are
present. Just like we had to make boot and list work properly with
multiple cells, I think we need to do the same with migrate, shelve,
etc.

--Dan


