[nova] cross-cell resize review guide
mriedemos at gmail.com
Thu May 16 20:33:39 UTC 2019
At the Train PTG people said it would be useful to have a review guide
in the ML like what gibi did for the bandwidth provider series in Stein,
so this is my attempt.
First let me start by saying if I were you, I'd read the spec  and
watch the presentation from the summit  since those should give you
the high-level overview of what's going on.
* The current bottom of the series starts here . That bottom change
is actually a bug fix for a long-standing latent race issue in the
resource tracker and a functional recreate test patch has already been
merged. I ran into this during my functional testing later in the series
when auditing resource tracker stuff.
* I'm pushing as much non-cross-cell related stuff to the bottom of the
series as possible, like refactoring utility methods and test fixture
enhancements, and some bug fixes. For anything that would otherwise be
unused outside of the series I don't move to the bottom.
* The rest of the series works from the bottom up in that nothing "turns
it all on" externally in the API until the end of the series. This means
we can land code as we go with little risk of impacting existing
behavior. Even then the new policy rule that enables this is disabled
for all users by default.
* The instance task_state flow, instance action records and
notifications should be consistent to normal resize throughout to make
sure it is transparent to the end user.
* We do hard deletes of database records on rollback, confirm and
revert. We have a copy of the instance in each DB while resized, but
need to hard delete the one we aren't going to keep at our terminal
state (otherwise you can't try to resize the server back to the source
cell without archiving that DB).
* Rather than have the computes RPC cast back and forth to each other,
conductor will be orchestrating the compute service work using
synchronous RPC calls with the long_rpc_timeout option. This is because
we assume the computes can't communicate with each other over RPC, SSH
or shared storage.
* About halfway through the series , once the code is plumbed to get
the server to VERIFY_RESIZE state, I start a series of functional tests
because for a lot of this I'm trying to rely on functional rather than
unit tests. I do have unit tests in the majority of the patches but not
all of them toward the end because frankly I don't want to spend a lot
of time writing unit tests before the code is reviewed - the functional
tests are much better at flushing out issues.
* At the very end of the series I have a patch  which enables
cross-cell resize and cold migration using the new nova-multi-cell job.
I'm easing into this by only enabling a few tests at a time, flushing
issues, and iterating that way since it makes debugging one failing test
rather than 10 much easier (to trace the logs and such). So far I've
gotten resize + confirm, cold migrate + confirm, and volume-backed
resize + confirm passing in that job. I'm currently working on a cold
migrate + revert test.
**The main idea**
We want to keep the new cross-cell resize flow as similar to traditional
resize as possible, so you'll see a similar flow of:
a) prep for resize on dest host
This is just a resource claim and to create the migration context in the
target cell DB. But otherwise it's very similar to prep_resize.
b) resize from source host
Power off the guest, disconnect volumes and interfaces from the source
host, and do disk stuff (in this case create a temporary snapshot of the
root disk for a non-volume-backed server).
c) finish on the dest host
Connect volumes, plug VIFs, and spawn a guest (using the snapshot
created for a non-volume-backed server).
Super easy right?!
The big change with cross-cell resize is that (super)conductor is going
to be doing the heavy lifting of orchestrating the resize across the two
cells. For this I've created a main task (CrossCellMigrationTask) which
then drives the sequential execution of sub-tasks, which for the most
part mirror the steps above for the operations in the compute services.
Also note that each task has a rollback method and as such if any task
fails, we run the rollback of all the other tasks back to the beginning
which means each task can just care about cleaning up anything it needs
to rather than a giant single try/except in the main task.
The first thing we have to do is copy the instance and its related
records to the target cell database (TargetDBSetupTask). The copy of the
instance created in the target cell DB will be marked with a new
"hidden" field so that when listing servers in the API we'll filter it
out. For the DB copy stuff there needed to be the DB schema change for
that hidden column and then just some new versioned object DB API type
methods to aid with the data transfer.
After that we run the sub-tasks that mirror the compute service
operations (prep on dest, resize from source, finish on dest). In
between the compute services doing stuff we sometimes need to copy data
around, e.g. after PrepResizeAtDestTask we need to copy the migration
context from the target cell DB to the source cell DB so the API can
handle the network-vif-plugged even when the guest is spawned on the
At the very end once we've finished on the dest host, we update the
instance mapping in the API DB to point at the target cell and swap the
hidden values on the source/target cell instance so when listing servers
we'll be showing the target cell instance in VERIFY_RESIZE state.
All move operations are currently restricted to the same cell (there is
code in conductor that does this). If the request passes the new policy
check, the RequestSpec is modified so the scheduler will not restrict to
just hosts in the source cell.
There is a new CrossCellWeigher added which by default will prefer hosts
in the same cell if there is a choice, but there might not be a choice,
e.g. an admin could target a host in another cell for a cold migration.
Also, the weigher can be configured to prefer to move instances to other
cells or be overridden on a per-aggregate basis. I talk about this in
the video a bit, but the idea is leave the default to prefer hosts in
the source cell but override via aggregates to drain instances out of
old cells using the weigher.
Also, a reminder that when the scheduler picks a primary host, any
alternates are in the same cell. This isn't new but is good to remember.
1. The API will check the new policy rule and if the request passes the
policy check, the RequestSpec is modified to say cross-cell migration is
OK (conductor and scheduler will look for this) and RPC cast to
conductor (rather than RPC call like it does today).
2. The MigrationTask in conductor will RPC call the scheduler's
select_destinations method as normal.
a) If the selected target host is in another cell, kick off the
b) If the selected target host is in the same cell as the instance, we
do traditional resize as normal.
3. CrossCellMigrationTask will check that neutron and cinder are new
enough for port bindings and volume attachment resources and then
4. TargetDBSetupTask does the DB copy stuff mentioned above.
5. CrossCellMigrationTask then executes PrepResizeAtDestTask which will
create (inactive) port bindings for the target host and empty volume
attachments, then RPC call the dest compute to do the resize claim (this
is needed for instances with PCI devices and NUMA topology).
a) If all of that works, cool, continue.
b) If any of that fails, cleanup and try an alternate host (note the
alternate host processing is not implemented yet but shouldn't be hard).
6. Next PrepResizeAtSourceTask gets executed which creates a snapshot
image for a non-volume-backed server, powers off the guest (but does not
destroy the root disks), disconnects volumes and interfaces. At this
step on the source compute we also activate the inactive dest host port
bindings created earlier.
7. Next FinishResizeAtDestTask calls the dest compute to setup
networking on the dest host (most just for PCI stuff), connect volumes,
and spawn the guest (from root volume or temp snapshot). Once that is
done the task will swap the hidden value on the instances and update the
instance mapping in the API DB. At this point the instance is in
Confirm and revert are implemented similarly with new conductor tasks.
Confirm is pretty simple in that it just cleans up the guest from the
source compute and deletes the records from the source cell DB.
Revert is a bit more work since we have to cleanup the dest compute
host, adjust BDMs in the source cell DB if any volumes were attached to
or detached from the instance while it was in VERIFY_RESIZE state (yes
the API allows that), spawn the guest in the source compute and then
twiddle the DB bits so the instance mapping is pointing at the source
cell and the source instance is no longer hidden, and then cleanup the
target cell DB.
**Patch series structure**
As noted I'm writing this from the bottom up, and even within
sub-operations I'm following a pattern of (1) write the compute methods
first and then (2) write the conductor task that will use them and
integrate it into the main CrossCellMigrationTask. This means there will
be multiple compute RPC API version bumps as new methods get added.
The scheduler stuff and some API things happen before the functional
testing starts so I can actually start functional testing with stubbing
out as little as possible (I have to stub out the policy check in the
functional test long before it's enabled in the API at the end of the
The CrossCellWeigher comes late in the series because by default all
in-tree weighers are enabled so I don't want to land that and have it
running if there is no chance it will be used yet. It would be a no-op
but still not much point in landing that early.
Also towards the end of the series is the API code to handle routing
external events to multiple cells . That's actually pretty
lightweight and we could move it up to merge earlier, it's just not
really something that benefits the in-tree functional tests (but is
required to make the tempest tests work).
* The patch that adds the Instance.hidden field  is a bit complicated
just because we have to account for it in a few places (not only listing
instances but counting quotas). I have functional testing for all of
that though to make people feel better.
* The DB copy stuff in the TargetDBSetupTask is not really complicated
but there are a lot of instance-related records to account for so I've
tried to do my best there and written a pretty extensive set of tests
for it. There are things we don't copy over like pci_devices and
services since those are only relevant to other resources, like the
compute node, in the source cell DB. Also, I messed around with doing
the inserts in a single transaction like was done here  but couldn't
really get it to work so I just left a TODO for now (maybe melwitt can
figure it out). The single transaction insert is mostly low priority for
me since if anything in there fails we hard destroy the target cell
instance which will cascade delete any related records in the target
* The functional tests themselves can be a little daunting at first but
I've tried to fully document those to make it clear what is being setup,
what is happening, and what is being asserted. There are still TODOs for
more testing to be done as I get more review (and time).
* I have not written a docs patch yet since it could change as the
series starts getting reviewed, and there is likely a lot to document
(the spec is big, this email is big). When I do I will likely try to add
a sequence diagram as well (and write one for normal resize since we
don't have one of those in our docs today - we can do that whenever).
If you have questions please reply to this email, ping me on IRC, or
just ask in the code review - if you find I'm not replying in the review
and rebasing over your comment, drop a -1 to make me notice (I'm
currently up to 43 open changes so noticing random comments in a change
in the middle is not obvious without a -1).
If you made it this far, thanks, and remember that you asked for this
review guide. :)
More information about the openstack-discuss