Open Stack

Thu May 16 20:33:39 UTC 2019

At the Train PTG people said it would be useful to have a review guide 
in the ML like what gibi did for the bandwidth provider series in Stein, 
so this is my attempt.

First let me start by saying if I were you, I'd read the spec [1] and 
watch the presentation from the summit [2] since those should give you 
the high-level overview of what's going on.

**Notes**

* The current bottom of the series starts here [3]. That bottom change 
is actually a bug fix for a long-standing latent race issue in the 
resource tracker and a functional recreate test patch has already been 
merged. I ran into this during my functional testing later in the series 
when auditing resource tracker stuff.

* I'm pushing as much non-cross-cell related stuff to the bottom of the 
series as possible, like refactoring utility methods and test fixture 
enhancements, and some bug fixes. For anything that would otherwise be 
unused outside of the series I don't move to the bottom.

* The rest of the series works from the bottom up in that nothing "turns 
it all on" externally in the API until the end of the series. This means 
we can land code as we go with little risk of impacting existing 
behavior. Even then the new policy rule that enables this is disabled 
for all users by default.

* The instance task_state flow, instance action records and 
notifications should be consistent to normal resize throughout to make 
sure it is transparent to the end user.

* We do hard deletes of database records on rollback, confirm and 
revert. We have a copy of the instance in each DB while resized, but 
need to hard delete the one we aren't going to keep at our terminal 
state (otherwise you can't try to resize the server back to the source 
cell without archiving that DB).

* Rather than have the computes RPC cast back and forth to each other, 
conductor will be orchestrating the compute service work using 
synchronous RPC calls with the long_rpc_timeout option. This is because 
we assume the computes can't communicate with each other over RPC, SSH 
or shared storage.

**Testing**

* About halfway through the series [4], once the code is plumbed to get 
the server to VERIFY_RESIZE state, I start a series of functional tests 
because for a lot of this I'm trying to rely on functional rather than 
unit tests. I do have unit tests in the majority of the patches but not 
all of them toward the end because frankly I don't want to spend a lot 
of time writing unit tests before the code is reviewed - the functional 
tests are much better at flushing out issues.

* At the very end of the series I have a patch [5] which enables 
cross-cell resize and cold migration using the new nova-multi-cell job. 
I'm easing into this by only enabling a few tests at a time, flushing 
issues, and iterating that way since it makes debugging one failing test 
rather than 10 much easier (to trace the logs and such). So far I've 
gotten resize + confirm, cold migrate + confirm, and volume-backed 
resize + confirm passing in that job. I'm currently working on a cold 
migrate + revert test.

**The main idea**

We want to keep the new cross-cell resize flow as similar to traditional 
resize as possible, so you'll see a similar flow of:

a) prep for resize on dest host

This is just a resource claim and to create the migration context in the 
target cell DB. But otherwise it's very similar to prep_resize.

b) resize from source host

Power off the guest, disconnect volumes and interfaces from the source 
host, and do disk stuff (in this case create a temporary snapshot of the 
root disk for a non-volume-backed server).

c) finish on the dest host

Connect volumes, plug VIFs, and spawn a guest (using the snapshot 
created for a non-volume-backed server).

Super easy right?!

**Conductor**

The big change with cross-cell resize is that (super)conductor is going 
to be doing the heavy lifting of orchestrating the resize across the two 
cells. For this I've created a main task (CrossCellMigrationTask) which 
then drives the sequential execution of sub-tasks, which for the most 
part mirror the steps above for the operations in the compute services.

Also note that each task has a rollback method and as such if any task 
fails, we run the rollback of all the other tasks back to the beginning 
which means each task can just care about cleaning up anything it needs 
to rather than a giant single try/except in the main task.

The first thing we have to do is copy the instance and its related 
records to the target cell database (TargetDBSetupTask). The copy of the 
instance created in the target cell DB will be marked with a new 
"hidden" field so that when listing servers in the API we'll filter it 
out. For the DB copy stuff there needed to be the DB schema change for 
that hidden column and then just some new versioned object DB API type 
methods to aid with the data transfer.

After that we run the sub-tasks that mirror the compute service 
operations (prep on dest, resize from source, finish on dest). In 
between the compute services doing stuff we sometimes need to copy data 
around, e.g. after PrepResizeAtDestTask we need to copy the migration 
context from the target cell DB to the source cell DB so the API can 
handle the network-vif-plugged even when the guest is spawned on the 
dest host.

At the very end once we've finished on the dest host, we update the 
instance mapping in the API DB to point at the target cell and swap the 
hidden values on the source/target cell instance so when listing servers 
we'll be showing the target cell instance in VERIFY_RESIZE state.

**Scheduler**

All move operations are currently restricted to the same cell (there is 
code in conductor that does this). If the request passes the new policy 
check, the RequestSpec is modified so the scheduler will not restrict to 
just hosts in the source cell.

There is a new CrossCellWeigher added which by default will prefer hosts 
in the same cell if there is a choice, but there might not be a choice, 
e.g. an admin could target a host in another cell for a cold migration. 
Also, the weigher can be configured to prefer to move instances to other 
cells or be overridden on a per-aggregate basis. I talk about this in 
the video a bit, but the idea is leave the default to prefer hosts in 
the source cell but override via aggregates to drain instances out of 
old cells using the weigher.

Also, a reminder that when the scheduler picks a primary host, any 
alternates are in the same cell. This isn't new but is good to remember.

**Operation flow**

1. The API will check the new policy rule and if the request passes the 
policy check, the RequestSpec is modified to say cross-cell migration is 
OK (conductor and scheduler will look for this) and RPC cast to 
conductor (rather than RPC call like it does today).

2. The MigrationTask in conductor will RPC call the scheduler's 
select_destinations method as normal.

a) If the selected target host is in another cell, kick off the 
CrossCellMigrationTask.

b) If the selected target host is in the same cell as the instance, we 
do traditional resize as normal.

3. CrossCellMigrationTask will check that neutron and cinder are new 
enough for port bindings and volume attachment resources and then 
execute TargetDBSetupTask.

4. TargetDBSetupTask does the DB copy stuff mentioned above.

5. CrossCellMigrationTask then executes PrepResizeAtDestTask which will 
create (inactive) port bindings for the target host and empty volume 
attachments, then RPC call the dest compute to do the resize claim (this 
is needed for instances with PCI devices and NUMA topology).

a) If all of that works, cool, continue.

b) If any of that fails, cleanup and try an alternate host (note the 
alternate host processing is not implemented yet but shouldn't be hard).

6. Next PrepResizeAtSourceTask gets executed which creates a snapshot 
image for a non-volume-backed server, powers off the guest (but does not 
destroy the root disks), disconnects volumes and interfaces. At this 
step on the source compute we also activate the inactive dest host port 
bindings created earlier.

7. Next FinishResizeAtDestTask calls the dest compute to setup 
networking on the dest host (most just for PCI stuff), connect volumes, 
and spawn the guest (from root volume or temp snapshot). Once that is 
done the task will swap the hidden value on the instances and update the 
instance mapping in the API DB. At this point the instance is in 
VERIFY_RESIZE status.

**Confirm/Revert**

Confirm and revert are implemented similarly with new conductor tasks. 
Confirm is pretty simple in that it just cleans up the guest from the 
source compute and deletes the records from the source cell DB.

Revert is a bit more work since we have to cleanup the dest compute 
host, adjust BDMs in the source cell DB if any volumes were attached to 
or detached from the instance while it was in VERIFY_RESIZE state (yes 
the API allows that), spawn the guest in the source compute and then 
twiddle the DB bits so the instance mapping is pointing at the source 
cell and the source instance is no longer hidden, and then cleanup the 
target cell DB.

**Patch series structure**

As noted I'm writing this from the bottom up, and even within 
sub-operations I'm following a pattern of (1) write the compute methods 
first and then (2) write the conductor task that will use them and 
integrate it into the main CrossCellMigrationTask. This means there will 
be multiple compute RPC API version bumps as new methods get added.

The scheduler stuff and some API things happen before the functional 
testing starts so I can actually start functional testing with stubbing 
out as little as possible (I have to stub out the policy check in the 
functional test long before it's enabled in the API at the end of the 
series).

The CrossCellWeigher comes late in the series because by default all 
in-tree weighers are enabled so I don't want to land that and have it 
running if there is no chance it will be used yet. It would be a no-op 
but still not much point in landing that early.

Also towards the end of the series is the API code to handle routing 
external events to multiple cells [6]. That's actually pretty 
lightweight and we could move it up to merge earlier, it's just not 
really something that benefits the in-tree functional tests (but is 
required to make the tempest tests work).

**Hairy things**

* The patch that adds the Instance.hidden field [7] is a bit complicated 
just because we have to account for it in a few places (not only listing 
instances but counting quotas). I have functional testing for all of 
that though to make people feel better.

* The DB copy stuff in the TargetDBSetupTask is not really complicated 
but there are a lot of instance-related records to account for so I've 
tried to do my best there and written a pretty extensive set of tests 
for it. There are things we don't copy over like pci_devices and 
services since those are only relevant to other resources, like the 
compute node, in the source cell DB. Also, I messed around with doing 
the inserts in a single transaction like was done here [8] but couldn't 
really get it to work so I just left a TODO for now (maybe melwitt can 
figure it out). The single transaction insert is mostly low priority for 
me since if anything in there fails we hard destroy the target cell 
instance which will cascade delete any related records in the target 
cell DB.

* The functional tests themselves can be a little daunting at first but 
I've tried to fully document those to make it clear what is being setup, 
what is happening, and what is being asserted. There are still TODOs for 
more testing to be done as I get more review (and time).

* I have not written a docs patch yet since it could change as the 
series starts getting reviewed, and there is likely a lot to document 
(the spec is big, this email is big). When I do I will likely try to add 
a sequence diagram as well (and write one for normal resize since we 
don't have one of those in our docs today - we can do that whenever).

If you have questions please reply to this email, ping me on IRC, or 
just ask in the code review - if you find I'm not replying in the review 
and rebasing over your comment, drop a -1 to make me notice (I'm 
currently up to 43 open changes so noticing random comments in a change 
in the middle is not obvious without a -1).

If you made it this far, thanks, and remember that you asked for this 
review guide. :)

[1] 
https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cross-cell-resize.html
[2] https://youtu.be/OyNFIOSGjac?t=143
[3] https://review.opendev.org/#/c/641806/
[4] https://review.opendev.org/#/c/636253/
[5] https://review.opendev.org/#/c/656656/
[6] https://review.opendev.org/#/c/658478/
[7] https://review.opendev.org/#/c/631123/
[8] https://review.opendev.org/#/c/586742/

-- 

Thanks,

Matt

Open Stack

[nova] cross-cell resize review guide

OpenStack

Community

Documentation

Branding & Legal