At the Train PTG people said it would be useful to have a review guide in the ML like what gibi did for the bandwidth provider series in Stein, so this is my attempt. First let me start by saying if I were you, I'd read the spec [1] and watch the presentation from the summit [2] since those should give you the high-level overview of what's going on. **Notes** * The current bottom of the series starts here [3]. That bottom change is actually a bug fix for a long-standing latent race issue in the resource tracker and a functional recreate test patch has already been merged. I ran into this during my functional testing later in the series when auditing resource tracker stuff. * I'm pushing as much non-cross-cell related stuff to the bottom of the series as possible, like refactoring utility methods and test fixture enhancements, and some bug fixes. For anything that would otherwise be unused outside of the series I don't move to the bottom. * The rest of the series works from the bottom up in that nothing "turns it all on" externally in the API until the end of the series. This means we can land code as we go with little risk of impacting existing behavior. Even then the new policy rule that enables this is disabled for all users by default. * The instance task_state flow, instance action records and notifications should be consistent to normal resize throughout to make sure it is transparent to the end user. * We do hard deletes of database records on rollback, confirm and revert. We have a copy of the instance in each DB while resized, but need to hard delete the one we aren't going to keep at our terminal state (otherwise you can't try to resize the server back to the source cell without archiving that DB). * Rather than have the computes RPC cast back and forth to each other, conductor will be orchestrating the compute service work using synchronous RPC calls with the long_rpc_timeout option. This is because we assume the computes can't communicate with each other over RPC, SSH or shared storage. **Testing** * About halfway through the series [4], once the code is plumbed to get the server to VERIFY_RESIZE state, I start a series of functional tests because for a lot of this I'm trying to rely on functional rather than unit tests. I do have unit tests in the majority of the patches but not all of them toward the end because frankly I don't want to spend a lot of time writing unit tests before the code is reviewed - the functional tests are much better at flushing out issues. * At the very end of the series I have a patch [5] which enables cross-cell resize and cold migration using the new nova-multi-cell job. I'm easing into this by only enabling a few tests at a time, flushing issues, and iterating that way since it makes debugging one failing test rather than 10 much easier (to trace the logs and such). So far I've gotten resize + confirm, cold migrate + confirm, and volume-backed resize + confirm passing in that job. I'm currently working on a cold migrate + revert test. **The main idea** We want to keep the new cross-cell resize flow as similar to traditional resize as possible, so you'll see a similar flow of: a) prep for resize on dest host This is just a resource claim and to create the migration context in the target cell DB. But otherwise it's very similar to prep_resize. b) resize from source host Power off the guest, disconnect volumes and interfaces from the source host, and do disk stuff (in this case create a temporary snapshot of the root disk for a non-volume-backed server). c) finish on the dest host Connect volumes, plug VIFs, and spawn a guest (using the snapshot created for a non-volume-backed server). Super easy right?! **Conductor** The big change with cross-cell resize is that (super)conductor is going to be doing the heavy lifting of orchestrating the resize across the two cells. For this I've created a main task (CrossCellMigrationTask) which then drives the sequential execution of sub-tasks, which for the most part mirror the steps above for the operations in the compute services. Also note that each task has a rollback method and as such if any task fails, we run the rollback of all the other tasks back to the beginning which means each task can just care about cleaning up anything it needs to rather than a giant single try/except in the main task. The first thing we have to do is copy the instance and its related records to the target cell database (TargetDBSetupTask). The copy of the instance created in the target cell DB will be marked with a new "hidden" field so that when listing servers in the API we'll filter it out. For the DB copy stuff there needed to be the DB schema change for that hidden column and then just some new versioned object DB API type methods to aid with the data transfer. After that we run the sub-tasks that mirror the compute service operations (prep on dest, resize from source, finish on dest). In between the compute services doing stuff we sometimes need to copy data around, e.g. after PrepResizeAtDestTask we need to copy the migration context from the target cell DB to the source cell DB so the API can handle the network-vif-plugged even when the guest is spawned on the dest host. At the very end once we've finished on the dest host, we update the instance mapping in the API DB to point at the target cell and swap the hidden values on the source/target cell instance so when listing servers we'll be showing the target cell instance in VERIFY_RESIZE state. **Scheduler** All move operations are currently restricted to the same cell (there is code in conductor that does this). If the request passes the new policy check, the RequestSpec is modified so the scheduler will not restrict to just hosts in the source cell. There is a new CrossCellWeigher added which by default will prefer hosts in the same cell if there is a choice, but there might not be a choice, e.g. an admin could target a host in another cell for a cold migration. Also, the weigher can be configured to prefer to move instances to other cells or be overridden on a per-aggregate basis. I talk about this in the video a bit, but the idea is leave the default to prefer hosts in the source cell but override via aggregates to drain instances out of old cells using the weigher. Also, a reminder that when the scheduler picks a primary host, any alternates are in the same cell. This isn't new but is good to remember. **Operation flow** 1. The API will check the new policy rule and if the request passes the policy check, the RequestSpec is modified to say cross-cell migration is OK (conductor and scheduler will look for this) and RPC cast to conductor (rather than RPC call like it does today). 2. The MigrationTask in conductor will RPC call the scheduler's select_destinations method as normal. a) If the selected target host is in another cell, kick off the CrossCellMigrationTask. b) If the selected target host is in the same cell as the instance, we do traditional resize as normal. 3. CrossCellMigrationTask will check that neutron and cinder are new enough for port bindings and volume attachment resources and then execute TargetDBSetupTask. 4. TargetDBSetupTask does the DB copy stuff mentioned above. 5. CrossCellMigrationTask then executes PrepResizeAtDestTask which will create (inactive) port bindings for the target host and empty volume attachments, then RPC call the dest compute to do the resize claim (this is needed for instances with PCI devices and NUMA topology). a) If all of that works, cool, continue. b) If any of that fails, cleanup and try an alternate host (note the alternate host processing is not implemented yet but shouldn't be hard). 6. Next PrepResizeAtSourceTask gets executed which creates a snapshot image for a non-volume-backed server, powers off the guest (but does not destroy the root disks), disconnects volumes and interfaces. At this step on the source compute we also activate the inactive dest host port bindings created earlier. 7. Next FinishResizeAtDestTask calls the dest compute to setup networking on the dest host (most just for PCI stuff), connect volumes, and spawn the guest (from root volume or temp snapshot). Once that is done the task will swap the hidden value on the instances and update the instance mapping in the API DB. At this point the instance is in VERIFY_RESIZE status. **Confirm/Revert** Confirm and revert are implemented similarly with new conductor tasks. Confirm is pretty simple in that it just cleans up the guest from the source compute and deletes the records from the source cell DB. Revert is a bit more work since we have to cleanup the dest compute host, adjust BDMs in the source cell DB if any volumes were attached to or detached from the instance while it was in VERIFY_RESIZE state (yes the API allows that), spawn the guest in the source compute and then twiddle the DB bits so the instance mapping is pointing at the source cell and the source instance is no longer hidden, and then cleanup the target cell DB. **Patch series structure** As noted I'm writing this from the bottom up, and even within sub-operations I'm following a pattern of (1) write the compute methods first and then (2) write the conductor task that will use them and integrate it into the main CrossCellMigrationTask. This means there will be multiple compute RPC API version bumps as new methods get added. The scheduler stuff and some API things happen before the functional testing starts so I can actually start functional testing with stubbing out as little as possible (I have to stub out the policy check in the functional test long before it's enabled in the API at the end of the series). The CrossCellWeigher comes late in the series because by default all in-tree weighers are enabled so I don't want to land that and have it running if there is no chance it will be used yet. It would be a no-op but still not much point in landing that early. Also towards the end of the series is the API code to handle routing external events to multiple cells [6]. That's actually pretty lightweight and we could move it up to merge earlier, it's just not really something that benefits the in-tree functional tests (but is required to make the tempest tests work). **Hairy things** * The patch that adds the Instance.hidden field [7] is a bit complicated just because we have to account for it in a few places (not only listing instances but counting quotas). I have functional testing for all of that though to make people feel better. * The DB copy stuff in the TargetDBSetupTask is not really complicated but there are a lot of instance-related records to account for so I've tried to do my best there and written a pretty extensive set of tests for it. There are things we don't copy over like pci_devices and services since those are only relevant to other resources, like the compute node, in the source cell DB. Also, I messed around with doing the inserts in a single transaction like was done here [8] but couldn't really get it to work so I just left a TODO for now (maybe melwitt can figure it out). The single transaction insert is mostly low priority for me since if anything in there fails we hard destroy the target cell instance which will cascade delete any related records in the target cell DB. * The functional tests themselves can be a little daunting at first but I've tried to fully document those to make it clear what is being setup, what is happening, and what is being asserted. There are still TODOs for more testing to be done as I get more review (and time). * I have not written a docs patch yet since it could change as the series starts getting reviewed, and there is likely a lot to document (the spec is big, this email is big). When I do I will likely try to add a sequence diagram as well (and write one for normal resize since we don't have one of those in our docs today - we can do that whenever). If you have questions please reply to this email, ping me on IRC, or just ask in the code review - if you find I'm not replying in the review and rebasing over your comment, drop a -1 to make me notice (I'm currently up to 43 open changes so noticing random comments in a change in the middle is not obvious without a -1). If you made it this far, thanks, and remember that you asked for this review guide. :) [1] https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cross-... [2] https://youtu.be/OyNFIOSGjac?t=143 [3] https://review.opendev.org/#/c/641806/ [4] https://review.opendev.org/#/c/636253/ [5] https://review.opendev.org/#/c/656656/ [6] https://review.opendev.org/#/c/658478/ [7] https://review.opendev.org/#/c/631123/ [8] https://review.opendev.org/#/c/586742/ -- Thanks, Matt