Before [1], when deleting a compute service in the API we did not check to see if the compute service was hosting any instances; we just blindly deleted the service and related compute_node(s) records, which orphaned the resource provider(s) for those nodes.

With [2] we built on that and would clean up the (until [3], only the first) compute node resource provider by first deleting any allocations for instances still on that host - which, because of the check in [1], should be none - and then deleting the resource provider itself. [2] forgot about ironic, where a single compute service can be managing many (hundreds or even thousands of) baremetal compute nodes, so I wrote [3] to delete *all* resource providers for compute nodes tied to the service - again, barring there being any instances running on the service, because of the check added in [1].

What we've failed to realize until recently is that there are cases where deleting the resource provider can still fail because there are allocations we haven't cleaned up, namely:

1. Residual allocations for evacuated instances from a source host.

2. Allocations held by a migration record for an unconfirmed (or not-yet-complete) migration.

Because the delete_resource_provider method isn't checking for those, we can get ResourceProviderInUse errors, which are then ignored [4]. Since that error is ignored, we continue on to delete the compute service record [5], effectively orphaning the providers (which is what [2] was meant to fix). I have recreated the evacuate scenario in a functional test here [6].

The question is what we should do about the fix. I'm getting lost thinking about this in a vacuum, so I'm trying to get some others to help think about it.

Clearly with [1] we said you shouldn't be able to delete a compute service that has instances on it, because that corrupts our resource tracking system. If we extend that to any allocations held against providers for that compute service, then the fix might be as simple as not ignoring the ResourceProviderInUse error and failing if we can't delete the provider(s).

The question I'm struggling with is: what does an operator do for the two cases mentioned above, not-yet-complete migrations and evacuated instances? For migrations, that seems pretty simple - wait for the migration to complete and confirm it (reverting a cold migration or resize would put the instance back on the compute service host you're trying to delete). The nastier thing is the allocations tied to an evacuated instance, since those don't get cleaned up until the compute service is restarted [7].

If the operator never intends to restart that compute service and just wants to clear the data, then they have to manually delete the allocations for the resource providers associated with that host before they can delete the compute service, which kind of sucks.
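To make that manual cleanup a bit more concrete, I think it would look something like the following with the osc-placement plugin (untested sketch; the hostnames and UUIDs are placeholders):

  # Which resource providers belong to the host being deleted?
  $ openstack resource provider list --name <compute-node-hostname>

  # What is still allocated against one of those providers?
  # (I think --allocations needs --os-placement-api-version >= 1.11.)
  $ openstack --os-placement-api-version 1.11 resource provider show \
      --allocations <rp-uuid>

  # Inspect a consumer's allocations - the consumer UUID is the instance
  # UUID for an evacuated instance, or the migration UUID for a
  # migration-held allocation.
  $ openstack resource provider allocation show <consumer-uuid>

Note that for an evacuated instance the same consumer (the instance) also holds allocations on the destination host, so you can't just "openstack resource provider allocation delete <consumer-uuid>" without wiping those too; you'd either have to rewrite the allocations with "openstack resource provider allocation set" so they only reference the destination provider, or delete them outright and heal them back afterward.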
What are our options?

1. Don't delete the compute service if we can't clean up all of its resource providers - make sure we never orphan a provider. Manual cleanup may be necessary by the operator.

2. Change the delete_resource_provider cascade=True logic to remove *all* allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated-instance allocations this is likely OK, since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?

3. Other things I'm not thinking of? Should we add a force parameter to the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.

If we did remove the allocations for an instance when deleting its compute service host, the operator should be able to get them back by running the "nova-manage placement heal_allocations" CLI - assuming they restart the compute service on that host. This would have to be tested of course.

Help me Obi-Wan Kenobi. You're my only hope.

[1] https://review.opendev.org/#/q/I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
[2] https://review.opendev.org/#/q/I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
[3] https://review.opendev.org/#/c/657016/
[4] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[5] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[6] https://review.opendev.org/#/c/663737/
[7] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...

--

Thanks,

Matt