Before [1], when deleting a compute service in the API we did not check to see if the compute service was hosting any instances; we just blindly deleted the service and related compute_node(s) records, which orphaned the resource provider(s) for those nodes.

With [2] we built on that and would clean up the (until [3], only the first) compute node resource provider by first deleting any allocations for instances still on that host - which, because of the check in [1], should be none - and then deleting the resource provider itself. [2] forgot about ironic, where a single compute service can be managing many (hundreds or even thousands of) baremetal compute nodes, so I wrote [3] to delete *all* resource providers for compute nodes tied to the service - again, barring there being any instances running on the service, because of the check added in [1].

What we've failed to realize until recently is that there are cases where deleting the resource provider can still fail because there are allocations we haven't cleaned up, namely:

1. Residual allocations for evacuated instances from a source host.

2. Allocations held by a migration record for an unconfirmed (or not-yet-complete) migration.

Because the delete_resource_provider method isn't checking for those, we can get ResourceProviderInUse errors, which are then ignored [4]. Since that error is ignored, we continue on to delete the compute service record [5], effectively orphaning the providers (which is what [2] was meant to fix). I have recreated the evacuate scenario in a functional test here [6].

The question is what we should do about the fix. I'm getting lost thinking about this in a vacuum, so I'm trying to get some others to help think about it.

Clearly with [1] we said you shouldn't be able to delete a compute service that has instances on it, because that corrupts our resource tracking system. If we extend that to any allocations held against providers for that compute service, then the fix might be as simple as not ignoring the ResourceProviderInUse error and failing if we can't delete the provider(s).

The question I'm struggling with is: what does an operator do for the two cases mentioned above, not-yet-complete migrations and evacuated instances? For migrations, that seems pretty simple - wait for the migration to complete and confirm it (reverting a cold migration or resize would put the instance back on the compute service host you're trying to delete). The nastier thing is the allocations tied to an evacuated instance, since those don't get cleaned up until the compute service is restarted [7].

If the operator never intends to restart that compute service and just wants to clear the data, then they have to manually delete the allocations for the resource providers associated with that host before they can delete the compute service, which kind of sucks.
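To make that manual cleanup a bit more concrete, I think it would look something like the following with the osc-placement plugin (untested sketch; the hostnames and UUIDs are placeholders):

  # Which resource providers belong to the host being deleted?
  $ openstack resource provider list --name <compute-node-hostname>

  # What is still allocated against one of those providers?
  # (I think --allocations needs --os-placement-api-version >= 1.11.)
  $ openstack --os-placement-api-version 1.11 resource provider show \
      --allocations <rp-uuid>

  # Inspect a consumer's allocations - the consumer UUID is the instance
  # UUID for an evacuated instance, or the migration UUID for a
  # migration-held allocation.
  $ openstack resource provider allocation show <consumer-uuid>

Note that for an evacuated instance the same consumer (the instance) also holds allocations on the destination host, so you can't just "openstack resource provider allocation delete <consumer-uuid>" without wiping those too; you'd either have to rewrite the allocations with "openstack resource provider allocation set" so they only reference the destination provider, or delete them outright and heal them back afterward.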
What are our options?

1. Don't delete the compute service if we can't clean up all of its resource providers - make sure we never orphan a provider. Manual cleanup may be necessary by the operator.

2. Change the delete_resource_provider cascade=True logic to remove *all* allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated-instance allocations this is likely OK, since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?

3. Other things I'm not thinking of? Should we add a force parameter to the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.

If we did remove the allocations for an instance when deleting its compute service host, the operator should be able to get them back by running the "nova-manage placement heal_allocations" CLI - assuming they restart the compute service on that host. This would have to be tested of course.

Help me Obi-Wan Kenobi. You're my only hope.

[1] https://review.opendev.org/#/q/I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
[2] https://review.opendev.org/#/q/I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
[3] https://review.opendev.org/#/c/657016/
[4] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[5] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[6] https://review.opendev.org/#/c/663737/
[7] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...

--

Thanks,

Matt