Re: [nova][ops] What should the compute service delete behavior be wrt resource providers with allocations?

13 Jun 2019

On 6/12/2019 3:38 PM, Matt Riedemann wrote:
...
What are our options?
1. Don't delete the compute service if we can't clean up all resource
providers - make sure to not orphan any providers. Manual cleanup may be
necessary by the operator.
2. Change delete_resource_provider cascade=True logic to remove all 
allocations for the provider before deleting it, i.e. for 
not-yet-complete migrations and evacuated instances. For the evacuated 
instance allocations this is likely OK since restarting the source 
compute service is going to do that cleanup anyway. Also, if you delete 
the source compute service during a migration, confirming or reverting 
the resize later will likely fail since we'd be casting to something 
that is gone (and we'd orphan those allocations). Maybe we need a 
functional recreate test for the unconfirmed migration scenario before 
deciding on this?
3. Other things I'm not thinking of? Should we add a force parameter to 
the API to allow the operator to forcefully delete (#2 above) if #1 
fails? Force parameters are hacky and usually seem to cause more 
problems than they solve, but it does put the control in the operator's
hands.
If we did remove allocations for an instance when deleting its compute
service host, the operator should be able to get them back by running 
the "nova-manage placement heal_allocations" CLI - assuming they restart 
the compute service on that host. This would have to be tested of course.

After talking a bit about this in IRC today, I'm thinking about a phased
approach to this problem with these changes in order:

1. Land [1] so we're at least trying to clean up all providers for a
given compute service (the ironic case).

2. Implement option #1 above where we fail to delete the compute service
if any of the resource providers cannot be deleted. We'd have stuff in
the logs about completing migrations and trying again and, failing that,
cleaning up allocations for old evacuations. Rather than dump all of that
info into the logs, it would probably be better to just write up a
troubleshooting doc [2] for it and link to that from the logs; the
doc can then reference APIs and CLIs to use for the cleanup scenarios.

3. Implement option #2 above where we clean up allocations but only for
evacuations - like the compute service would do when it's restarted anyway.

This would still leave the case where we refuse to delete the compute
service because of allocations related to other types of migrations:
in-progress or unconfirmed (or failed and leaked) migrations that would
require operator investigation. We could build on that in the future if we
wanted to toy with the idea of checking the service group API for
whether or not the service is up, or if we wanted to add a force option
to just tell nova to fully cascade delete everything, but I don't really
want to get hung up on those edge cases right now.
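
To make the end state after steps 2 and 3 a bit more concrete, here is a
rough Python sketch of the decision logic. This is not nova code: the
Provider class, the AllocationKind enum, ServiceDeleteBlocked and
delete_compute_service() are all hypothetical names made up for
illustration, and real allocations would live in placement rather than in
a dict. It only shows the intended behavior: evacuation allocations get
cleaned up, anything else blocks the delete and is reported back.

```python
from dataclasses import dataclass, field
from enum import Enum


class AllocationKind(Enum):
    # Hypothetical classification of what is still consuming a provider.
    EVACUATION = "evacuation"  # left behind by an instance evacuated away
    MIGRATION = "migration"    # in-progress, unconfirmed, or leaked migration


@dataclass
class Provider:
    # Stand-in for a resource provider; maps consumer id -> AllocationKind.
    name: str
    allocations: dict = field(default_factory=dict)


class ServiceDeleteBlocked(Exception):
    """Some provider still has allocations that need operator attention."""


def delete_compute_service(providers):
    # Check every provider first so nothing is modified if the delete
    # is going to be refused (step 2: never orphan a provider).
    blockers = {}
    for provider in providers:
        consumers = sorted(
            consumer
            for consumer, kind in provider.allocations.items()
            if kind is not AllocationKind.EVACUATION
        )
        if consumers:
            blockers[provider.name] = consumers
    if blockers:
        # The caller would log this and point at the troubleshooting doc.
        raise ServiceDeleteBlocked(blockers)

    # Step 3: only evacuation allocations remain, so drop them and
    # delete each provider.
    deleted = []
    for provider in providers:
        provider.allocations.clear()
        deleted.append(provider.name)
    return deleted
```

With something like this, a host that only has leftover evacuation
allocations is deleted cleanly, while a host with an unconfirmed
migration raises ServiceDeleteBlocked and leaves every provider
untouched.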

How do people feel about this plan?

[1] https://review.opendev.org/#/c/657016/
[2] https://docs.openstack.org/nova/latest/admin/support-compute.html

-- 

Thanks,

Matt