On 6/12/2019 3:38 PM, Matt Riedemann wrote:
What are our options?
1. Don't delete the compute service if we can't clean up all of its resource providers, so that we never orphan a provider. Manual cleanup by the operator may be necessary.
2. Change delete_resource_provider cascade=True logic to remove all allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
3. Other things I'm not thinking of? Should we add a force parameter to the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.
If we did remove allocations for an instance when deleting its compute service host, the operator should be able to get them back by running the "nova-manage placement heal_allocations" CLI - assuming they restart the compute service on that host. This would have to be tested of course.
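To make the difference between options #1 and #2 concrete, here is a rough sketch of the decision logic using a toy in-memory model. Everything in it (Provider, ProvidersInUse, delete_compute_service, the "kind" labels) is made up for illustration and is not nova or placement code; real allocations do not carry a "kind", so nova would have to work that out from migration records.

```python
from dataclasses import dataclass, field


# Hypothetical in-memory model; none of these names are nova or placement APIs.
@dataclass
class Provider:
    name: str
    # consumer id -> assumed kind: "instance", "migration" or "evacuation"
    allocations: dict = field(default_factory=dict)


class ProvidersInUse(Exception):
    """Raised when a provider still holds allocations we refuse to drop."""


def delete_compute_service(providers, removable_kinds=frozenset()):
    """Return the names of the deleted providers, or raise ProvidersInUse.

    removable_kinds=frozenset()    -> option #1: refuse if anything is allocated.
    removable_kinds={"evacuation"} -> option #2, limited to evacuations.
    """
    blocked = {}
    for provider in providers:
        leftover = {
            consumer: kind
            for consumer, kind in provider.allocations.items()
            if kind not in removable_kinds
        }
        if leftover:
            blocked[provider.name] = leftover
    if blocked:
        # Check every provider before touching any, so nothing is half-deleted.
        raise ProvidersInUse(blocked)

    deleted = []
    for provider in providers:
        provider.allocations.clear()  # drop only the allocations we allowed
        deleted.append(provider.name)
    return deleted
```

With the default arguments a provider holding an evacuation allocation blocks the delete; passing removable_kinds={"evacuation"} lets it through, while a provider with an in-progress or unconfirmed migration allocation still blocks it and the exception says which consumers are in the way.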
After talking a bit about this in IRC today, I'm thinking about a phased approach to this problem with these changes in order:

1. Land [1] so we're at least trying to clean up all providers for a given compute service (the ironic case).

2. Implement option #1 above where we fail to delete the compute service if any of the resource providers cannot be deleted. We'd have stuff in the logs about completing migrations and trying again, and failing that, cleaning up allocations for old evacuations. Rather than dump all of that info into the logs, it would probably be better to just write up a troubleshooting doc [2] for it and link to that from the logs, then the doc can reference APIs and CLIs to use for the cleanup scenarios.

3. Implement option #2 above where we clean up allocations but only for evacuations - like the compute service would do when it's restarted anyway. We would still refuse to delete the compute service when there are allocations related to other types of migrations - in-progress or unconfirmed (or failed and leaked) migrations that would require operator investigation.

We could build on that in the future if we wanted to toy with the idea of checking the service group API for whether or not the service is up or if we wanted to add a force option to just tell nova to fully cascade delete everything, but I don't really want to get hung up on those edge cases right now.

How do people feel about this plan?

[1] https://review.opendev.org/#/c/657016/
[2] https://docs.openstack.org/nova/latest/admin/support-compute.html

-- 
Thanks,

Matt