Open Stack

Thu Jun 13 18:45:31 UTC 2019

On 6/12/2019 3:38 PM, Matt Riedemann wrote:
> What are our options?
> 
> 1. Don't delete the compute service if we can't clean up all resource 
> providers - make sure not to orphan any providers. Manual cleanup may be 
> necessary by the operator.
> 
> 2. Change delete_resource_provider cascade=True logic to remove all 
> allocations for the provider before deleting it, i.e. for 
> not-yet-complete migrations and evacuated instances. For the evacuated 
> instance allocations this is likely OK since restarting the source 
> compute service is going to do that cleanup anyway. Also, if you delete 
> the source compute service during a migration, confirming or reverting 
> the resize later will likely fail since we'd be casting to something 
> that is gone (and we'd orphan those allocations). Maybe we need a 
> functional recreate test for the unconfirmed migration scenario before 
> deciding on this?
> 
> 3. Other things I'm not thinking of? Should we add a force parameter to 
> the API to allow the operator to forcefully delete (#2 above) if #1 
> fails? Force parameters are hacky and usually seem to cause more 
> problems than they solve, but it does put the control in the operator's 
> hands.
> 
> If we did remove allocations for an instance when deleting its compute 
> service host, the operator should be able to get them back by running 
> the "nova-manage placement heal_allocations" CLI - assuming they restart 
> the compute service on that host. This would have to be tested of course.

After talking a bit about this in IRC today, I'm thinking about a phased 
approach to this problem with these changes in order:

1. Land [1] so we're at least trying to clean up all providers for a 
given compute service (the ironic case).

2. Implement option #1 above where we fail to delete the compute service 
if any of the resource providers cannot be deleted. The logs would tell 
the operator to complete any migrations and try again and, failing that, 
to clean up allocations for old evacuations. Rather than dump all of that 
info into the logs, it would probably be better to just write up a 
troubleshooting doc [2] for it and link to that from the logs; then the 
doc can reference APIs and CLIs to use for the cleanup scenarios.

3. Implement option #2 above where we clean up allocations but only for 
evacuations - like the compute service would do when it's restarted anyway.

This would still leave us refusing to delete the compute service when 
there are allocations related to other types of migrations: in-progress 
or unconfirmed (or failed and leaked) migrations, which would require 
operator investigation. We could build on that in the future, either by 
checking the service group API for whether or not the service is up, or 
by adding a force option that tells nova to fully cascade delete 
everything, but I don't really want to get hung up on those edge cases 
right now.
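
To make the end state of the three phases concrete, here is a rough 
sketch of the decision logic in plain Python. This is not nova code: 
Provider, AllocationKind, ServiceDeleteBlocked and delete_compute_service 
are all made-up names for illustration, and checking for blockers before 
touching any allocations is my assumption about the ordering, not 
something decided above.

```python
from dataclasses import dataclass, field
from enum import Enum


class AllocationKind(Enum):
    # Hypothetical classification of what holds an allocation on a provider.
    EVACUATION = "evacuation"  # left behind by an instance evacuated elsewhere
    MIGRATION = "migration"    # in-progress, unconfirmed, or failed/leaked


@dataclass
class Provider:
    # Hypothetical stand-in for one resource provider of the compute service.
    name: str
    allocations: list = field(default_factory=list)


class ServiceDeleteBlocked(Exception):
    """Raised when providers still hold allocations we will not remove."""

    def __init__(self, blocked):
        super().__init__(
            "providers still have migration allocations: " + ", ".join(blocked)
        )
        self.blocked = blocked


def delete_compute_service(providers):
    """Return the names of the providers to delete, or raise if blocked."""
    # Phase 2: refuse the delete if any provider has an allocation that is
    # not from an evacuation. Nothing is modified in this case.
    blocked = [
        p.name
        for p in providers
        if any(kind is not AllocationKind.EVACUATION for kind in p.allocations)
    ]
    if blocked:
        raise ServiceDeleteBlocked(blocked)

    # Phase 3: only evacuation allocations remain, so drop them, like the
    # compute service would do on restart.
    for p in providers:
        p.allocations = []

    # Phase 1: every provider of the service is deleted, not just one
    # (the ironic case).
    return [p.name for p in providers]
```

With this shape, a service whose providers only hold evacuation 
allocations is deleted cleanly, and anything with another kind of 
migration allocation raises and is left untouched for the operator to 
investigate.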

How do people feel about this plan?

[1] https://review.opendev.org/#/c/657016/
[2] https://docs.openstack.org/nova/latest/admin/support-compute.html

-- 

Thanks,

Matt

[nova][ops] What should the compute service delete behavior be wrt resource providers with allocations?
