[nova][ops] What should the compute service delete behavior be wrt resource providers with allocations?
mriedemos at gmail.com
Thu Jun 13 18:45:31 UTC 2019
On 6/12/2019 3:38 PM, Matt Riedemann wrote:
> What are our options?
> 1. Don't delete the compute service if we can't clean up all resource
> providers - make sure to not orphan any providers. Manual cleanup may be
> necessary by the operator.
> 2. Change delete_resource_provider cascade=True logic to remove all
> allocations for the provider before deleting it, i.e. for
> not-yet-complete migrations and evacuated instances. For the evacuated
> instance allocations this is likely OK since restarting the source
> compute service is going to do that cleanup anyway. Also, if you delete
> the source compute service during a migration, confirming or reverting
> the resize later will likely fail since we'd be casting to something
> that is gone (and we'd orphan those allocations). Maybe we need a
> functional recreate test for the unconfirmed migration scenario before
> deciding on this?
> 3. Other things I'm not thinking of? Should we add a force parameter to
> the API to allow the operator to forcefully delete (#2 above) if #1
> fails? Force parameters are hacky and usually seem to cause more
> problems than they solve, but it does put the control in the operator's
> hands.
>
> If we did remove allocations for an instance when deleting its compute
> service host, the operator should be able to get them back by running
> the "nova-manage placement heal_allocations" CLI - assuming they restart
> the compute service on that host. This would have to be tested of course.
After talking a bit about this in IRC today, I'm thinking about a phased
approach to this problem with these changes in order:
1. Land  so we're at least trying to clean up all providers for a
given compute service (the ironic case).
2. Implement option #1 above where we fail to delete the compute service
if any of the resource providers cannot be deleted. We'd have stuff in
the logs about completing migrations and trying again and, failing that,
cleaning up allocations for old evacuations. Rather than dump all of that
info into the logs, it would probably be better to just write up a
troubleshooting doc  for it and link to that from the logs; the
doc can then reference APIs and CLIs to use for the cleanup scenarios.
3. Implement option #2 above where we clean up allocations but only for
evacuations - like the compute service would do when it's restarted anyway.
This would still leave the case where we don't delete the compute
service because of allocations related to other types of migrations -
in-progress or unconfirmed (or failed and leaked) migrations - which
would require operator investigation. We could build on that in the
future if we
wanted to toy with the idea of checking the service group API for
whether or not the service is up or if we wanted to add a force option
to just tell nova to fully cascade delete everything, but I don't really
want to get hung up on those edge cases right now.
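
To make the phased behavior concrete, here is a minimal sketch of the
decision logic using plain in-memory objects. It is not nova or placement
code: Provider, AllocationKind, ProviderInUse and delete_compute_service
are hypothetical names made up for illustration, and it assumes we can
already tell which allocations belong to evacuations. It also assumes we
check for blocking allocations before removing anything, so a refused
delete leaves everything untouched.

```python
from dataclasses import dataclass, field
from enum import Enum


class AllocationKind(Enum):
    """Hypothetical classification of what holds an allocation."""
    EVACUATION = "evacuation"  # instance was evacuated off this host
    MIGRATION = "migration"    # in-progress, unconfirmed or leaked migration


@dataclass
class Provider:
    """Hypothetical stand-in for a resource provider and its allocations."""
    name: str
    # consumer id -> kind of allocation held against this provider
    allocations: dict = field(default_factory=dict)


class ProviderInUse(Exception):
    """Raised when the service cannot be deleted (phase 2 behavior)."""

    def __init__(self, blocked):
        super().__init__(f"providers still have allocations: {blocked}")
        self.blocked = blocked


def delete_compute_service(providers, cleanup_evacuations=True):
    """Return the names of the providers that would be deleted.

    cleanup_evacuations=False models phase 2 alone (any allocation
    blocks the delete); True models phase 3 (evacuation allocations
    are removed, everything else still blocks).
    """
    def blocks(kind):
        return not (cleanup_evacuations
                    and kind is AllocationKind.EVACUATION)

    # Phase 2: refuse the delete if any provider has a blocking allocation.
    blocked = {}
    for provider in providers:
        consumers = sorted(c for c, kind in provider.allocations.items()
                           if blocks(kind))
        if consumers:
            blocked[provider.name] = consumers
    if blocked:
        raise ProviderInUse(blocked)

    # Phase 3: only evacuation allocations can remain here; drop them.
    for provider in providers:
        provider.allocations.clear()

    # Phase 1: every provider for the service is handled, not just one
    # (the ironic case, where one service manages many nodes).
    return [provider.name for provider in providers]
```

With this sketch, a host whose providers only hold evacuation allocations
is deleted cleanly, while a host with an unconfirmed migration raises
ProviderInUse and lists the consumers the operator needs to look at.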
How do people feel about this plan?