[nova][ops] What should the compute service delete behavior be wrt resource providers with allocations?

Thomas Goirand zigo at debian.org
Wed Jun 12 22:50:16 UTC 2019


Hi Matt,

Hoping I can bring an operator's perspective.

On 6/12/19 10:38 PM, Matt Riedemann wrote:
> 1. Don't delete the compute service if we can't cleanup all resource
> providers - make sure to not orphan any providers. Manual cleanup may be
> necessary by the operator.

I'd say that this option is ok-ish *IF* the operators are given good
enough directives saying what to do. It would really suck if we just get
an error, and don't know what resource cleanup is needed. But if the
error is:

Cannot delete nova-compute on host mycloud-compute-5.
Instances still running:
623051e7-4e0d-4b06-b977-1d9a73e6e6e1
f8483448-39b5-4981-a731-5f4eeb28592c
Currently live-migrating:
49a12659-9dc6-4b07-b38b-e0bf2a69820a
Not confirmed migration/resize:
cc3d4311-e252-4922-bf04-dedc31b3a425

then that's fine, we know what to do. Better still, the operator will
know better than nova what to do. Maybe live-migrate the instances? Or
maybe just destroy them? Nova shouldn't attempt to second-guess what the
operator has in mind.
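
To make the idea concrete, here is a minimal Python sketch of how such a
message could be assembled from per-category lists of instance UUIDs.
Everything in it is an assumption for illustration: build_delete_refusal
and the dict-based input are hypothetical names, not nova code or API.

```python
# Hypothetical sketch: build_delete_refusal is NOT a nova function.


def build_delete_refusal(host, blockers):
    """Return an operator-facing message listing what blocks the delete.

    `blockers` maps a human-readable category to a list of instance
    UUIDs. Returns None when nothing blocks the deletion.
    """
    # Keep only the categories that actually contain instances.
    non_empty = [(label, uuids) for label, uuids in blockers.items() if uuids]
    if not non_empty:
        return None
    lines = ["Cannot delete nova-compute on host %s." % host]
    for label, uuids in non_empty:
        lines.append("%s:" % label)
        lines.extend(uuids)
    return "\n".join(lines)


message = build_delete_refusal(
    "mycloud-compute-5",
    {
        "Instances still running": [
            "623051e7-4e0d-4b06-b977-1d9a73e6e6e1",
            "f8483448-39b5-4981-a731-5f4eeb28592c",
        ],
        "Currently live-migrating": ["49a12659-9dc6-4b07-b38b-e0bf2a69820a"],
        "Not confirmed migration/resize": [
            "cc3d4311-e252-4922-bf04-dedc31b3a425"
        ],
    },
)
print(message)
```

The point is only that the refusal names each blocking instance and why
it blocks, so the operator can pick the remedy.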

> 2. Change delete_resource_provider cascade=True logic to remove all
> allocations for the provider before deleting it, i.e. for
> not-yet-complete migrations and evacuated instances. For the evacuated
> instance allocations this is likely OK since restarting the source
> compute service is going to do that cleanup anyway. Also, if you delete
> the source compute service during a migration, confirming or reverting
> the resize later will likely fail since we'd be casting to something
> that is gone (and we'd orphan those allocations). Maybe we need a
> functional recreate test for the unconfirmed migration scenario before
> deciding on this?

I don't see how this is going to help more than an evacuate command. Or
is the intent to do the evacuate and then, right after it, delete the
resource provider?

> 3. Other things I'm not thinking of? Should we add a force parameter to
> the API to allow the operator to forcefully delete (#2 above) if #1
> fails? Force parameters are hacky and usually seem to cause more
> problems than they solve, but it does put the control in the operators
> hands.

If the --force is just doing the resize --confirm for the operator, or
doing an evacuate, then that's fine (and in fact a good idea,
automation is great...). If it's going to create a mess in the DB,
then it's IMO a terrible idea.

However, I see a case that may happen: imagine a compute node is
completely broken (think: broken motherboard...). Then we probably do
want to remove everything that's in there, and want to handle the case
where nova-compute doesn't even respond. This is very much a real-life
scenario. If your --force is meant to address this case, then why not!
Though again, of course, we don't want a mess in the db... :P
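
One possible reading of the cases above, written as a small Python
decision function. This is only a sketch under my own assumptions:
plan_service_delete, its parameters and the returned action labels are
all hypothetical, and nothing here reflects how nova behaves today.

```python
# Hypothetical sketch: names and action labels are illustrative only.


def plan_service_delete(has_blockers, compute_responding, force=False):
    """Pick an action for a compute service delete request.

    has_blockers: instances or unfinished migrations still hold
        allocations against the host's resource provider.
    compute_responding: the nova-compute service on the host answers.
    force: the operator explicitly asked for a forced delete.
    """
    if not has_blockers:
        # Nothing to orphan, so a plain delete is safe.
        return "delete"
    if not force:
        # Option 1: refuse and tell the operator what is in the way.
        return "refuse"
    if compute_responding:
        # Forced, host alive: automate the steps the operator would run
        # by hand (for example confirming a resize), then delete.
        return "automate_then_delete"
    # Forced, host dead (broken motherboard...): remove what is left for
    # that host, allocations included, so the DB is not left in a mess.
    return "purge_then_delete"


print(plan_service_delete(has_blockers=True, compute_responding=False,
                          force=True))
```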

I hope this helps,

Cheers,

Thomas Goirand (zigo)



More information about the openstack-discuss mailing list