Hi Matt,
Hoping I can bring an operator's perspective.
On 6/12/19 10:38 PM, Matt Riedemann wrote:
> - Don't delete the compute service if we can't cleanup all resource
> providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
I'd say that this option is ok-ish *IF* the operators are given good enough directives saying what to do. It would really suck if we just get an error, and don't know what resource cleanup is needed. But if the error is:
Cannot delete nova-compute on host mycloud-compute-5.
Instances still running:
623051e7-4e0d-4b06-b977-1d9a73e6e6e1
f8483448-39b5-4981-a731-5f4eeb28592c
Currently live-migrating:
49a12659-9dc6-4b07-b38b-e0bf2a69820a
Not confirmed migration/resize:
cc3d4311-e252-4922-bf04-dedc31b3a425
then that's fine, we know what to do. Even better: the operator will know better than nova what to do. Maybe live-migrate the instances? Or maybe just destroy them? Nova shouldn't attempt to second-guess what the operator has in mind.
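To make the idea concrete, here is a minimal sketch of how such a message could be assembled. The function name build_delete_refusal and the blockers mapping are hypothetical, made up for illustration; this is not existing nova code, only the shape of the output I have in mind.

```python
from typing import Dict, List, Optional


def build_delete_refusal(host: str, blockers: Dict[str, List[str]]) -> Optional[str]:
    """Return an operator-facing error message, or None if nothing blocks deletion.

    `blockers` (hypothetical structure) maps a human-readable reason to the
    UUIDs of the instances currently in that state on the host.
    """
    lines: List[str] = []
    for reason, uuids in blockers.items():
        if uuids:  # skip categories that have no instances in them
            lines.append(f"{reason}:")
            lines.extend(uuids)
    if not lines:
        return None  # nothing left on the host, deletion can proceed
    return "\n".join([f"Cannot delete nova-compute on host {host}."] + lines)


if __name__ == "__main__":
    print(build_delete_refusal(
        "mycloud-compute-5",
        {
            "Instances still running": ["623051e7-4e0d-4b06-b977-1d9a73e6e6e1"],
            "Currently live-migrating": [],
            "Not confirmed migration/resize": ["cc3d4311-e252-4922-bf04-dedc31b3a425"],
        },
    ))
```

The point is only that every blocking instance is listed under the reason it blocks, so the operator can decide what to do with each one.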
> - Change delete_resource_provider cascade=True logic to remove all
> allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
I don't see how this is going to help more than an evacuate command. Or is the intent to do the evacuate and then, right after it, delete the resource provider?
> - Other things I'm not thinking of? Should we add a force parameter to
> the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operators hands.
If the --force is just doing the resize --confirm for the operator, or doing an evacuate, then that's fine (and in fact a good idea, automation is great...). If it's going to create a mess in the DB, then it's IMO a terrible idea.
However, I see a case that may happen: imagine a compute node is completely broken (think: broken motherboard...). Then we probably do want to remove everything that's on it, and we want to handle the case where nova-compute doesn't even respond. This very much is a real-life scenario. If your --force is meant to address this case, then why not! Though again, and of course, we don't want a mess in the DB... :P
I hope this helps,
Cheers,
Thomas Goirand (zigo)