[nova][ptg] Resource provider delete at service delete
ML thread: http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007135.html

Agreements in the room:

* Check ongoing migrations and reject the delete if a migration exists with this compute as the source node. Let the operator confirm the migrations.
* Cascade delete providers and allocations in placement.
  * In the case of evacuated instances this is the right thing to do.
  * In any other dangling allocation case nova has the final truth, so nova has the authority to delete them.
* Document possible ways to reconcile Placement with Nova using heal_allocations and eventually the audit command once it's merged.

Cheers,
gibi
On 11/10/2019 2:09 AM, Balázs Gibizer wrote:
* Check ongoing migrations and reject the delete if a migration exists with this compute as the source node. Let the operator confirm the migrations.
To be clear, the suggestion here is to call [1] from the API, around [2]? That's a behavior change, but so was blocking the delete when the compute was hosting instances [3], and we added a release note for that. Anyway, that's a pretty simple change and not really something I thought about in earlier threads on this problem. Regarding evacuate migration records, that should also work since the final states for an evacuate migration are done, failed or error, which [1] accounts for.
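If so, a rough sketch of that check in the service delete API could look something like the following (the object and exception usage here is just my reading of this thread, not the actual code at [1]/[2]):

    # Rough sketch only -- not the actual nova code at [1]/[2].
    import webob.exc
    from nova import objects

    def _assert_no_in_progress_migrations(context, host, compute_nodes):
        for node in compute_nodes:
            migrations = objects.MigrationList.get_in_progress_by_host_and_node(
                context, host, node.hypervisor_hostname)
            if migrations:
                # Reject DELETE /os-services/{id} so the operator can confirm
                # or complete the migrations before retrying the delete.
                raise webob.exc.HTTPConflict(
                    explanation='Host %s has in-progress migrations.' % host)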
* Cascade delete providers and allocations in placement.
  * In the case of evacuated instances this is the right thing to do.
OK this seems to confirm my TODO here [4].
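For what it's worth, a simplified sketch of how a cascading delete_resource_provider could handle that, including the evacuation TODO (the helper names are assumptions based on this thread, not the actual code in [4]):

    # Simplified sketch only -- not the actual report client code in [4].
    from nova import objects

    def delete_resource_provider(self, context, compute_node, cascade=False):
        if cascade:
            # Instances still assigned to this node: drop their allocations
            # before deleting the provider, otherwise placement returns 409.
            instances = objects.InstanceList.get_by_host_and_node(
                context, compute_node.host, compute_node.hypervisor_hostname)
            for instance in instances:
                self.delete_allocation_for_instance(context, instance.uuid)
            # The TODO: instances successfully evacuated away still hold an
            # allocation against this (source) provider; remove just that part.
            evacuations = objects.MigrationList.get_by_filters(
                context, {'source_compute': compute_node.host,
                          'migration_type': 'evacuation',
                          'status': 'done'})
            for migration in evacuations:
                self.remove_provider_tree_from_instance_allocation(
                    context, migration.instance_uuid, compute_node.uuid)
        # With no allocations left, the provider itself can be removed.
        self._delete_provider(compute_node.uuid)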
* In any other dangling allocation case nova has the final truth, so nova has the authority to delete them.
So this would build on the first idea above about blocking the service delete if there are in-progress migrations involving the node (either incoming or outgoing), right? So if we get to the point of deleting the provider we know (1) there are no in-progress migrations and (2) there are no instances on the host (outside of evacuated instances, which we can clean up automatically per [4]). Given that, I'm not sure there is really anything else to do here.
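That said, if something dangling did remain anyway, a cascade sweep along these lines could handle it (the report client get/delete helper usage here is an assumption):

    # Sketch only: sweep up any dangling allocations, then delete the provider.
    def _delete_provider_and_dangling_allocations(reportclient, rp_uuid):
        resp = reportclient.get('/resource_providers/%s/allocations' % rp_uuid)
        allocations = resp.json().get('allocations', {}) if resp else {}
        for consumer_uuid in allocations:
            # By this point there are no instances and no in-progress
            # migrations on the host, so any remaining consumer is dangling
            # and nova is the authority to remove it.
            reportclient.delete('/allocations/%s' % consumer_uuid)
        reportclient.delete('/resource_providers/%s' % rp_uuid)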
* Document possible ways to reconcile Placement with Nova using heal_allocations and eventually the audit command once it's merged.
Done (merged yesterday) [5].

[1] https://github.com/openstack/nova/blob/20.0.0/nova/objects/migration.py#L240
[2] https://github.com/openstack/nova/blob/20.0.0/nova/api/openstack/compute/ser...
[3] https://review.opendev.org/#/c/560674/
[4] https://review.opendev.org/#/c/678100/2/nova/scheduler/client/report.py@2165
[5] https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-alloca...

--
Thanks,
Matt
On Thu, Nov 14, 2019 at 09:35, Matt Riedemann <mriedemos@gmail.com> wrote:
On 11/10/2019 2:09 AM, Balázs Gibizer wrote:
* Check ongoing migrations and reject the delete if a migration exists with this compute as the source node. Let the operator confirm the migrations.
To be clear, the suggestion here is to call [1] from the API, around [2]? That's a behavior change, but so was blocking the delete when the compute was hosting instances [3], and we added a release note for that. Anyway, that's a pretty simple change and not really something I thought about in earlier threads on this problem. Regarding evacuate migration records, that should also work since the final states for an evacuate migration are done, failed or error, which [1] accounts for.
Yeah, [1] called at [2] sounds good to me.

Regarding evacuation records: if the evacuation succeeded, i.e. the migration is in 'done' state, then we are OK. But if it finished in 'error' or 'failed' state then we still have an instance on the host, so we should not allow deleting the compute service. As far as I can see, get_count_by_hosts will cover this case.
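Put another way (illustrative helper only, not anything in nova; the status values are the real final states):

    # Illustrative only: how the final evacuation status maps to the decision.
    def evacuation_blocks_service_delete(migration):
        # 'done': the instance now lives on the destination; only a stale
        # allocation remains on the source node, which the cascade delete is
        # meant to clean up.
        # 'error' / 'failed': the instance is still on the source host, so the
        # instance count check (get_count_by_hosts) blocks the service delete.
        return (migration.migration_type == 'evacuation'
                and migration.status in ('error', 'failed'))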
* Cascade delete providers and allocations in placement.
  * In the case of evacuated instances this is the right thing to do.
OK this seems to confirm my TODO here [4].
* In any other dangling allocation case nova has the final truth, so nova has the authority to delete them.
So this would build on the first idea above about blocking the service delete if there are in-progress migrations involving the node (either incoming or outgoing), right? So if we get to the point of deleting the provider we know (1) there are no in-progress migrations and (2) there are no instances on the host (outside of evacuated instances, which we can clean up automatically per [4]). Given that, I'm not sure there is really anything else to do here.
In theory there cannot be any other allocation on the compute RP tree if there is no instance on the host and no ongoing migrations involving the host. But still, I guess we need to cascade the delete to make sure that orphaned allocations (which are a bug themselves, but we know that they happen) are cleaned up when the service is deleted.

cheers,
gibi
* Document possible ways to reconcile Placement with Nova using heal_allocations and eventually the audit command once it's merged.
Done (merged yesterday) [5].
[1] https://github.com/openstack/nova/blob/20.0.0/nova/objects/migration.py#L240
[2] https://github.com/openstack/nova/blob/20.0.0/nova/api/openstack/compute/ser...
[3] https://review.opendev.org/#/c/560674/
[4] https://review.opendev.org/#/c/678100/2/nova/scheduler/client/report.py@2165
[5] https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-alloca...
--
Thanks,
Matt
On 11/14/2019 10:06 AM, Balázs Gibizer wrote:
If the evacuation succeeded, i.e. the migration is in 'done' state, then we are OK. But if it finished in 'error' or 'failed' state then we still have an instance on the host, so we should not allow deleting the compute service. As far as I can see, get_count_by_hosts will cover this case.
If the evacuation succeeded then we need to detect it and clean up the allocation from the evacuated-from host, because get_count_by_hosts won't catch that case (that's bug 1829479). That's the TODO in my patch. If the evacuation failed, I agree with you that get_count_by_hosts should detect it and block the deletion of the service since the instance is still hosted there in the DB.

--
Thanks,
Matt
On 11/14/2019 10:06 AM, Balázs Gibizer wrote:
Yeah, [1] called at [2] sounds good to me.
Done, with functional recreate test patches underneath: https://review.opendev.org/#/c/694389/

--
Thanks,
Matt