[nova][ops] What should the compute service delete behavior be wrt resource providers with allocations?
Before [1] when deleting a compute service in the API we did not check to see if the compute service was hosting any instances and just blindly deleted the service and related compute_node(s) records which orphaned the resource provider(s) for those nodes.
With [2] we built on that and would cleanup the (first [3]) compute node resource provider by first deleting any allocations for instances still on that host - which because of the check in [1] should be none - and then deleted the resource provider itself.
[2] forgot about ironic where a single compute service can be managing multiple (hundreds or even thousands) of baremetal compute nodes so I wrote [3] to delete *all* resource providers for compute nodes tied to the service - again barring there being any instances running on the service because of the check added in [1].
What we've failed to realize until recently is that there are cases where deleting the resource provider can still fail because there are allocations we haven't cleaned up, namely:
1. Residual allocations for evacuated instances from a source host.
2. Allocations held by a migration record for an unconfirmed (or not yet complete) migration.
Because the delete_resource_provider method isn't checking for those, we can get ResourceProviderInUse errors which are then ignored [4]. Since that error is ignored, we continue on to delete the compute service record [5], effectively orphaning the providers (which is what [2] was meant to fix). I have recreated the evacuate scenario in a functional test here [6].
The question is what should we do about the fix? I'm getting lost thinking about this in a vacuum so trying to get some others to help think about it.
Clearly with [1] we said you shouldn't be able to delete a compute service that has instances on it because that corrupts our resource tracking system. If we extend that to any allocations held against providers for that compute service, then the fix might be as simple as not ignoring the ResourceProviderInUse error and fail if we can't delete the provider(s).
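To make that "simple" fix concrete, here is a minimal Python sketch of the control flow. It is not nova's actual code: `delete_service`, `ServiceDeleteConflict` and the two callables are hypothetical names for illustration, and `ResourceProviderInUse` is redefined locally as a stand-in for the real exception.

```python
class ResourceProviderInUse(Exception):
    """Stand-in for the error raised when a provider still has allocations."""


class ServiceDeleteConflict(Exception):
    """Hypothetical error that would be returned to the API caller."""


def delete_service(service, provider_uuids, delete_provider,
                   delete_service_record):
    """Try to delete every provider; delete the service only if all succeed.

    delete_provider and delete_service_record are injected callables
    (assumptions for this sketch) so the flow can run without nova.
    """
    in_use = []
    for rp_uuid in provider_uuids:
        try:
            delete_provider(rp_uuid)
        except ResourceProviderInUse:
            # Today this error is ignored; here it is collected instead.
            in_use.append(rp_uuid)
    if in_use:
        # Fail the request rather than orphan the remaining providers.
        raise ServiceDeleteConflict(
            "Unable to delete compute service %s: resource providers %s "
            "still have allocations" % (service, ", ".join(in_use)))
    delete_service_record(service)
```

The service record is only removed once every provider is gone, so a failed request leaves nothing orphaned and can be retried after cleanup.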
The question I'm struggling with is what does an operator do for the two cases mentioned above, not-yet-complete migrations and evacuated instances? For migrations, that seems pretty simple - wait for the migration to complete and confirm it (reverting a cold migration or resize would put the instance back on the compute service host you're trying to delete).
The nastier thing is the allocations tied to an evacuated instance since those don't get cleaned up until the compute service is restarted [7]. If the operator never intends on restarting that compute service and just wants to clear the data, then they have to manually delete the allocations for the resource providers associated with that host before they can delete the compute service, which kind of sucks.
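As a rough illustration of that manual cleanup, the sketch below only computes what a consumer's allocations would look like with the source provider removed. It assumes the input has the general shape of a placement `GET /allocations/{consumer_uuid}` response; `drop_provider_from_allocations` is a hypothetical helper, not a nova or placement function, and the exact fields required when writing the result back depend on the placement microversion in use.

```python
import copy


def drop_provider_from_allocations(consumer_allocs, source_rp_uuid):
    """Return a copy of a consumer's allocations without one provider.

    consumer_allocs is assumed to look like:
        {"allocations": {rp_uuid: {"resources": {...}}, ...}, ...}
    The input is not modified.
    """
    remaining = copy.deepcopy(consumer_allocs)
    # Drop only the allocation against the source host's provider.
    remaining["allocations"].pop(source_rp_uuid, None)
    return remaining
```

The result would then be written back for that consumer. Deleting the consumer's allocations outright would also remove whatever it holds against other providers, which is why only the source provider entry is dropped here.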
What are our options?
1. Don't delete the compute service if we can't cleanup all resource providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
2. Change delete_resource_provider cascade=True logic to remove all allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
3. Other things I'm not thinking of? Should we add a force parameter to the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.
If we did remove allocations for an instance when deleting its compute service host, the operator should be able to get them back by running the "nova-manage placement heal_allocations" CLI - assuming they restart the compute service on that host. This would have to be tested of course.
Help me Obi-Wan Kenobi. You're my only hope.
[1] https://review.opendev.org/#/q/I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
[2] https://review.opendev.org/#/q/I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
[3] https://review.opendev.org/#/c/657016/
[4] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[5] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[6] https://review.opendev.org/#/c/663737/
[7] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
- Change delete_resource_provider cascade=True logic to remove all
allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
This seems like a win to me.
If we can distinguish between the migratey ones and the evacuatey ones, maybe we fail on the former (forcing them to wait for completion) and automatically delete the latter (which is almost always okay for the reasons you state; and recoverable via heal if it's not okay for some reason).
efried .
On Wed, 2019-06-12 at 17:36 -0500, Eric Fried wrote:
- Change delete_resource_provider cascade=True logic to remove all
allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
This seems like a win to me.
If we can distinguish between the migratey ones and the evacuatey ones, maybe we fail on the former (forcing them to wait for completion) and automatically delete the latter (which is almost always okay for the reasons you state; and recoverable via heal if it's not okay for some reason).
For a cold migration the allocation will be associated with a migration object. For evacuate, which is basically a rebuild to a different host, we do not have a migration object, so the consumer UUID for the allocation is still the instance UUID, not a migration UUID. So technically we can tell, yes, but only if we pull back the allocations from placement and then iterate over them and check if we have a migration object or an instance that has the same UUID.
In the evac case we should also be able to tell that it's an evac, as the UUID will match an instance but the instance host will not match the RP name the allocation is associated with.
So we can figure this out on the nova side by looking at either the instances table or the migrations table,
or in the future, when we have consumer types in placement, that will also make this simpler to do as the info will be in the allocation itself.
Personally I like option 2, but yes, we could selectively force for evac only if we wanted.
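The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `classify_consumer` is a hypothetical helper, and the migration UUIDs and instance-to-host mapping are assumed to have already been looked up from the nova database.

```python
def classify_consumer(consumer_uuid, provider_host, migration_uuids,
                      instance_hosts):
    """Guess what kind of consumer holds an allocation on a provider.

    migration_uuids: set of known migration record UUIDs (assumed input).
    instance_hosts: dict mapping instance UUID -> instance.host
        (assumed input).
    provider_host: name of the host/RP the allocation is against.
    """
    if consumer_uuid in migration_uuids:
        # The allocation is held by a migration record.
        return "migration"
    if consumer_uuid in instance_hosts:
        if instance_hosts[consumer_uuid] != provider_host:
            # The instance now lives elsewhere, so this is residue
            # left on the source, as in the evacuate case.
            return "evacuated"
        return "instance"
    return "unknown"
```

With a classification like this, the delete path could fail on "migration" consumers and clean up only the "evacuated" ones.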
efried .
On 6/12/2019 7:05 PM, Sean Mooney wrote:
If we can distinguish between the migratey ones and the evacuatey ones, maybe we fail on the former (forcing them to wait for completion) and automatically delete the latter (which is almost always okay for the reasons you state; and recoverable via heal if it's not okay for some reason).
For a cold migration the allocation will be associated with a migration object. For evacuate, which is basically a rebuild to a different host, we do not have a migration object, so the consumer UUID for the allocation is still the instance UUID, not a migration UUID. So technically we can tell, yes, but only if we pull back the allocations from placement and then iterate over them and check if we have a migration object or an instance that has the same UUID.
Evacuate operations do have a migration record but you're right that we don't move the source node allocations from the instance to the migration prior to scheduling (like we do for cold and live migration). So after the evacuation, the instance consumer has allocations on both the source and dest node.
If we did what Eric is suggesting, which is kind of a mix between option 1 and option 2, then I'd do the same query as we have on restart of the compute service [1] to find migration records for evacuations concerning the host we're being asked to delete within a certain status and clean those up, then (re?)try the resource provider delete - and if that fails, then we punt and fail the request to delete the compute service because we couldn't safely delete the resource provider (and we don't want to orphan it for the reasons mnaser pointed out).
[1] https://github.com/openstack/nova/blob/61558f274842b149044a14bbe7537b9f27803...
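A minimal sketch of that "clean up evacuations, retry, then punt" flow is below. It is not the real nova code path: `delete_provider_or_fail` and its injected callables are hypothetical, and `ResourceProviderInUse` is a local stand-in for the real exception.

```python
class ResourceProviderInUse(Exception):
    """Stand-in for the error raised when a provider still has allocations."""


def delete_provider_or_fail(rp_uuid, find_evacuated_instances,
                            remove_allocation, delete_provider):
    """Delete a provider, cleaning up only evacuation leftovers.

    find_evacuated_instances() is assumed to return the UUIDs of instances
    evacuated from this host, e.g. from the same migration-record query
    used on compute service restart.
    """
    try:
        delete_provider(rp_uuid)
        return
    except ResourceProviderInUse:
        pass  # Fall through and try the evacuation cleanup.
    for instance_uuid in find_evacuated_instances():
        remove_allocation(instance_uuid, rp_uuid)
    # Retry once. If other allocations remain (e.g. an unconfirmed
    # migration), ResourceProviderInUse propagates and the caller
    # fails the service delete.
    delete_provider(rp_uuid)
```

Allocations held by anything other than an evacuated instance are left alone, so the request still fails for in-progress or unconfirmed migrations.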
Hi Matt,
Hoping I can bring an operator's perspective.
On 6/12/19 10:38 PM, Matt Riedemann wrote:
- Don't delete the compute service if we can't cleanup all resource
providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
I'd say that this option is ok-ish *IF* the operators are given good enough directives saying what to do. It would really suck if we just get an error, and don't know what resource cleanup is needed. But if the error is:
Cannot delete nova-compute on host mycloud-compute-5.
Instances still running:
623051e7-4e0d-4b06-b977-1d9a73e6e6e1
f8483448-39b5-4981-a731-5f4eeb28592c
Currently live-migrating:
49a12659-9dc6-4b07-b38b-e0bf2a69820a
Not confirmed migration/resize:
cc3d4311-e252-4922-bf04-dedc31b3a425
then that's fine, we know what to do. And better: the operator will know better than nova what to do. Maybe live-migrate the instances? Or maybe just destroy them? Nova shouldn't attempt to double-guess what the operator has in mind.
- Change delete_resource_provider cascade=True logic to remove all
allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
I don't see how this is going to help more than an evacuate command. Or is the intent to do the evacuate, then right after it, the deletion of the resource provider?
- Other things I'm not thinking of? Should we add a force parameter to
the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.
Let's say the --force is just doing the resize --confirm for the operator, or do an evacuate, then that's fine (and in fact, a good idea, automations are great...). If it's going to create a mess in the DB, then it's IMO a terrible idea.
However, I see a case that may happen: imagine a compute node is completely broken (think: broken motherboard...), then probably we do want to remove everything that's in there, and want to handle the case where nova-compute doesn't even respond. This very much is a real life scenario. If your --force is to address this case, then why not! Though again and of course, we don't want a mess in the db... :P
I hope this helps,
Cheers,
Thomas Goirand (zigo)
On 6/12/2019 5:50 PM, Thomas Goirand wrote:
- Don't delete the compute service if we can't cleanup all resource
providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
I'd say that this option is ok-ish *IF* the operators are given good enough directives saying what to do. It would really suck if we just get an error, and don't know what resource cleanup is needed. But if the error is:
Cannot delete nova-compute on host mycloud-compute-5.
Instances still running:
623051e7-4e0d-4b06-b977-1d9a73e6e6e1
f8483448-39b5-4981-a731-5f4eeb28592c
Currently live-migrating:
49a12659-9dc6-4b07-b38b-e0bf2a69820a
Not confirmed migration/resize:
cc3d4311-e252-4922-bf04-dedc31b3a425
I don't think we'll realistically generate a report like this for an error response in the API. While we could figure this out, for the baremetal case we could have hundreds of instances still managed by that compute service host which is a lot of data to generate for an error response.
I guess it could be a warning dumped into the API logs but it could still be a lot of data to crunch and log.
On 6/13/19 7:40 PM, Matt Riedemann wrote:
On 6/12/2019 5:50 PM, Thomas Goirand wrote:
- Don't delete the compute service if we can't cleanup all resource
providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
I'd say that this option is ok-ish *IF* the operators are given good enough directives saying what to do. It would really suck if we just get an error, and don't know what resource cleanup is needed. But if the error is:
Cannot delete nova-compute on host mycloud-compute-5.
Instances still running:
623051e7-4e0d-4b06-b977-1d9a73e6e6e1
f8483448-39b5-4981-a731-5f4eeb28592c
Currently live-migrating:
49a12659-9dc6-4b07-b38b-e0bf2a69820a
Not confirmed migration/resize:
cc3d4311-e252-4922-bf04-dedc31b3a425
I don't think we'll realistically generate a report like this for an error response in the API. While we could figure this out, for the baremetal case we could have hundreds of instances still managed by that compute service host which is a lot of data to generate for an error response.
I guess it could be a warning dumped into the API logs but it could still be a lot of data to crunch and log.
In such case, in the error message, just suggest what to do to fix the issue.
I once worked in a company that made me change every error message so that each of them contained hints on what to do to fix the problem. Since then, I often suggest it.
Cheers,
Thomas Goirand (zigo)
On 6/13/2019 4:03 PM, Thomas Goirand wrote:
I once worked in a company that made me change every error message so that each of them contained hints on what to do to fix the problem. Since then, I often suggest it.
Heh, same and while it was grueling for the developers it left an impression on me and I tend to try and nack people's changes for crappy error messages as a result.
On 6/12/2019 5:50 PM, Thomas Goirand wrote:
- Other things I'm not thinking of? Should we add a force parameter to
the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.
Let's say the --force is just doing the resize --confirm for the operator, or do an evacuate, then that's fine (and in fact, a good idea, automations are great...). If it's going to create a mess in the DB, then it's IMO a terrible idea.
I really don't think we're going to change the delete compute service API into an orchestrator that auto-confirms/evacuates the node(s) for you. This is something an external agent / script / service could determine, perform whatever actions, and retry, based on existing APIs (like the migrations API). The one catch is the evacuated instance allocations - there is not much you can do about those from the compute API, you would have to cleanup the allocations for those via the placement API directly.
However, I see a case that may happen: imagine a compute node is completely broken (think: broken motherboard...), then probably we do want to remove everything that's in there, and want to handle the case where nova-compute doesn't even respond. This very much is a real life scenario. If your --force is to address this case, then why not! Though again and of course, we don't want a mess in the db... :P
Well, that's where a force parameter would be available to the admin to decide what they want to happen depending on the situation rather than just have nova guess and hope it's what you wanted.
We could check if the service is "up" using the service group API and make some determinations that way, i.e. if there are still allocations on the thing and it's down, assume you're deleting it because it's dead and you want it gone so we just cleanup the allocations for you.
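That heuristic might look something like the sketch below. It is purely illustrative: `decide_delete_action` is a hypothetical function, the `service_is_up` flag is assumed to come from the service group API, and rejecting the request when the service is up but still has allocations is an assumption borrowed from option #1 rather than something settled in this thread.

```python
def decide_delete_action(service_is_up, has_allocations):
    """Pick what the delete-service API would do under this heuristic.

    Returns one of:
      "delete"  - nothing is allocated, delete providers and service.
      "cascade" - the service is down, so assume it is dead and clean
                  up its allocations before deleting.
      "reject"  - the service is up and still has allocations, so fail
                  and let the operator sort it out (assumption).
    """
    if not has_allocations:
        return "delete"
    if not service_is_up:
        return "cascade"
    return "reject"
```

The trade-off is that nova would be guessing intent from liveness alone, which is the kind of guess an explicit force parameter would avoid.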
On Wed, Jun 12, 2019 at 4:44 PM Matt Riedemann mriedemos@gmail.com wrote:
Before [1] when deleting a compute service in the API we did not check to see if the compute service was hosting any instances and just blindly deleted the service and related compute_node(s) records which orphaned the resource provider(s) for those nodes.
With [2] we built on that and would cleanup the (first [3]) compute node resource provider by first deleting any allocations for instances still on that host - which because of the check in [1] should be none - and then deleted the resource provider itself.
[2] forgot about ironic where a single compute service can be managing multiple (hundreds or even thousands) of baremetal compute nodes so I wrote [3] to delete *all* resource providers for compute nodes tied to the service - again barring there being any instances running on the service because of the check added in [1].
What we've failed to realize until recently is that there are cases where deleting the resource provider can still fail because there are allocations we haven't cleaned up, namely:
Residual allocations for evacuated instances from a source host.
Allocations held by a migration record for an unconfirmed (or not yet
complete) migration.
Because the delete_resource_provider method isn't checking for those, we can get ResourceProviderInUse errors which are then ignored [4]. Since that error is ignored, we continue on to delete the compute service record [5], effectively orphaning the providers (which is what [2] was meant to fix). I have recreated the evacuate scenario in a functional test here [6].
The question is what should we do about the fix? I'm getting lost thinking about this in a vacuum so trying to get some others to help think about it.
Clearly with [1] we said you shouldn't be able to delete a compute service that has instances on it because that corrupts our resource tracking system. If we extend that to any allocations held against providers for that compute service, then the fix might be as simple as not ignoring the ResourceProviderInUse error and fail if we can't delete the provider(s).
The question I'm struggling with is what does an operator do for the two cases mentioned above, not-yet-complete migrations and evacuated instances? For migrations, that seems pretty simple - wait for the migration to complete and confirm it (reverting a cold migration or resize would put the instance back on the compute service host you're trying to delete).
The nastier thing is the allocations tied to an evacuated instance since those don't get cleaned up until the compute service is restarted [7]. If the operator never intends on restarting that compute service and just wants to clear the data, then they have to manually delete the allocations for the resource providers associated with that host before they can delete the compute service, which kind of sucks.
What are our options?
- Don't delete the compute service if we can't cleanup all resource
providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
I'm personally in favor of this. I think that currently a lot of operators don't really think of the placement service much (or perhaps don't really know what it's doing).
There's a lack of transparency in the data that exists in that service, a lot of users will actually rely on the information fed by *nova* and not *placement*.
Because of this, I've seen a lot of deployments with stale placement records or issues with clouds where the hypervisors are not efficiently used because of a bunch of stale resource allocations that haven't been cleaned up (and counting on deployers watching logs for warnings.. eh)
I would be more in favor of failing a delete if it will cause the cloud to reach an inconsistent state than brute-forcing a delete that leaves you in a messy state where you need to log in to the database to unkludge things.
- Change delete_resource_provider cascade=True logic to remove all
allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
- Other things I'm not thinking of? Should we add a force parameter to
the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.
If we did remove allocations for an instance when deleting its compute service host, the operator should be able to get them back by running the "nova-manage placement heal_allocations" CLI - assuming they restart the compute service on that host. This would have to be tested of course.
Help me Obi-Wan Kenobi. You're my only hope.
[1] https://review.opendev.org/#/q/I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
[2] https://review.opendev.org/#/q/I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
[3] https://review.opendev.org/#/c/657016/
[4] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[5] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
[6] https://review.opendev.org/#/c/663737/
[7] https://github.com/openstack/nova/blob/cb0cfc90e1e03e82c42187ec60f46fb8fd590...
--
Thanks,
Matt
On Wed, 12 Jun 2019, Mohammed Naser wrote:
On Wed, Jun 12, 2019 at 4:44 PM Matt Riedemann mriedemos@gmail.com wrote:
- Don't delete the compute service if we can't cleanup all resource
providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
I'm personally in favor of this. I think that currently a lot of operators don't really think of the placement service much (or perhaps don't really know what it's doing).
There's a lack of transparency in the data that exists in that service, a lot of users will actually rely on the information fed by *nova* and not *placement*.
I agree, and this is part of why I prefer #2 over #1. For someone dealing with a deleted nova compute service, placement shouldn't be something they need to be all that concerned with. Nova should be mediating the interactions with placement to correct the model of reality that it is storing there. That's what option 2 is doing: fixing the model, from nova.
(Obviously this is an idealisation that we've not achieved, which is why I used that horrible word "should", but I do think it is something we should be striving towards.)
Please: https://en.wikipedia.org/wiki/Posting_style#Trimming_and_reformatting
/me scurries back to Usenet
On 6/12/2019 6:26 PM, Mohammed Naser wrote:
I would be more in favor of failing a delete if it will cause the cloud to reach an inconsistent state than brute-forcing a delete that leaves you in a messy state where you need to log in to the database to unkludge things.
I'm not sure that the cascading delete (option #2) case would leave things in a messy state since we'd delete the stuff that we're actually orphaning today. If we don't cascade delete for you and just let the request fail if there are still allocations (option #1), then like I said in a reply to zigo, there are APIs available to figure out what's still being used on the host and then clean those up - but that's the manual part I'm talking about since nova wouldn't be doing it for you.
On Wed, 12 Jun 2019, Matt Riedemann wrote:
- Change delete_resource_provider cascade=True logic to remove all
allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
I think this is likely the right choice. If the service is being deleted (not disabled) it shouldn't have a resource provider, and to not have a resource provider it needs to not have allocations. If the leftover allocations that it does have are either bogus now, or will be soon enough, we may as well get them gone in a consistent and predictable way.
That said, we shouldn't make a habit of removing allocations just so we can remove a resource provider whenever we want, only in special cases like this.
If/when we're modelling shared disk as a shared resource provider does this get any more complicated? Does the part of an allocation that is DISK_GB need special handling?
- Other things I'm not thinking of? Should we add a force parameter to the
API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.
I'm sort of maybe on this. A #1, with an option to inspect and then #2 seems friendly and potentially useful but how often is someone going to want to inspect versus just "whatevs, #2"? I don't know.
On 6/13/2019 4:04 AM, Chris Dent wrote:
If/when we're modelling shared disk as a shared resource provider does this get any more complicated? Does the part of an allocation that is DISK_GB need special handling?
Nova doesn't create nor manage shared resource providers today, so deleting the compute service and its related compute node(s) and their related resource provider(s) shouldn't have anything to do with a shared resource provider.
On Thu, 13 Jun 2019, Matt Riedemann wrote:
On 6/13/2019 4:04 AM, Chris Dent wrote:
If/when we're modelling shared disk as a shared resource provider does this get any more complicated? Does the part of an allocation that is DISK_GB need special handling?
Nova doesn't create nor manage shared resource providers today, so deleting the compute service and its related compute node(s) and their related resource provider(s) shouldn't have anything to do with a shared resource provider.
Yeah, "today". That's why I said "If/when". If we do start doing that, does that make things more complicated in a way we may wish to think about _now_ while we're designing today's solution?
I'd like to think that we can just ignore it for now and adapt as things change in the future, but we're all familiar with the way that everything is way more connected and twisted up in a scary hairy ball in nova than we'd all like.
On 6/12/2019 3:38 PM, Matt Riedemann wrote:
What are our options?
- Don't delete the compute service if we can't cleanup all resource
providers - make sure to not orphan any providers. Manual cleanup may be necessary by the operator.
- Change delete_resource_provider cascade=True logic to remove all
allocations for the provider before deleting it, i.e. for not-yet-complete migrations and evacuated instances. For the evacuated instance allocations this is likely OK since restarting the source compute service is going to do that cleanup anyway. Also, if you delete the source compute service during a migration, confirming or reverting the resize later will likely fail since we'd be casting to something that is gone (and we'd orphan those allocations). Maybe we need a functional recreate test for the unconfirmed migration scenario before deciding on this?
- Other things I'm not thinking of? Should we add a force parameter to
the API to allow the operator to forcefully delete (#2 above) if #1 fails? Force parameters are hacky and usually seem to cause more problems than they solve, but it does put the control in the operator's hands.
If we did remove allocations for an instance when deleting its compute service host, the operator should be able to get them back by running the "nova-manage placement heal_allocations" CLI - assuming they restart the compute service on that host. This would have to be tested of course.
After talking a bit about this in IRC today, I'm thinking about a phased approach to this problem with these changes in order:
1. Land [1] so we're at least trying to cleanup all providers for a given compute service (the ironic case).
2. Implement option #1 above where we fail to delete the compute service if any of the resource providers cannot be deleted. We'd have stuff in the logs about completing migrations and trying again, and failing that cleanup allocations for old evacuations. Rather than dump all of that info into the logs, it would probably be better to just write up a troubleshooting doc [2] for it and link to that from the logs, then the doc can reference APIs and CLIs to use for the cleanup scenarios.
3. Implement option #2 above where we cleanup allocations but only for evacuations - like the compute service would do when it's restarted anyway.
This would leave the case that we don't delete the compute service for allocations related to other types of migrations - in-progress or unconfirmed (or failed and leaked) migrations that would require operator investigation. We could build on that in the future if we wanted to toy with the idea of checking the service group API for whether or not the service is up or if we wanted to add a force option to just tell nova to fully cascade delete everything, but I don't really want to get hung up on those edge cases right now.
How do people feel about this plan?
[1] https://review.opendev.org/#/c/657016/
[2] https://docs.openstack.org/nova/latest/admin/support-compute.html
On 6/13/2019 1:45 PM, Matt Riedemann wrote:
- Implement option #1 above where we fail to delete the compute service
if any of the resource providers cannot be deleted. We'd have stuff in the logs about completing migrations and trying again, and failing that cleanup allocations for old evacuations. Rather than dump all of that info into the logs, it would probably be better to just write up a troubleshooting doc [2] for it and link to that from the logs, then the doc can reference APIs and CLIs to use for the cleanup scenarios.
It's been a couple of months but I finally got around to starting this [1]. There are several TODOs in there but I've updated the functional test to show we're no longer orphaning the resource provider. There are also questions about what to do if we hit this in the compute manager during an ironic node re-balance (different issue but it touches the same delete_resource_provider code). I haven't started on a troubleshooting doc yet since I'm waiting on the novaclient change [2] to land which will be part of that (a CLI to find certain types of migration records on the source compute).
[1] https://review.opendev.org/#/c/678100/
[2] https://review.opendev.org/#/c/675117/
participants (6): Chris Dent, Eric Fried, Matt Riedemann, Mohammed Naser, Sean Mooney, Thomas Goirand