[nova] track error migrations and orphans in Resource Tracker
Hi Nova experts, "Not tracking error migrations and orphans in RT." is probably a bug. This may trigger some problems in update_available_resources in RT at the moment. That is some orphans or error migrations are using cpus/memory/disk etc, but we don't take these usage into consideration. And instance.resources is introduced from Train used to contain specific resources, we also track assigned specific resources in RT based on tracked migrations and instances. So this bug will also affect the specific resources tracking. I draft an doc to clarify this bug and possible solutions: https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT Looking forward to suggestions from you. Thanks in advance. Best Regards, Luyao
On Tue, 2019-11-12 at 05:46 +0000, Zhong, Luyao wrote:
Hi Nova experts,
"Not tracking error migrations and orphans in RT." is probably a bug. This may trigger some problems in update_available_resources in RT at the moment. That is some orphans or error migrations are using cpus/memory/disk etc, but we don't take these usage into consideration. And instance.resources is introduced from Train used to contain specific resources, we also track assigned specific resources in RT based on tracked migrations and instances. So this bug will also affect the specific resources tracking.
I drafted a doc to clarify this bug and possible solutions: https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT Looking forward to your suggestions. Thanks in advance.
There are patches up to allow cleaning up orphan instances: https://review.opendev.org/#/c/627765/ https://review.opendev.org/#/c/648912/ If we can get those merged, that would address at least some of the problem.
Best Regards, Luyao
Sean Mooney <smooney@redhat.com> wrote on Tue, Nov 12, 2019 at 9:27 PM:
On Tue, 2019-11-12 at 05:46 +0000, Zhong, Luyao wrote:
Hi Nova experts,
"Not tracking error migrations and orphans in the RT" is probably a bug. It may be causing problems in update_available_resource in the RT right now: some orphans or error migrations are using CPUs/memory/disk etc., but we don't take that usage into account. In addition, instance.resources, introduced in Train, is used to contain specific resources, and we track assigned specific resources in the RT based on the tracked migrations and instances, so this bug will also affect specific resource tracking.
I drafted a doc to clarify this bug and possible solutions: https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT Looking forward to your suggestions. Thanks in advance.
There are patches up to allow cleaning up orphan instances: https://review.opendev.org/#/c/627765/ https://review.opendev.org/#/c/648912/ If we can get those merged, that would address at least some of the problem.
Yes, and we separate the issue into two parts: one part is tracking, the other is cleanup. Yongli's patch will help with the cleanup.
Best Regards, Luyao
On 11/12/19 05:18, Sean Mooney wrote:
On Tue, 2019-11-12 at 05:46 +0000, Zhong, Luyao wrote:
Hi Nova experts,
"Not tracking error migrations and orphans in RT." is probably a bug. This may trigger some problems in update_available_resources in RT at the moment. That is some orphans or error migrations are using cpus/memory/disk etc, but we don't take these usage into consideration. And instance.resources is introduced from Train used to contain specific resources, we also track assigned specific resources in RT based on tracked migrations and instances. So this bug will also affect the specific resources tracking.
I drafted a doc to clarify this bug and possible solutions: https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT Looking forward to your suggestions. Thanks in advance.
There are patches up to allow cleaning up orphan instances: https://review.opendev.org/#/c/627765/ https://review.opendev.org/#/c/648912/ If we can get those merged, that would address at least some of the problem.
I just wanted to mention:

I have reviewed the cleanup patches ^ multiple times and I'm having a hard time getting past the fact that any way you slice it (AFAICT), the cleanup code will have a window where a valid guest could be destroyed erroneously (not an orphan). This is because the "get instance list by host" can miss instances that are mid-migration, because of how/where we update the instance.host field.

Maybe this ^ could be acceptable (?) if we put a big fat warning on the config option help for 'reap_unknown'. But I was unsure of the answers about what recovery looks like in case a guest is erroneously destroyed for an instance that is in the middle of migrating. In the case of resize or cold migrate, a hard reboot would fix it AFAIK. What about for a live migration? If recovery is possible in every case, those would also need to be documented in the config option help for 'reap_unknown'.

The patch has lots of complexities to think about and I'm left wondering if the pitfalls are better or worse than the current state. It would help if others joined in the review with their thoughts about it.

-melanie
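For anyone skimming, a rough illustration of that window follows; it is not the actual patch code, and the function and field names are assumptions.

def find_candidate_orphans(hypervisor_guest_uuids, db_instances_on_host):
    """Guests running locally that the DB does not currently map to this host."""
    known = {inst['uuid'] for inst in db_instances_on_host}
    return [uuid for uuid in hypervisor_guest_uuids if uuid not in known]

# During a migration there is a window where instance.host already (or
# still) points at the *other* compute node, so "get instance list by
# host" on this node does not return that instance. A reap that runs
# inside that window would classify a valid, mid-migration guest as an
# orphan and destroy it.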
On 11/13/2019 1:43 PM, melanie witt wrote:
This is because the "get instance list by host" can miss instances that are mid-migration, because of how/where we update the instance.host field.
Why not just filter out any instances that have a non-None task_state? Or barring that, filter out any instances that have an in-progress migration (there is a method that the ResourceTracker uses to get those kinds of migrations occurring either as incoming to or outgoing from the host). -- Thanks, Matt
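A rough sketch of what those two filters could look like follows; the helper and field names are assumptions, not exact Nova method names.

def may_reap(guest_uuid, db_instances_by_uuid, in_progress_migrations):
    """Only consider a local guest for reaping if nothing is acting on it."""
    inst = db_instances_by_uuid.get(guest_uuid)
    if inst is not None and inst['task_state'] is not None:
        # A task (e.g. a migration) is in flight; leave the guest alone.
        return False
    migrating = {m['instance_uuid'] for m in in_progress_migrations}
    if guest_uuid in migrating:
        # Covered by an incoming or outgoing migration on this host.
        return False
    return True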
On 11/13/19 11:53, Matt Riedemann wrote:
On 11/13/2019 1:43 PM, melanie witt wrote:
This is because the "get instance list by host" can miss instances that are mid-migration, because of how/where we update the instance.host field.
Why not just filter out any instances that have a non-None task_state? Or barring that, filter out any instances that have an in-progress migration (there is a method that the ResourceTracker uses to get those kinds of migrations occurring either as incoming to or outgoing from the host).
Yeah, an earlier version of the patch was trying to do that: https://review.opendev.org/#/c/627765/36/nova/compute/manager.py@8455 but it was not a complete list of all the possible intermediate migrating task states. We didn't know about the method the resource tracker is already using for the same purpose, which we could re-use. After some confusion on my part, we removed the task_state checks, and now I see we need to put them back. I'll find the RT method and comment on the review. Thanks for mentioning that.

-melanie
On 11/13/2019 3:21 PM, melanie witt wrote:
I'll find the RT method and comment on the review.
https://github.com/openstack/nova/blob/1c7a3d59080e5de50615bd2408b10d372ec30... -- Thanks, Matt
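Presumably this is the in-progress migration query the resource tracker already runs when updating available resources; a hedged sketch (the exact method and signature may differ by release):

from nova import objects

def migrations_touching_host(context, host, nodename):
    # Migrations incoming to or outgoing from this host; the cleanup path
    # could use the covered instance UUIDs to skip mid-migration guests
    # when deciding what is an orphan.
    return objects.MigrationList.get_in_progress_by_host_and_node(
        context, host, nodename)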
On 2019/11/14 3:43 AM, melanie witt wrote:
On 11/12/19 05:18, Sean Mooney wrote:
On Tue, 2019-11-12 at 05:46 +0000, Zhong, Luyao wrote:
Hi Nova experts,
"Not tracking error migrations and orphans in RT." is probably a bug. This may trigger some problems in update_available_resources in RT at the moment. That is some orphans or error migrations are using cpus/memory/disk etc, but we don't take these usage into consideration. And instance.resources is introduced from Train used to contain specific resources, we also track assigned specific resources in RT based on tracked migrations and instances. So this bug will also affect the specific resources tracking.
I drafted a doc to clarify this bug and possible solutions: https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT Looking forward to your suggestions. Thanks in advance.
There are patches up to allow cleaning up orphan instances: https://review.opendev.org/#/c/627765/ https://review.opendev.org/#/c/648912/ If we can get those merged, that would address at least some of the problem.
I just wanted to mention:
I have reviewed the cleanup patches ^ multiple times and I'm having a hard time getting past the fact that any way you slice it (AFAICT), the cleanup code will have a window where a valid guest could be destroyed erroneously (not an orphan). This is because the "get instance list by host" can miss instances that are mid-migration, because of how/where we update the instance.host field.
Maybe this ^ could be acceptable (?) if we put a big fat warning on the config option help for 'reap_unknown'. But I was unsure of the answers about what recovery looks like in case a guest is erroneously destroyed for an instance that is in the middle of migrating. In the case of resize or cold migrate, a hard reboot would fix it AFAIK. What about for a live migration? If recovery is possible in every case, those would also need to be documented in the config option help for 'reap_unknown'.
The patch has lots of complexities to think about and I'm left wondering if the pitfalls are better or worse than the current state. It would help if others joined in the review with their thoughts about it.
-melanie
Hi Sean Mooney and melanie, thanks for mentioning this. This ^ is for cleaning up orphans. For incomplete migrations, you prefer not destroying them, right? I'm not sure about it either. But I gave a possible solution on the etherpad (set instance.host and apply/revert the migration context, and then invoke cleanup_running_deleted_instances to clean up the instance). And before the cleanup is done, we need to track these instances/migrations in the RT. We need more people to join our discussion; please put your suggestions on the etherpad: https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT. Thanks in advance.

BR,
Luyao
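A rough sketch of that etherpad idea follows, with the caveat that the helper names and exact ordering are assumptions rather than a settled design.

def resolve_incomplete_migration(instance, migration):
    # 1) Repoint the instance record at the host that actually owns the
    #    guest (source or destination, depending on how far the migration
    #    got before failing).
    instance.host = migration.dest_compute
    # 2) Apply or revert the migration context so the tracked resources
    #    (NUMA topology, instance.resources, ...) match that host.
    instance.apply_migration_context()  # or instance.revert_migration_context()
    instance.save()
    # 3) Leave the leftover guest on the other host to the existing
    #    periodic cleanup (cleanup_running_deleted_instances); until that
    #    runs, the RT keeps tracking the instance/migration.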
participants (6)
- Alex Xu
- Luyao Zhong
- Matt Riedemann
- melanie witt
- Sean Mooney
- Zhong, Luyao