[nova] track error migrations and orphans in Resource Tracker

Luyao Zhong luyao.zhong at intel.com
Thu Nov 14 02:33:18 UTC 2019

On 2019/11/14 3:43 AM, melanie witt wrote:
> On 11/12/19 05:18, Sean Mooney wrote:
>> On Tue, 2019-11-12 at 05:46 +0000, Zhong, Luyao wrote:
>>> Hi Nova experts,
>>> "Not tracking error migrations and orphans in RT." is probably a bug.
>>> It may currently cause problems in update_available_resources in RT.
>>> That is, some orphans or error migrations are using CPUs/memory/disk
>>> etc., but we don't take this usage into consideration. Also,
>>> instance.resources was introduced in Train to hold specific
>>> resources, and we track assigned specific resources in RT based on
>>> tracked migrations and instances. So this bug will also affect the
>>> tracking of specific resources.
>>> I drafted a doc to clarify this bug and possible solutions:
>>> https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT
>>> Looking forward to suggestions from you. Thanks in advance.
>> There are patches up to allow cleaning up orphan instances:
>> https://review.opendev.org/#/c/627765/
>> https://review.opendev.org/#/c/648912/
>> If we can get those merged, that would address at least some of the
>> problem.
> I just wanted to mention:
> I have reviewed the cleanup patches ^ multiple times and I'm having a 
> hard time getting past the fact that any way you slice it (AFAICT), the 
> cleanup code will have a window where a valid guest could be destroyed 
> erroneously (not an orphan). This is because the "get instance list by 
> host" can miss instances that are mid-migration, because of how/where we 
> update the instance.host field.
> Maybe this ^ could be acceptable (?) if we put a big fat warning on the 
> config option help for 'reap_unknown'. But I was unsure of the answers 
> about what recovery looks like in case a guest is erroneously destroyed 
> for an instance that is in the middle of migrating. In the case of 
> resize or cold migrate, a hard reboot would fix it AFAIK. What about for 
> a live migration? If recovery is possible in every case, those would 
> also need to be documented in the config option help for 'reap_unknown'.
> The patch has lots of complexities to think about and I'm left wondering 
> if the pitfalls are better or worse than the current state. It would 
> help if others joined in the review with their thoughts about it.
> -melanie

Hi Sean Mooney and melanie, thanks for mentioning this.
This ^ is for cleaning up orphans. For incomplete migrations, you prefer
not to destroy them, right? I'm not sure about it either. But I gave a
possible solution on the etherpad (set instance.host, apply/revert the
migration context, and then invoke cleanup_running_deleted_instances to
clean up the instance).
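
To make the first two steps of that idea a bit more concrete, here is a
rough sketch. It is only a toy model: the classes and the function
settle_error_migration are hypothetical stand-ins, not nova objects or
nova APIs, and the rule "guest found on the destination means apply,
guest found on the source means revert" is my assumption. The final
step (invoking cleanup_running_deleted_instances) is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


# Toy stand-ins for illustration only; these are NOT nova objects.
@dataclass
class MigrationContext:
    old_resources: dict
    new_resources: dict


@dataclass
class Instance:
    uuid: str
    host: str
    resources: dict
    migration_context: Optional[MigrationContext] = None


@dataclass
class Migration:
    instance_uuid: str
    source_compute: str
    dest_compute: str
    status: str


def settle_error_migration(instance, migration, guest_host):
    """Pin an instance to the host where its guest was actually found.

    Hypothetical helper: guest_host is assumed to come from asking the
    hypervisors which one still runs the guest.
    """
    ctx = instance.migration_context
    if migration.status != "error" or ctx is None:
        return instance  # nothing to settle
    if guest_host == migration.dest_compute:
        # Guest ended up on the destination: apply the migration context.
        instance.resources = ctx.new_resources
    elif guest_host == migration.source_compute:
        # Guest is still on the source: revert the migration context.
        instance.resources = ctx.old_resources
    else:
        raise ValueError("guest found on a host unrelated to the migration")
    instance.host = guest_host
    instance.migration_context = None
    return instance
```

Once instance.host and the resources agree with where the guest really
is, the existing cleanup path has a consistent record to work from.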

And before the cleanup is done, we need to track these
instances/migrations in RT. We need more people to join the discussion.
You are welcome to put your suggestions on the etherpad:
https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT
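
As a small illustration of why the tracking matters, the sketch below
sums usage with and without error migrations and orphans. It is a toy
model under my own assumptions: Usage, total_usage and free_vcpus are
hypothetical names and do not correspond to the real RT code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Resources consumed by one guest (hypothetical, not a nova type)."""
    vcpus: int = 0
    memory_mb: int = 0
    disk_gb: int = 0

    def __add__(self, other):
        return Usage(
            self.vcpus + other.vcpus,
            self.memory_mb + other.memory_mb,
            self.disk_gb + other.disk_gb,
        )


def total_usage(tracked, error_migrations=(), orphans=()):
    """Sum usage over tracked guests plus any extra consumers passed in."""
    total = Usage()
    for usage in (*tracked, *error_migrations, *orphans):
        total = total + usage
    return total


def free_vcpus(capacity_vcpus, used):
    """vCPUs the host would report as free for a given usage total."""
    return capacity_vcpus - used.vcpus


tracked = [Usage(vcpus=4, memory_mb=8192, disk_gb=40)]
error_migrations = [Usage(vcpus=2, memory_mb=4096, disk_gb=20)]
orphans = [Usage(vcpus=2, memory_mb=2048, disk_gb=10)]

# Current behaviour in this toy model: only tracked guests are counted.
reported = total_usage(tracked)
# What is really consumed on the host once the others are included.
actual = total_usage(tracked, error_migrations, orphans)
```

On an 8-vCPU host, the first total leaves 4 vCPUs looking free while the
second leaves none, so resources could be handed out that are already
in use.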

Thanks in advance.


