[nova] track error migrations and orphans in Resource Tracker

melanie witt melwittt at gmail.com
Wed Nov 13 19:43:56 UTC 2019

On 11/12/19 05:18, Sean Mooney wrote:
> On Tue, 2019-11-12 at 05:46 +0000, Zhong, Luyao wrote:
>> Hi Nova experts,
>> "Not tracking error migrations and orphans in RT" is probably a bug. It may currently cause problems in
>> update_available_resources in the RT: some orphans or error migrations are using CPUs, memory, disk,
>> etc., but we don't take this usage into account. In addition, instance.resources was introduced in Train to hold
>> specific resources, and we track assigned specific resources in the RT based on tracked migrations and instances, so this
>> bug also affects specific-resource tracking.
>> I drafted a doc to clarify this bug and possible solutions:
>> https://etherpad.openstack.org/p/track-err-migr-and-orphans-in-RT
>> Looking forward to your suggestions. Thanks in advance.
> there are patches up to allow cleaning up orphan instances
> https://review.opendev.org/#/c/627765/
> https://review.opendev.org/#/c/648912/
> if we can get those merged that would address at least some of the problem

I just wanted to mention:

I have reviewed the cleanup patches ^ multiple times and I'm having a 
hard time getting past the fact that any way you slice it (AFAICT), the 
cleanup code will have a window where a valid guest could be destroyed 
erroneously (not an orphan). This is because the "get instance list by 
host" can miss instances that are mid-migration, because of how/where we 
update the instance.host field.
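
To illustrate the window, here is a toy sketch in plain Python. It is
not Nova code: the Instance and Migration classes, the status names,
and both helper functions are hypothetical, and the migration-aware
variant is only one possible mitigation, not what the patches under
review do.

```python
from dataclasses import dataclass

# Hypothetical status names for migrations that are still in flight.
IN_PROGRESS = {"pre-migrating", "migrating", "running"}


@dataclass
class Instance:
    uuid: str
    host: str  # host currently recorded for the instance in the database


@dataclass
class Migration:
    instance_uuid: str
    source: str
    dest: str
    status: str


def naive_orphans(host, guests_on_hypervisor, instances):
    """Guests on this hypervisor with no instance recorded on this host."""
    known = {i.uuid for i in instances if i.host == host}
    return set(guests_on_hypervisor) - known


def migration_aware_orphans(host, guests_on_hypervisor, instances, migrations):
    """Same as naive_orphans, but skip guests with an in-flight migration
    that involves this host as source or destination."""
    in_flight = {
        m.instance_uuid
        for m in migrations
        if m.status in IN_PROGRESS and host in (m.source, m.dest)
    }
    return naive_orphans(host, guests_on_hypervisor, instances) - in_flight


# Instance "a" is mid-migration: its guest already exists on "dest",
# but the recorded host is still "src". "zzz" is a genuine orphan.
instances = [Instance("a", "src"), Instance("b", "dest")]
migrations = [Migration("a", "src", "dest", "running")]
guests_on_dest = ["a", "b", "zzz"]

naive = naive_orphans("dest", guests_on_dest, instances)
aware = migration_aware_orphans("dest", guests_on_dest, instances, migrations)
print(sorted(naive), sorted(aware))
```

In this toy model the by-host lookup alone flags the valid guest "a" as
an orphan. Consulting in-flight migrations narrows the window, but it
may not close it, since the migration records and the guest list are
still read at different times.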

Maybe this ^ could be acceptable (?) if we put a big fat warning on the 
config option help for 'reap_unknown'. But I was unsure of the answers 
about what recovery looks like in case a guest is erroneously destroyed 
for an instance that is in the middle of migrating. In the case of 
resize or cold migrate, a hard reboot would fix it AFAIK. What about for 
a live migration? If recovery is possible in every case, those would 
also need to be documented in the config option help for 'reap_unknown'.

The patch has lots of complexities to think about and I'm left wondering 
if the pitfalls are better or worse than the current state. It would 
help if others joined in the review with their thoughts about it.

