[nova] The pros/cons for libvirt persistent assignment and DB persistent assignment.
We have had a lot of discussion on how to do the claim for vpmem. There are a few points we are trying to meet: * Avoid race problems. (The current VGPU assignment has been found to have a race issue: https://launchpad.net/bugs/1836204) * Avoid making the device assignment management virt-driver- and platform-specific. * Keep it simple.

We have gone through two solutions so far. This email summarizes the pros/cons of these two solutions.

#1 No Nova DB persistence for the assignment info; depend on the hypervisor to persist it.

The idea is to add a VirtDriver.claim/unclaim_for_instance(instance_uuid, flavor_id) interface. The assignment info is populated from the hypervisor when nova-compute starts up and is kept in memory in the virt driver. The instance_uuid is used to distinguish claims from different instances. The flavor_id is used for same-host resize, to distinguish the claims for the source and the target. This virt driver method is invoked inside the ResourceTracker to avoid the race problem. There is no nova DB persistence for the assignment info at all (a rough sketch of this interface follows this email). https://review.opendev.org/#/q/status:open+project:openstack/nova+branch:mas...

pros: * Hides all the device detail and virt driver detail inside the virt driver. * Fewer upgrade issues in the future since it doesn't involve any nova DB model change. * Expected to be a simple implementation since everything lives inside the virt driver.

cons: * Two cases have been found where the domain XML is lost with the libvirt virt driver, and we don't know other hypervisors' behavior yet. * For same-host resize, the source and target instance share a single domain XML. After the libvirt virt driver updates the domain XML for the target instance, the source instance's assignment information is lost if a nova-compute restart happens. That means the resized instance can't be reverted; the only choice for the user is to confirm the resize. * For live migration, the target host's domain XML will be cleaned up by libvirt after a host restart. The assignment information is lost before nova-compute starts up and does its cleanup. * Cannot support same-host cold migration, since we need a way to distinguish the source and target instance's assignments in memory, but same-host cold migration means the same instance UUID and same flavor ID, so there is nothing else that can be used to distinguish the assignments. * With workarounds added for the above points, the code becomes fragile.

#2 Nova DB persistence, using a virt-driver-specific blob to store virt-driver-specific info.

The idea is to persist the assignment for the instance into the DB. The resource tracker gets the available resources from the virt driver, and calculates free resources on the fly from the available resources and the assigned resources recorded in the instance DB. The new field instance.resources is designed to support virt-driver-specific metadata, hiding the virt driver and platform detail from the RT. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new

pros: * Persists the assignment in the instance object, avoiding the corner cases where we lose the assignment. * The ResourceTracker is responsible for doing the claim. This is more reliable and has no race problem, since the ResourceTracker has worked well for a long time. * The virt-driver-specific json blob hides the virt driver/platform detail from the ResourceTracker. * The free resource is calculated on the fly, keeping the implementation simple.
(The RT just provides a point at which to do the claim; it doesn't need to involve the complexity of RT.update_available_resources.)

cons: * Unlike the PCIManager, it doesn't have both instance-side and host-side persistent info; the on-the-fly calculation has to take care of orphaned instances (instances deleted from the DB but still existing on the host), so it isn't an unresolvable issue, and it wouldn't be too hard to add host-side persistent info in the future if we want it. * It is a data model change from the original proposal, and review is needed to decide whether the data model is generic enough.

Currently, Sean, Eric and I prefer #2, since #1 has flaws for same-host resize and live migration that can't be avoided by design.

Looking for more feedback; it will be appreciated!

Thanks
Alex
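For illustration, here is a minimal, hypothetical sketch of what the proposal #1 interface could look like. Only claim/unclaim_for_instance(instance_uuid, flavor_id) comes from the proposal above; the class name, helpers and device names are assumptions, not the code under review.

    # Hypothetical sketch of proposal #1: claims live only in virt-driver
    # memory and are rebuilt from the hypervisor at nova-compute startup.
    class InMemoryVPMEMClaims(object):

        def __init__(self, all_devices):
            # e.g. {'ns_0', 'ns_1', 'ns_2'} as discovered by the virt driver.
            self._all_devices = set(all_devices)
            # (instance_uuid, flavor_id) -> set of assigned device names.
            self._claims = {}

        def populate_from_hypervisor(self, existing_guests):
            # Rebuild the in-memory state at startup, e.g. by parsing the
            # vpmem devices out of the persisted libvirt domain XMLs.
            for instance_uuid, flavor_id, devices in existing_guests:
                self._claims[(instance_uuid, flavor_id)] = set(devices)

        def claim_for_instance(self, instance_uuid, flavor_id, count):
            assigned = set()
            for devices in self._claims.values():
                assigned |= devices
            free = sorted(self._all_devices - assigned)
            if len(free) < count:
                raise RuntimeError('not enough free vpmem devices')
            # The (uuid, flavor_id) key distinguishes source and target for a
            # same-host resize, but collides for same-host cold migration:
            # same instance UUID *and* same flavor ID.
            self._claims[(instance_uuid, flavor_id)] = set(free[:count])
            return set(free[:count])

        def unclaim_for_instance(self, instance_uuid, flavor_id):
            self._claims.pop((instance_uuid, flavor_id), None)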
Alex- Thanks for writing this up.
#1 No Nova DB persistence for the assignment info; depend on the hypervisor to persist it.
I liked the "no persistence" option in theory, but it unfortunately turned out to be too brittle when it came to the corner cases.
#2 Nova DB persistence, using a virt-driver-specific blob to store virt-driver-specific info.
The idea is to persist the assignment for the instance into the DB. The resource tracker gets the available resources from the virt driver, and calculates free resources on the fly from the available resources and the assigned resources recorded in the instance DB. The new field instance.resources is designed to support virt-driver-specific metadata, hiding the virt driver and platform detail from the RT. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new
I just took a closer look at this, and I really like it.

Persisting local resource information with the Instance and MigrationContext objects ensures we don't lose it in weird corner cases, regardless of a specific hypervisor's "persistence model" (e.g. domain XML for libvirt).

MigrationContext is already being used for this old_* new_* concept - but the existing fields are hypervisor-specific (numa and pci). Storing this information in a generic, opaque-outside-of-virt way means we're not constantly bolting hypervisor-specific fields onto what *should* be non-hypervisor-specific objects.

As you've stated in the etherpad, this framework sets us up nicely to start transitioning existing PCI/NUMA-isms over to a Placement-driven model in the near future. Having the virt driver report provider tree (placement-specific) and "real" (hypervisor-specific) resource information at the same time makes all kinds of sense.

So, quite aside from solving the stated race condition and enabling vpmem, all of this is excellent movement toward the "generic device (resource) management" we've been talking about for years.

Let's make it so.

efried
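To make the "opaque outside of virt" idea concrete, here is a minimal sketch of a versioned object the ResourceTracker could carry without interpreting the driver-specific part, and which dumps cleanly to a JSON blob. This is not the final nova object model; the class, field and resource class names only follow the direction of the etherpad and are assumptions.

    import uuid

    from oslo_serialization import jsonutils
    from oslo_versionedobjects import base as ovo_base
    from oslo_versionedobjects import fields


    @ovo_base.VersionedObjectRegistry.register
    class VPMEMMetadata(ovo_base.VersionedObject):
        # Virt-driver/platform-specific part; only the libvirt driver needs
        # to understand these fields.
        VERSION = '1.0'
        fields = {
            'devpath': fields.StringField(),
            'size_mb': fields.IntegerField(),
        }


    @ovo_base.VersionedObjectRegistry.register
    class Resource(ovo_base.VersionedObject):
        # Generic part; the ResourceTracker only ever looks at these fields.
        VERSION = '1.0'
        fields = {
            'provider_uuid': fields.UUIDField(),
            'resource_class': fields.StringField(),
            'identifier': fields.StringField(),
            'metadata': fields.ObjectField('VPMEMMetadata', nullable=True),
        }


    res = Resource(provider_uuid=str(uuid.uuid4()),
                   resource_class='CUSTOM_PMEM_NAMESPACE_4GB',  # illustrative
                   identifier='ns_0',
                   metadata=VPMEMMetadata(devpath='/dev/dax0.0', size_mb=4096))
    # A versioned, driver-opaque primitive that can be persisted with the
    # Instance (or as old_*/new_* on the MigrationContext) as a JSON blob.
    print(jsonutils.dumps(res.obj_to_primitive()))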
On Wed, 2019-08-21 at 14:51 -0500, Eric Fried wrote:
Alex-
Thanks for writing this up.
#1 No Nova DB persistence for the assignment info; depend on the hypervisor to persist it.
I liked the "no persistence" option in theory, but it unfortunately turned out to be too brittle when it came to the corner cases.
#2 Nova DB persistence, using a virt-driver-specific blob to store virt-driver-specific info.
The idea is to persist the assignment for the instance into the DB. The resource tracker gets the available resources from the virt driver, and calculates free resources on the fly from the available resources and the assigned resources recorded in the instance DB. The new field instance.resources is designed to support virt-driver-specific metadata, hiding the virt driver and platform detail from the RT. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new
I just took a closer look at this, and I really like it.
Persisting local resource information with the Instance and MigrationContext objects ensures we don't lose it in weird corner cases, regardless of a specific hypervisor's "persistence model" (e.g. domain XML for libvirt).
MigrationContext is already being used for this old_* new_* concept - but the existing fields are hypervisor-specific (numa and pci).
Storing this information in a generic, opaque-outside-of-virt way means we're not constantly bolting hypervisor-specific fields onto what *should* be non-hypervisor-specific objects.
As you've stated in the etherpad, this framework sets us up nicely to start transitioning existing PCI/NUMA-isms over to a Placement-driven model in the near future.
Having the virt driver report provider tree (placement-specific) and "real" (hypervisor-specific) resource information at the same time makes all kinds of sense.
So, quite aside from solving the stated race condition and enabling vpmem, all of this is excellent movement toward the "generic device (resource) management" we've been talking about for years.
Let's make it so.

I agree with most of what Eric said above and with the content of the etherpad. I left a couple of comments inline, but I also don't want to pollute it too much with comments, so I will summarise some additional thoughts here.
tl;dr: I think this would allow us to converge the tracking and assignment of vPMEM, vGPUs, PCI devices, and pCPUs. Each of these resources requires nova to assign specific host devices, and with this proposal that can be done generically in some cases and delegated to the driver in others. The simple vPMEM use of this is relatively self-contained, but using it for the other resources will require thought and work to enable.

More detailed thoughts:

1.) Short term, I believe host-side tracking is not needed for vPMEM or vGPU.
2.) Medium term, having host-side tracking of resources might simplify vPMEM, vGPU, pCPU and PCI tracking.
3.) Long term, I think if we use placement correctly and have instance-level tracking we might not need host-side tracking at all.
3.a) Instance-level tracking will allow us to reliably compute the host-side view in the virt driver from config and device discovery.
3.b) With nested resource providers and the new ability to do nested queries we can move filtering mostly to placement.
4.) We use a host/instance NUMA topology blob for mempages (hugepages) today; if we model them in placement I don't think we will need host-side tracking for filtering. (See the note on weighing later.)
4.a) If we have pCPUs and mempages as children of cache regions or NUMA nodes, we can do NUMA/cache affinity of those resources and of PCI devices using same_subtree, or whatever it ended up being called in placement.
4.b) Hugepages are currently not assigned by nova; we just do a tally count of how many of a given size are free on each NUMA node and select a NUMA node, which I think can be done entirely via placement as of about 5-6 weeks ago. The assignment is done by the kernel, which is why we don't need to track individual hugepages at the host level (see the sketch after this message).
5.) If we don't have host-side tracking we cannot do classic weighing of local resources, as we do not have the data.
6.) If we pass allocation candidates to the filters instead of hosts, we can replace our existing filters with placement-aware filters that use the placement tree structure and traits to weigh the possible allocation candidates, which will in turn weigh the hosts.
7.) pCPUs, unlike hugepages, are assigned by nova and would need to be tracked in memory at the host level. This host view could be computed by the virt driver if we track the assignment in the instances and migrations, but host-side tracking would be simpler to port the existing code to. pCPUs would need to be assigned within the driver from the free resources returned by the resource tracker.
8.) This might move some of the logic from nova/virt/hardware.py to the libvirt driver, where it probably should always have been.
8.a) The validation of flavor extra specs in nova/virt/hardware.py that is used in the API would not be moved to the driver.

regards
sean
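As a toy illustration of point 4.b (not nova code; the data layout is assumed): for hugepages, nova only needs a per-NUMA-node tally of free pages of the requested size to pick a node, because the kernel assigns the individual pages.

    # Toy illustration of the hugepage "tally count" approach: pick a NUMA
    # node with enough free pages of the requested size; no per-page tracking.
    def pick_numa_node(host_pages, page_size_kb, pages_needed):
        """host_pages: {numa_node_id: {page_size_kb: (total, used)}}"""
        for node_id in sorted(host_pages):
            total, used = host_pages[node_id].get(page_size_kb, (0, 0))
            if total - used >= pages_needed:
                return node_id
        return None


    # Two NUMA nodes with 2 MiB pages; a 512 MiB guest needs 256 pages.
    host = {0: {2048: (512, 500)}, 1: {2048: (512, 100)}}
    print(pick_numa_node(host, 2048, 256))  # -> 1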
On 8/21/2019 1:59 AM, Alex Xu wrote:
We have had a lot of discussion on how to do the claim for vpmem. There are a few points we are trying to meet:
* Avoid race problems. (The current VGPU assignment has been found to have a race issue: https://launchpad.net/bugs/1836204) * Avoid making the device assignment management virt-driver- and platform-specific. * Keep it simple.
We have gone through two solutions so far. This email summarizes the pros/cons of these two solutions.
#1 No Nova DB persistence for the assignment info; depend on the hypervisor to persist it.
The idea is to add a VirtDriver.claim/unclaim_for_instance(instance_uuid, flavor_id) interface. The assignment info is populated from the hypervisor when nova-compute starts up and is kept in memory in the virt driver. The
Is there any reason the device assignment in-memory mapping has to be in the virt driver and not, for example, the ResourceTracker itself? This becomes important below.
instance_uuid is used to distinguish claims from different instances. The flavor_id is used for same-host resize, to distinguish the claims for the source and the target. This virt driver method is invoked inside the ResourceTracker to avoid the race problem. There is no nova DB persistence for the assignment info at all. https://review.opendev.org/#/q/status:open+project:openstack/nova+branch:mas...
pros: * Hides all the device detail and virt driver detail inside the virt driver. * Fewer upgrade issues in the future since it doesn't involve any nova DB model change. * Expected to be a simple implementation since everything lives inside the virt driver. cons: * Two cases have been found where the domain XML is lost with the libvirt virt driver, and we don't know other hypervisors' behavior yet.
How do we "lose" the domain xml? I guess your next points are examples?
* For same-host resize, the source and target instance share a single domain XML. After the libvirt virt driver updates the domain XML for the target instance, the source instance's assignment information is lost if a nova-compute restart happens. That means the resized instance can't be reverted; the only choice for the user is to confirm the resize.
As discussed with Dan and me in IRC a week or two ago, we suggested you could do the same migration-based allocation switch for move operations as we do for cold migrate, resize and live migration since Queens, where the source node allocations are consumed by the migration record and the target node allocations are consumed by the instance. The conductor swaps the source node allocations before calling the scheduler, which will create the target node allocations with the instance. On confirm/revert we either drop the source node allocations (held by the migration) or swap them back (and drop the target node allocations held by the instance).

In your device case, clearly conductor and placement aren't involved since we're not tracking those low-level details in placement. Placement just knows there is a certain amount of some resource class, but not which consumers are actually assigned which devices on the hypervisor (like pci device management). But as far as keeping track of the assignments in memory, we could still do the same swap, where the migration record is tracking the old flavor device assignments (in the virt driver or resource tracker) and the instance record is tracking the new flavor device assignments. That resolves the same-host resize case, correct? Doing it generically in the ResourceTracker is why I asked above about doing that in the RT rather than the driver.

What that doesn't solve is restarts of the compute service while there is a pending resize, which is why we need to persist some information somewhere. We could use the domain xml if it contained the flavor id, but it doesn't - and for same-host resize we only have one domain xml so that's not really an option (as you've noted).
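For reference, a plain-Python illustration (not nova or placement code) of the swap described above: allocations are keyed by consumer UUID, the migration record holds the source-node allocations during the move, and confirm/revert either drops them or swaps them back. The same pattern could track in-memory device assignments, with the migration holding the old flavor's devices and the instance the new flavor's.

    # Allocations keyed by consumer UUID (instance or migration), standing in
    # for placement allocations or in-memory device assignments.
    allocations = {}  # consumer_uuid -> {node: {resource_class: amount}}


    def start_move(instance_uuid, migration_uuid, target_node, new_resources):
        # Conductor: the migration record takes over the source-node
        # allocations; the scheduler then creates target-node allocations
        # held by the instance.
        allocations[migration_uuid] = allocations.pop(instance_uuid)
        allocations[instance_uuid] = {target_node: new_resources}


    def confirm(instance_uuid, migration_uuid):
        # Drop the source-node allocations held by the migration.
        allocations.pop(migration_uuid, None)


    def revert(instance_uuid, migration_uuid):
        # Swap the source-node allocations back to the instance, dropping the
        # target-node allocations it held.
        allocations[instance_uuid] = allocations.pop(migration_uuid)


    allocations['inst-1'] = {'src-node': {'VCPU': 2}}
    start_move('inst-1', 'mig-1', 'dst-node', {'VCPU': 4})
    revert('inst-1', 'mig-1')
    assert allocations == {'inst-1': {'src-node': {'VCPU': 2}}}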
* For live migration, the target host's domain XML will be cleaned up by libvirt after a host restart. The assignment information is lost before nova-compute starts up and does its cleanup.
I'm not really following you here. This is not an expected situation, correct? Meaning the target compute service is restarted while there is an in-progress live migration? I imagine if that happens we have lots of problems and most (manual) recovery procedures are going to involve the operator trying to destroy the guest and its related resources from the target host and hard rebooting to recover the guest on the source host.
* Cannot support same-host cold migration, since we need a way to distinguish the source and target instance's assignments in memory, but same-host cold migration means the same instance UUID and same flavor ID, so there is nothing else that can be used to distinguish the assignments.
The only in-tree virt driver that supports cold migrating on the same compute service host is the vmware driver, and that does not support things like VGPUs or VPMEMs, so I'm not sure why cold migration on the same host is a concern here - it's not supported and no one is working on adding that support.
* With workarounds added for the above points, the code becomes fragile.
To summarize, it sounds like the biggest problem is the lack of persistence during a same-host resize, because we'd lose the in-memory device assignment tracking even if we did the migration-based allocation swap magic as described above.

Could we have a compromise where for all times *except* during some migration, we get the assigned devices from the hypervisor, but during a migration we store the old/new assignments in the MigrationContext? That would give us the persistence we need and would only be something that we temporarily care about during a migration. The thing I'm not sure about is if we do that, does it make things more complicated in general for the non-migration cases, or if we do it should we just go the extra mile and always be tracking assigned devices in the database exactly like what we do for PCI devices today - meaning we wouldn't have a special edge case just for migrations with these types of resources.
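A minimal sketch of that compromise (attribute and function names are assumptions; today's MigrationContext only carries hypervisor-specific old_*/new_* fields for NUMA and PCI): read assignments from the hypervisor in the steady state, but while a migration is pending use the old/new assignments persisted with the migration context.

    def device_assignments(instance, hypervisor_view):
        """Return the vpmem devices to account for this instance.

        hypervisor_view: {instance_uuid: [device, ...]} as read back from the
        hypervisor (e.g. the libvirt domain XML) outside of migrations.
        """
        mig_ctx = getattr(instance, 'migration_context', None)
        if mig_ctx is not None:
            # Pending migration: both the old and the new assignment are
            # persisted, so a compute restart mid-resize loses nothing.
            return set(mig_ctx.old_resources) | set(mig_ctx.new_resources)
        return set(hypervisor_view.get(instance.uuid, []))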
#2 Nova DB persistence, using a virt-driver-specific blob to store virt-driver-specific info.
The idea is to persist the assignment for the instance into the DB. The resource tracker gets the available resources from the virt driver, and calculates free resources on the fly from the available resources and the assigned resources recorded in the instance DB. The new field instance.resources is designed to support virt-driver-specific metadata, hiding the virt driver and platform detail from the RT. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new
I left some comments in the etherpad about the proposed claims process but the "on the fly" part concerns me for performance, especially if we don't make that conditional based on the types of resources we're claiming. During a claim the ResourceTracker already has the list of tracked_instances and tracked_migrations it cares about, but it sounds like you're proposing that we would also now have to re-fetch all of that data from the database just to get the resources and migration context information for any instances tracked by that host to determine what their assignments are. That seems really heavy-weight to me and is my major concern with this approach, well, that and the fact it sounds like we're creating a new version of the PCIManager (though more generic, it could have a lot of the same split brain type issues we've had with tracking PCI device inventory and allocations over the years since it was introduced; by split brain I mean the hypervisor saying one thing but nova thinking another).
pros: * Persists the assignment in the instance object, avoiding the corner cases where we lose the assignment. * The ResourceTracker is responsible for doing the claim. This is more reliable and has no race problem, since the ResourceTracker has worked well for a long time.
Heh, I guess yeah. :) There are a lot of dragons in that code and we're still fixing bugs in it even though it should be mostly stable after all of these years. But resource tracking in general sucks regardless of where it happens (RT, placement or the virt driver) so we just have to be comfortable with knowing there are going to be dragons.
* The virt-driver-specific json blob hides the virt driver/platform detail from the ResourceTracker.
Random json blobs are nasty in general especially if we need to convert data at runtime later for some upgrade purpose. What is proposed in the etherpad seems OK(ish) though given the only very random thing is the 'metadata' field, but I could see that all getting confusing to maintain later when we have different schema/semantic rules about what's in the metadata depending on the resource class and virt driver. But we'll likely have that problem anyway if we go with the non-persistent option #1 above.
* The free resource is calculated on the fly, keeping the implementation simple. (The RT just provides a point at which to do the claim; it doesn't need to involve the complexity of RT.update_available_resources.) cons: * Unlike the PCIManager, it doesn't have both instance-side and host-side persistent info; the on-the-fly calculation has to take care of orphaned instances (instances deleted from the DB but still existing on the host), so it isn't an unresolvable issue, and it wouldn't be too hard to add host-side persistent info in the future if we want it. * It is a data model change from the original proposal, and review is needed to decide whether the data model is generic enough.
Currently, Sean, Eric and I prefer #2, since #1 has flaws for same-host resize and live migration that can't be avoided by design.
At this point I can't say I have a strong opinion. I think either approach is going to be complicated and buggy and hard to maintain, especially if we don't have CI for these more exotic scenarios (which we don't for VGPU or VPMEM even though you said someone is working on the latter). I've voiced my concerns here but I'm not going to "die on a hill" for this, so in the end I'll likely roll over for whatever those of you that really care about this want to do, and know that you're going to be maintainers of it. -- Thanks, Matt
Matt Riedemann <mriedemos@gmail.com> 于2019年8月23日周五 上午5:53写道:
On 8/21/2019 1:59 AM, Alex Xu wrote:
We have had a lot of discussion on how to do the claim for vpmem. There are a few points we are trying to meet:
* Avoid race problems. (The current VGPU assignment has been found to have a race issue: https://launchpad.net/bugs/1836204) * Avoid making the device assignment management virt-driver- and platform-specific. * Keep it simple.
We have gone through two solutions so far. This email summarizes the pros/cons of these two solutions.
#1 No Nova DB persistence for the assignment info; depend on the hypervisor to persist it.
The idea is to add a VirtDriver.claim/unclaim_for_instance(instance_uuid, flavor_id) interface. The assignment info is populated from the hypervisor when nova-compute starts up and is kept in memory in the virt driver. The
Is there any reason the device assignment in-memory mapping has to be in the virt driver and not, for example, the ResourceTracker itself? This becomes important below.
We will answer this below. It is about whether using the migration allocation makes sense or not.
instance_uuid is used to distinguish claims from different instances. The flavor_id is used for same-host resize, to distinguish the claims for the source and the target. This virt driver method is invoked inside the ResourceTracker to avoid the race problem. There is no nova DB persistence for the assignment info at all.
https://review.opendev.org/#/q/status:open+project:openstack/nova+branch:mas...
pros: * Hides all the device detail and virt driver detail inside the virt driver. * Fewer upgrade issues in the future since it doesn't involve any nova DB model change. * Expected to be a simple implementation since everything lives inside the virt driver. cons: * Two cases have been found where the domain XML is lost with the libvirt virt driver, and we don't know other hypervisors' behavior yet.
How do we "lose" the domain xml? I guess your next points are examples?
* For same-host resize, the source and target instance share a single domain XML. After the libvirt virt driver updates the domain XML for the target instance, the source instance's assignment information is lost if a nova-compute restart happens. That means the resized instance can't be reverted; the only choice for the user is to confirm the resize.
As discussed with Dan and me in IRC a week or two ago, we suggested you could do the same migration-based allocation switch for move operations as we do for cold migrate, resize and live migration since Queens, where the source node allocations are consumed by the migration record and the target node allocations are consumed by the instance. The conductor swaps the source node allocations before calling the scheduler which will create the target node allocations with the instance. On confirm/revert we either drop the source node allocations (held by the migration) or swap them back (and drop the target node allocations held by the instance).
In your device case, clearly conductor and placement aren't involved since we're not tracking those low-level details in placement. Placement just knows there is a certain amount of some resource class, but not which consumers are actually assigned which devices on the hypervisor (like pci device management). But as far as keeping track of the assignments in memory, we could still do the same swap, where the migration record is tracking the old flavor device assignments (in the virt driver or resource tracker) and the instance record is tracking the new flavor device assignments. That resolves the same-host resize case, correct? Doing it generically in the ResourceTracker is why I asked above about doing that in the RT rather than the driver.
What that doesn't solve is restarts of the compute service while there is a pending resize, which is why we need to persist some information somewhere. We could use the domain xml if it contained the flavor id, but it doesn't - and for same-host resize we only have one domain xml so that's not really an option (as you've noted).
Actually, there are two problems here; let's talk about them separately:

1. Losing the allocation info after a compute service restart during a same-host resize

This is the point above. It has nothing to do with using the migration allocation versus instance_uuid + flavor_id. It can only be fixed by DB persistence, or, as you say later, by persisting in the MigrationContext. I will explain that later.

2. Supporting same-host cold migration

This is the point I raise below. For same-host resize, instance_uuid + flavor_id works very well, but it can't support same-host cold migration. And yes, the migration allocation can fix that. But, as you also said, do we need to support same-host cold migration?

If the answer is no, then we needn't bother with it; instance_uuid + flavor_id is much simpler. If the answer is yes, right, we can put it into the RT. But it will be complex; maybe we need a data model like the DB-way proposal to pass the virt-driver/platform-specific info between the RT and the virt driver. Also think about the case where we need to check whether there is any incomplete live migration: we need to do a cleanup of all free vpmems, since we lost the allocation info for the live migration. Then we need a virt driver interface to trigger that cleanup, and I'm pretty sure I don't want to call it driver.cleanup_vpmems(). We also need to change the existing driver.spawn method to pass the assigned resources into the virt driver. Also, thinking about the case of an interrupted migration, I guess there is no way to switch the

I also remember Dan said it isn't good to not support same-host cold migration.
* For live migration, the target host's domain XML will be cleaned up by libvirt after a host restart. The assignment information is lost before nova-compute starts up and does its cleanup.
I'm not really following you here. This is not an expected situation, correct? Meaning the target compute service is restarted while there is an in-progress live migration? I imagine if that happens we have lots of problems and most (manual) recovery procedures are going to involve the operator trying to destroy the guest and its related resources from the target host and hard rebooting to recover the guest on the source host.
It is worse than that: the restart of nova-compute will just set the instance back to active status https://github.com/openstack/nova/blob/62f6a0a1bc6c4b24621e1c2e927177f99501b... and leave the target host without any cleanup. Also, in the LM rollback method we set the instance back to active at the very beginning, so if the compute restarts before the actual cleanup, the target won't be cleaned up either. https://github.com/openstack/nova/blob/62f6a0a1bc6c4b24621e1c2e927177f99501b... We shouldn't set the instance back to active while there is a migration that hasn't been cleaned up. Those are existing bugs, and we should fix them. Whichever solution we choose, they won't be fixed automatically by the new solution.
* Cannot support same-host cold migration, since we need a way to distinguish the source and target instance's assignments in memory, but same-host cold migration means the same instance UUID and same flavor ID, so there is nothing else that can be used to distinguish the assignments.
The only in-tree virt driver that supports cold migrating on the same compute service host is the vmware driver, and that does not support things like VGPUs or VPMEMs, so I'm not sure why cold migration on the same host is a concern here - it's not supported and no one is working on adding that support.
* With workarounds added for the above points, the code becomes fragile.
To summarize, it sounds like the biggest problem is the lack of persistence during a same-host resize, because we'd lose the in-memory device assignment tracking even if we did the migration-based allocation swap magic as described above.
Exactly
Could we have a compromise where for all times *except* during some migration, we get the assigned devices from the hypervisor, but otherwise during a migration we store the old/new assignments in the MigrationContext? That would give us the persistence we need and would only be something that we temporarily care about during a migration. The thing I'm not sure about is if we do that, does it make things more complicated in general for the non-migration cases, or if we do it should we just go the extra mile and always be tracking assigned devices in the database exactly like what we do for PCI devices today - meaning we wouldn't have a special edge case just for migrations with these types of resources.
Then the only difference from the DB persistence way is also storing the allocation on "Instance.resources". If we take that one more step, then we needn't change our virt driver interface, or think about how to switch the consumer from the migration back to the instance, which is the complexity I described above.
#2 Nova DB persistence, using a virt-driver-specific blob to store virt-driver-specific info.
The idea is to persist the assignment for the instance into the DB. The resource tracker gets the available resources from the virt driver, and calculates free resources on the fly from the available resources and the assigned resources recorded in the instance DB. The new field instance.resources is designed to support virt-driver-specific metadata, hiding the virt driver and platform detail from the RT. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new
I left some comments in the etherpad about the proposed claims process but the "on the fly" part concerns me for performance, especially if we don't make that conditional based on the types of resources we're claiming. During a claim the ResourceTracker already has the list of tracked_instances and tracked_migrations it cares about, but it sounds like you're proposing that we would also now have to re-fetch all of that data from the database just to get the resources and migration context information for any instances tracked by that host to determine what their assignments are. That seems really heavy-weight to me and is my major concern with this approach, well, that and the fact it sounds like we're creating a new version of the PCIManager (though more generic, it could have a lot of the same split brain type issues we've had with tracking PCI device inventory and allocations over the years since it was introduced; by split brain I mean the hypervisor saying one thing but nova thinking another).
I think you are right, we can use RT.tracked_instances and RT.tracked_migrations; then it isn't on the fly anymore. There are two existing bugs that should be fixed:

1. Orphaned instances aren't in RT.tracked_instances. Although the resource consumption of orphaned instances is accounted for https://github.com/openstack/nova/blob/62f6a0a1bc6c4b24621e1c2e927177f99501b..., the virt driver interface https://github.com/openstack/nova/blob/62f6a0a1bc6c4b24621e1c2e927177f99501b... isn't implemented by most virt drivers.

2. Error-status migrations aren't in RT.tracked_migrations. A resize may be interrupted in the middle, and then we set the migration to an error status. Although we have a _clean_incomplete_migration periodic task to clean up those error migrations, there is a window before the cleanup in which the RT doesn't count the resource consumption.

Those are existing bugs and are easy to fix. That is why I used the on-the-fly calculation in the beginning, but I agree those bugs are easy to fix, and the code will be tidier.

For the split-brain problem, to be honest, the domain XML way shows us it can't fix that either: it loses the allocation for same-host resize and live migration.
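A rough sketch of that claim-time calculation (simplified attribute names, not the actual RT code): the free set is the driver-reported inventory minus everything recorded on the instances and migrations the RT already tracks.

    def free_vpmem_devices(driver_devices, tracked_instances, tracked_migrations):
        """driver_devices: device names reported by the virt driver.

        tracked_instances / tracked_migrations: records whose .resources holds
        the assigned device identifiers persisted in the DB.
        """
        assigned = set()
        for inst in tracked_instances:
            assigned.update(inst.resources or [])
        for mig in tracked_migrations:
            # Covers a pending same-host resize: the migration context holds
            # the old assignment while the instance holds the new one.
            assigned.update(mig.resources or [])
        return sorted(set(driver_devices) - assigned)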
pros: * Persists the assignment in the instance object, avoiding the corner cases where we lose the assignment. * The ResourceTracker is responsible for doing the claim. This is more reliable and has no race problem, since the ResourceTracker has worked well for a long time.
Heh, I guess yeah. :) There are a lot of dragons in that code and we're still fixing bugs in it even though it should be mostly stable after all of these years. But resource tracking in general sucks regardless of where it happens (RT, placement or the virt driver) so we just have to be comfortable with knowing there are going to be dragons.
I already listed the bugs above. I think the problem is that we are missing some tracking and don't have a closed loop for the instance and migration statuses. I added my analysis at the bottom of the etherpad. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new
* The virt-driver-specific json blob hides the virt driver/platform detail from the ResourceTracker.
Random json blobs are nasty in general especially if we need to convert data at runtime later for some upgrade purpose. What is proposed in the etherpad seems OK(ish) though given the only very random thing is the 'metadata' field, but I could see that all getting confusing to maintain later when we have different schema/semantic rules about what's in the metadata depending on the resource class and virt driver. But we'll likely have that problem anyway if we go with the non-persistent option #1 above.
It is a JSON blob dumped from a versioned object, so it should be OK?
* The free resource is calculated on the fly, keeping the implementation simple. (The RT just provides a point at which to do the claim; it doesn't need to involve the complexity of RT.update_available_resources.) cons: * Unlike the PCIManager, it doesn't have both instance-side and host-side persistent info; the on-the-fly calculation has to take care of orphaned instances (instances deleted from the DB but still existing on the host), so it isn't an unresolvable issue, and it wouldn't be too hard to add host-side persistent info in the future if we want it. * It is a data model change from the original proposal, and review is needed to decide whether the data model is generic enough.
Currently, Sean, Eric and I prefer #2, since #1 has flaws for same-host resize and live migration that can't be avoided by design.
At this point I can't say I have a strong opinion. I think either approach is going to be complicated and buggy and hard to maintain, especially if we don't have CI for these more exotic scenarios (which we don't for VGPU or VPMEM even though you said someone is working on the latter). I've voiced my concerns here but I'm not going to "die on a hill" for this, so in the end I'll likely roll over for whatever those of you that really care about this want to do, and know that you're going to be maintainers of it.
If you are worried about VPMEM itself, Rui is working on CI; he said he needs two weeks before the work is done. We can ask him to give an update here if you want. If you are worried about the RT part, I think we can have functional tests to cover that? I wouldn't say the DB way is complicated: most of the code in the RT is about getting the assigned resources from tracked_instances and tracked_migrations and comparing them to the available resources. The bugginess is in existing nova bugs; it isn't the fault of the proposal. I don't know what the maintenance problem points to; it would be great to have a specific case to discuss.
--
Thanks,
Matt
On 8/23/2019 3:43 AM, Alex Xu wrote:
2. Supporting same-host cold migration
This is the point I raise below. For same-host resize, instance_uuid + flavor_id works very well, but it can't support same-host cold migration. And yes, the migration allocation can fix that. But, as you also said, do we need to support same-host cold migration?
I see no reason to try and bend over backward to support same host cold migration since, as I said, the only virt driver that supports that today (and has been the only one for a long time - maybe forever?) is the vmware driver which isn't supporting any of these more advanced flows (VGPU, VPMEM, PCPU).
If the answer is no, then we needn't bother with it; instance_uuid + flavor_id is much simpler. If the answer is yes, right, we can put it into the RT. But it will be complex; maybe we need a data model like the DB-way proposal to pass the virt-driver/platform-specific info between the RT and the virt driver. Also think about the case where we need to check whether there is any incomplete live migration: we need to do a cleanup of all free vpmems, since we lost the allocation info for the live migration. Then we need a virt driver interface to trigger that cleanup, and I'm pretty sure I don't want to call it driver.cleanup_vpmems(). We also need to change the existing driver.spawn method to pass the assigned resources into the virt driver. Also, thinking about the case of an interrupted migration, I guess there is no way to switch the
I also remember Dan said it isn't good to not support same-host cold migration.
Again, the libvirt driver, as far as I know, has never supported same host cold migration, nor is anyone working on that, so I don't see where the need to make that support happen now is coming from. I think it should be ignored for the sake of these conversations.
I think you are right, we can use RT.tracked_instances and RT.tracked_migrations; then it isn't on the fly anymore. There are two existing bugs that should be fixed.
1. Orphaned instances aren't in RT.tracked_instances. Although the resource consumption of orphaned instances is accounted for https://github.com/openstack/nova/blob/62f6a0a1bc6c4b24621e1c2e927177f99501b..., the virt driver interface https://github.com/openstack/nova/blob/62f6a0a1bc6c4b24621e1c2e927177f99501b... isn't implemented by most virt drivers.
For the latter, get_per_instance_usage, that's only implemented by the xenapi driver, which is on the path to being deprecated by the end of Train at this point anyway: https://review.opendev.org/#/c/662295/ so I wouldn't worry too much about that one.

<snip>

In summary, I'm not going to block attempts at proposal #2. As you said, there are existing bugs which should be handled, though some likely won't ever be completely fixed (automatic cleanup and recovery from live migration failures - the live migration methods are huge and have a lot of points of failure, so properly rolling back from all of those is going to be a big undertaking in test and review time, and I don't see either happening at this stage).

I think one of the motivations to keep VPMEM resource tracking isolated to the hypervisor was just to get something quick and dirty working with a minimal amount of impact to other parts of nova, like the data model, ResourceTracker, etc. If proposal #2 also solves issues for VGPUs and PCPUs then there is more justification for doing it.

Either way I'm not opposed to the #2 proposal, so if that's what the people that are working on this want, go ahead. I personally don't plan on investing much review time in this series either way though, so that's kind of why I'm apathetic about this.

--
Thanks,

Matt
participants (4): Alex Xu, Eric Fried, Matt Riedemann, Sean Mooney