On Wed, 2019-08-21 at 14:51 -0500, Eric Fried wrote:
Alex-
Thanks for writing this up.
#1 Without Nova DB persistence for the assignment info, depend on the hypervisor to persist it.
I liked the "no persistence" option in theory, but it unfortunately turned out to be too brittle when it came to the corner cases.
#2 With Nova DB persistence, using a virt-driver-specific blob to store virt-driver-specific info.
The idea is to persist the assignment for each instance into the DB. The resource tracker gets the available resources from the virt driver and calculates free resources on the fly from those available resources and the assignments stored with the instances in the DB. The new field instance.resources is designed to support virt-driver-specific metadata, hiding the virt driver and platform detail from the resource tracker. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new
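To make the on-the-fly calculation concrete, here is a minimal Python sketch of the idea as described above; the function name, the identifier field and the list shape of instance.resources are illustrative only, not the actual proposed objects:

    # Illustrative only -- not the real resource tracker code.
    def free_resources(driver_resources, tracked_instances):
        """Compute free resources on the fly.

        driver_resources: opaque resource records reported by the virt
            driver (e.g. the vPMEM namespaces it discovered on the host).
        tracked_instances: instances loaded from the DB, each carrying
            the new instance.resources list of assigned records.
        """
        assigned = set()
        for inst in tracked_instances:
            assigned.update(res.identifier for res in inst.resources or [])
        # Whatever the driver reports but no instance claims is free.
        return [res for res in driver_resources
                if res.identifier not in assigned]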
I just took a closer look at this, and I really like it.
Persisting local resource information with the Instance and MigrationContext objects ensures we don't lose it in weird corner cases, regardless of a specific hypervisor's "persistence model" (e.g. domain XML for libvirt).
MigrationContext is already being used for this old_*/new_* concept - but the existing fields are hypervisor-specific (NUMA and PCI).
Storing this information in a generic, opaque-outside-of-virt way means we're not constantly bolting hypervisor-specific fields onto what *should* be non-hypervisor-specific objects.
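For what it's worth, here is the rough shape such a generic record could take; this is just a dataclass illustration of the "opaque outside of virt" concept, not the eventual versioned-object definition:

    # Illustration only, not the real versioned objects.
    import dataclasses
    from typing import Any, List, Optional

    @dataclasses.dataclass
    class Resource:
        provider_uuid: str              # placement resource provider
        resource_class: str             # e.g. CUSTOM_PMEM_NAMESPACE_4GB
        identifier: str                 # stable ID the virt driver understands
        metadata: Optional[Any] = None  # virt-driver-specific blob, opaque to the RT

    @dataclasses.dataclass
    class MigrationResources:
        old_resources: List[Resource]   # assignments on the source host
        new_resources: List[Resource]   # assignments on the destination host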
As you've stated in the etherpad, this framework sets us up nicely to start transitioning existing PCI/NUMA-isms over to a Placement-driven model in the near future.
Having the virt driver report provider tree (placement-specific) and "real" (hypervisor-specific) resource information at the same time makes all kinds of sense.
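As a strawman, that combined reporting could look something like the sketch below; the method name, signature, resource class and device paths are invented for the example and are not an agreed interface:

    # Strawman only: names and return convention are made up.
    class StrawmanDriver(object):
        def report_resources(self, provider_tree, nodename):
            # Placement-specific view: inventory on the compute provider.
            provider_tree.update_inventory(nodename, {
                'CUSTOM_PMEM_NAMESPACE_4GB': {'total': 4},
            })
            # Hypervisor-specific view: the concrete devices backing that
            # inventory, as opaque records for the resource tracker.
            return [{'resource_class': 'CUSTOM_PMEM_NAMESPACE_4GB',
                     'identifier': 'ns_%d' % i,
                     'metadata': {'devpath': '/dev/dax0.%d' % i}}
                    for i in range(4)]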
So, quite aside from solving the stated race condition and enabling vpmem, all of this is excellent movement toward the "generic device (resource) management" we've been talking about for years.
Let's make it so.

efried

I agree with most of what Eric said above and with the content of the etherpad. I left a couple of comments inline, but I don't want to pollute it too much with comments, so I will summarise some additional thoughts here.

tl;dr: I think this would allow us to converge the tracking and assignment of vPMEM, vGPUs, PCI devices and pCPUs. Each of these resources requires Nova to assign a specific host device, and with this proposal that can be done generically in some cases and delegated to the driver in others. The simple vPMEM use of this is relatively self-contained, but using it for the other resources will require thought and work to enable.

More detailed thoughts:

1.) Short term, I believe host-side tracking is not needed for vPMEM or vGPU.
2.) Medium term, having host-side tracking of resources might simplify vPMEM, vGPU, pCPU and PCI tracking.
3.) Long term, I think that if we use placement correctly and have instance-level tracking, we might not need host-side tracking at all.
3.a) Instance-level tracking will allow us to reliably compute the host-side view in the virt driver from config and device discovery.
3.b) With nested resource providers and the new ability to do nested queries, we can move filtering mostly to placement.
4.) We use a host/instance NUMA topology blob for mempages (hugepages) today; if we model them in placement, I don't think we will need host-side tracking for filtering (see the note on weighing later).
4.a) If we have pCPUs and mempages as children of cache regions or NUMA nodes, we can do NUMA/cache affinity of those resources and PCI devices using same_subtree, or whatever it ended up being called in placement (a rough example query is sketched after this mail).
4.b) Hugepages are currently not assigned by Nova; we just do a tally count of how many of a given size are free on individual NUMA nodes and select a NUMA node, which I think can be done entirely via placement as of about 5-6 weeks ago. The assignment is done by the kernel, which is why we don't need to track individual hugepages at the host level.
5.) If we don't have host-side tracking we cannot do classic weighing of local resources, as we do not have the data.
6.) If we pass allocation candidates to the filters instead of hosts, we can replace our existing filters with placement-aware filters that use the placement tree structure and traits to weigh the possible allocation candidates, which will in turn weigh the hosts.
7.) pCPUs, unlike hugepages, are assigned by Nova and would need to be tracked in memory at the host level. This host view could be computed by the virt driver if we track the assignment in the instance and migrations, but host-side tracking would be simpler to port the existing code to. pCPUs would need to be assigned within the driver from the free resources returned by the resource tracker.
8.) This might move some of the logic from nova/virt/hardware.py to the libvirt driver, where it probably should always have been.
8.a) The validation of flavor extra specs in nova/virt/hardware.py that is used in the API would not be moved to the driver.

Regards,
Sean
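To illustrate points 4.a and 6 above, here is a rough sketch of what such a placement query could look like; it assumes the query parameter kept the name same_subtree, and the group suffixes and custom resource classes are invented for the example:

    # Hedged sketch only: suffix names and custom resource classes are
    # invented, and the "same_subtree" parameter name is an assumption.
    from urllib.parse import urlencode

    params = {
        'resources_CPU': 'PCPU:4',
        'resources_MEM': 'CUSTOM_MEMPAGE_2MB:512',
        'resources_NIC': 'CUSTOM_PHYSNET_VLAN:1',
        # all three suffixed request groups must be satisfied by providers
        # under the same subtree, e.g. one NUMA node and its children
        'same_subtree': '_CPU,_MEM,_NIC',
    }
    print('GET /allocation_candidates?' + urlencode(params))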