On 8/21/2019 1:59 AM, Alex Xu wrote:
We've had a lot of discussion on how to do the claim for vpmem. There are a few goals we are trying to meet:
* Avoid race problems. (The current VGPU assignment has been found to have a race issue: https://launchpad.net/bugs/1836204)
* Avoid making the device assignment management virt-driver- and platform-specific.
* Keep it simple.
We have gone through two solutions so far. This email summarizes the pros/cons of each.
#1 Without Nova DB persistence for the assignment info; depend on the hypervisor to persist it.
The idea is to add a VirtDriver.claim/unclaim_for_instance(instance_uuid, flavor_id) interface. The assignment info is populated from the hypervisor when nova-compute starts up and is kept in memory in the virt driver. The
Is there any reason the device assignment in-memory mapping has to be in the virt driver and not, for example, the ResourceTracker itself? This becomes important below.
instance_uuid is used to distinguish claims from different instances. The flavor_id is used for same-host resize, to distinguish the claims for the source and the target. This virt driver method is invoked inside the ResourceTracker to avoid the race problem. There is no nova DB persistence for the assignment info. https://review.opendev.org/#/q/status:open+project:openstack/nova+branch:mas...
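For concreteness, here is a minimal sketch of what such an in-memory claim interface might look like. The claim/unclaim method names come from the proposal above; the class, helpers and bodies are hypothetical illustrations, not the code under review:

class FakeVPMEMDriver(object):
    """Hypothetical virt driver keeping claims only in memory."""

    def __init__(self, device_names):
        self._devices = set(device_names)
        # (instance_uuid, flavor_id) -> set of assigned device names,
        # rebuilt from the hypervisor (e.g. domain XMLs) on startup;
        # nova itself persists nothing.
        self._assignments = {}

    def claim_for_instance(self, instance_uuid, flavor_id, count):
        # Invoked under the ResourceTracker's lock so concurrent
        # claims on the same host cannot race.
        assigned = set()
        for devs in self._assignments.values():
            assigned |= devs
        free = sorted(self._devices - assigned)
        if len(free) < count:
            raise RuntimeError('not enough free devices')
        claim = set(free[:count])
        self._assignments[(instance_uuid, flavor_id)] = claim
        return claim

    def unclaim_for_instance(self, instance_uuid, flavor_id):
        self._assignments.pop((instance_uuid, flavor_id), None)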
pros:
* Hides all the device detail and virt driver detail inside the virt driver.
* Fewer upgrade issues in the future since it doesn't involve any nova DB model change.
* Expected to be a simple implementation since everything is inside the virt driver.
cons:
* Two cases have been found where the domain XML is lost for the libvirt virt driver, and we don't know other hypervisors' behavior yet.
How do we "lose" the domain xml? I guess your next points are examples?
* For a same-host resize, the source and target instance share a single domain XML. After the libvirt virt driver updates the domain XML to the target instance, the source instance's assignment information is lost if a nova-compute restart happens. That means the resize can't be reverted; the only choice for the user is to confirm it.
As discussed with Dan and me in IRC a week or two ago, we suggested you could do the same migration-based allocation switch for move operations as we do for cold migrate, resize and live migration since Queens, where the source node allocations are consumed by the migration record and the target node allocations are consumed by the instance. The conductor swaps the source node allocations before calling the scheduler, which will create the target node allocations with the instance. On confirm/revert we either drop the source node allocations (held by the migration) or swap them back (and drop the target node allocations held by the instance).

In your device case, clearly conductor and placement aren't involved since we're not tracking those low-level details in placement. Placement just knows there is a certain amount of some resource class but not which consumers are actually assigned which devices on the hypervisor (like pci device management). But as far as keeping track of the assignments in memory, we could still do the same swap where the migration record is tracking the old flavor device assignments (in the virt driver or resource tracker) and the instance record is tracking the new flavor device assignments. That resolves the same-host resize case, correct? Doing it generically in the ResourceTracker is why I asked above about doing that in the RT rather than the driver.

What that doesn't solve is restarts of the compute service while there is a pending resize, which is why we need to persist some information somewhere. We could use the domain xml if it contained the flavor id, but it doesn't - and for same-host resize we only have one domain xml, so that's not really an option (as you've noted).
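To make the swap concrete, here is a rough, hypothetical sketch of tracking device assignments per consumer (instance or migration record) so they can be swapped the same way the placement allocations are; all names here are illustrative, not actual nova code:

class DeviceAssignmentTracker(object):
    """Hypothetical per-consumer assignment map."""

    def __init__(self):
        self._by_consumer = {}  # consumer uuid -> set of devices

    def start_resize(self, instance_uuid, migration_uuid, new_devices):
        # The migration record takes over the old-flavor devices and
        # the instance record picks up the new-flavor devices.
        self._by_consumer[migration_uuid] = self._by_consumer.pop(instance_uuid)
        self._by_consumer[instance_uuid] = set(new_devices)

    def confirm_resize(self, migration_uuid):
        # Drop the old-flavor devices held by the migration.
        self._by_consumer.pop(migration_uuid, None)

    def revert_resize(self, instance_uuid, migration_uuid):
        # Drop the new-flavor devices and swap the old ones back.
        self._by_consumer[instance_uuid] = self._by_consumer.pop(migration_uuid)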
* For live migration, the target host's domain XML will be cleaned up by libvirt after a host restart. The assignment information is lost before nova-compute starts up and does a cleanup.
I'm not really following you here. This is not an expected situation, correct? Meaning the target compute service is restarted while there is an in-progress live migration? I imagine if that happens we have lots of problems, and most (manual) recovery procedures are going to involve the operator trying to destroy the guest and its related resources from the target host and hard rebooting to recover the guest on the source host.
* Cannot support same-host cold migration, since we need a way to distinguish the source and target instance's assignments in memory. But same-host cold migration means the same instance UUID and the same flavor ID, so there is nothing else that can be used to distinguish the assignments.
The only in-tree virt driver that supports cold migrating on the same compute service host is the vmware driver, and that does not support things like VGPUs or VPMEMs, so I'm not sure why cold migration on the same host is a concern here - it's not supported and no one is working on adding that support.
* With the workarounds added for the above points, the code becomes fragile.
To summarize, it sounds like the biggest problem is the lack of persistence during a same-host resize, because we'd lose the in-memory device assignment tracking even if we did the migration-based allocation swap magic as described above.

Could we have a compromise where at all times *except* during some migration we get the assigned devices from the hypervisor, but during a migration we store the old/new assignments in the MigrationContext? That would give us the persistence we need and would only be something we temporarily care about during a migration.

The thing I'm not sure about is, if we do that, does it make things more complicated in general for the non-migration cases? Or should we just go the extra mile and always track assigned devices in the database, exactly like what we do for PCI devices today - meaning we wouldn't have a special edge case just for migrations with these types of resources.
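A hedged sketch of that compromise, assuming hypothetical old_resources/new_resources fields on the MigrationContext and a hypothetical driver helper for reading assignments back from the hypervisor:

def get_device_assignments(driver, instance):
    mig_ctx = instance.migration_context
    if mig_ctx is not None:
        # Mid-migration: the DB-persisted context survives a
        # nova-compute restart, unlike the in-memory mapping.
        return mig_ctx.old_resources, mig_ctx.new_resources
    # Steady state: the hypervisor (e.g. the domain XML) is the
    # source of truth.
    return driver.get_assigned_resources(instance), None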
#2 With nova DB persistence, but using a virt-driver-specific blob to store virt-driver-specific info.
The idea is to persist the assignments for an instance in the DB. The resource tracker gets available resources from the virt driver and calculates free resources on the fly based on the available resources and the assigned resources from the instance DB records. The new field instance.resources is designed to support virt-driver-specific metadata, hiding the virt driver and platform detail from the RT. https://etherpad.openstack.org/p/vpmems-non-virt-driver-specific-new
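For concreteness, a rough sketch of the "on the fly" free-resource calculation under this proposal; the helper names and the shape of the instance.resources entries are assumptions for illustration:

def calculate_free_resources(driver, instances):
    # Available resources come from the virt driver; assigned
    # resources come from the instances' DB records.
    available = {r.identifier for r in driver.get_available_resources()}
    assigned = set()
    for inst in instances:
        assigned |= {r.identifier for r in inst.resources or []}
    return available - assigned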
I left some comments in the etherpad about the proposed claims process but the "on the fly" part concerns me for performance, especially if we don't make that conditional based on the types of resources we're claiming. During a claim the ResourceTracker already has the list of tracked_instances and tracked_migrations it cares about, but it sounds like you're proposing that we would also now have to re-fetch all of that data from the database just to get the resources and migration context information for any instances tracked by that host to determine what their assignments are. That seems really heavy-weight to me and is my major concern with this approach, well, that and the fact it sounds like we're creating a new version of the PCIManager (though more generic, it could have a lot of the same split brain type issues we've had with tracking PCI device inventory and allocations over the years since it was introduced; by split brain I mean the hypervisor saying one thing but nova thinking another).
pros:
* Persists the assignments in the instance object, avoiding the corner cases where we lose them.
* The ResourceTracker is responsible for doing the claim job. This is more reliable and has no race problem, since the ResourceTracker has worked very well for a long time.
Heh, I guess yeah. :) There are a lot of dragons in that code and we're still fixing bugs in it even though it should be mostly stable after all of these years. But resource tracking in general sucks regardless of where it happens (RT, placement or the virt driver) so we just have to be comfortable with knowing there are going to be dragons.
* The virt-driver-specific json blob hides the virt driver/platform detail from the ResourceTracker.
Random json blobs are nasty in general, especially if we need to convert data at runtime later for some upgrade purpose. What is proposed in the etherpad seems OK(ish) though, given the only really random thing is the 'metadata' field, but I could see that all getting confusing to maintain later when we have different schema/semantic rules about what's in the metadata depending on the resource class and virt driver. But we'll likely have that problem anyway if we go with the non-persistent option #1 above.
* The free resources are calculated on the fly, keeping the implementation simple. The RT just provides a point to do the claim; it needn't involve the complexity of RT.update_available_resources.
cons:
* Not like the PCIManager, which has both instance-side and host-side persistent info. The on-the-fly calculation has to take care of orphaned instances (instances deleted from the DB but still existing on the host), so it isn't an unresolvable issue (see the sketch below). And it isn't too hard to upgrade to host-side persistent info in the future if we want.
* A data model change is needed for the original proposal. Review is needed to decide whether the data model is generic enough.
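As a hedged illustration of the orphaned-instance point above, the on-the-fly calculation could also reserve devices held by guests that are still on the host but gone from the DB; the helper names are hypothetical:

def assigned_devices_with_orphans(driver, db_instances):
    assigned = set()
    db_uuids = set()
    for inst in db_instances:
        db_uuids.add(inst.uuid)
        assigned |= {r.identifier for r in inst.resources or []}
    # Also reserve devices held by orphans: guests still running on
    # the host whose instance records are gone from the DB.
    for guest_uuid in driver.list_instance_uuids():
        if guest_uuid not in db_uuids:
            assigned |= set(driver.get_assigned_device_ids(guest_uuid))
    return assigned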
Currently, Sean, Eric and I prefer #2, since #1's flaws for same-host resize and live migration can't be avoided by design.
At this point I can't say I have a strong opinion. I think either approach is going to be complicated and buggy and hard to maintain, especially if we don't have CI for these more exotic scenarios (which we don't for VGPU or VPMEM, even though you said someone is working on the latter). I've voiced my concerns here but I'm not going to "die on a hill" for this, so in the end I'll likely roll over for whatever those of you that really care about this want to do, knowing that you're going to be the maintainers of it.

-- 

Thanks,

Matt