On 8/19/2019 2:28 AM, Nadathur, Sundar wrote:
> Many of them worked as expected: pause/unpause, lock/unlock, rescue/unrescue, etc. That is, the application in the VM can successfully offload to the accelerator device before and after the sequence.
I just wanted to point out that lock/unlock has nothing to do with the guest and is control-plane only in the compute API.
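For what it's worth, you can see that from the API side alone. A minimal sketch with openstacksdk (the 'devstack' cloud name and 'vm1' server name are just placeholders for your environment):

import openstack

conn = openstack.connect(cloud='devstack')
server = conn.compute.find_server('vm1')

# Lock only flips the API-level lock flag; the guest (and any attached VF)
# keeps running untouched.
conn.compute.lock_server(server)
server = conn.compute.get_server(server.id)
print(server.status)   # still ACTIVE; nothing changed inside the VM

conn.compute.unlock_server(server)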
> But, the shelve/shelve-offloaded/unshelve sequence shows two discrepancies:
> * After shelve, the instance is shut off in Libvirt but is shown as ACTIVE in ‘openstack server list’.
After a successful shelve/shelve offload, the server status should be SHELVED or SHELVED_OFFLOADED, not ACTIVE. Did something fail during the shelve and the instance was left in ACTIVE state rather than ERROR state?
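One way to sanity-check that, again as a rough openstacksdk sketch (same placeholder cloud/server names as above; the poll loop is just illustrative):

import time
import openstack

conn = openstack.connect(cloud='devstack')
server = conn.compute.find_server('vm1')

conn.compute.shelve_server(server)
for _ in range(60):                    # poll for up to ~5 minutes
    server = conn.compute.get_server(server.id)
    if server.status in ('SHELVED', 'SHELVED_OFFLOADED', 'ERROR'):
        break
    time.sleep(5)
print(server.status)   # if this still says ACTIVE, the shelve never actually ran

If it ends up in ERROR, the compute logs should tell you what broke during the shelve.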
> * After unshelve, the PCI VF gets re-attached on VM startup and the application inside the VM can access the accelerator device. However, ‘openstack resource provider usage show <rp-uuid>’ shows the RC usage as 0, i.e., there seems to be no claim in Placement for the resource in use.
What is the resource class? Is it something reported by cyborg on a nested resource provider under the compute node provider? Note that unshelve goes through the scheduler to pick a destination host (like the initial create) and calls placement. If you're not persisting information about the resources to "claim" during scheduling on the RequestSpec, then that information needs to be re-calculated and set on the RequestSpec prior to calling select_destinations during the unshelve flow in conductor. gibi's series to add move support for bandwidth-aware QoS ports needs to do something similar. This patch is for resize/cold migration but you get the idea: https://review.opendev.org/#/c/655112/
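To make that concrete, here is a stripped-down, illustrative-only sketch of the idea. These classes and the prepare_unshelve helper are simplified stand-ins made up for the example, not the real nova objects, and CUSTOM_FPGA_VF is just a placeholder resource class:

from dataclasses import dataclass, field


@dataclass
class ResourceRequest:
    # Resources the scheduler must claim in placement, e.g. {'CUSTOM_FPGA_VF': 1}.
    resources: dict


@dataclass
class RequestSpec:
    # Persisted per-instance scheduling input.
    instance_uuid: str
    requested_resources: list = field(default_factory=list)


def prepare_unshelve(spec, device_profile_resources):
    # If the accelerator resource request was never persisted on the spec, it has
    # to be re-calculated (e.g. from the device profile) and set here; otherwise
    # the scheduler never asks placement to claim it and the usage stays at 0.
    if not spec.requested_resources:
        spec.requested_resources = [ResourceRequest(device_profile_resources)]
    return spec


def select_destinations(spec):
    # Stand-in for the scheduler call: only resources present on the spec end up
    # as allocations against the chosen host's providers.
    return {'host': 'compute-1',
            'claimed': [r.resources for r in spec.requested_resources]}


spec = prepare_unshelve(RequestSpec(instance_uuid='fake-uuid'), {'CUSTOM_FPGA_VF': 1})
print(select_destinations(spec))   # {'host': 'compute-1', 'claimed': [{'CUSTOM_FPGA_VF': 1}]}

The point is simply that whatever conductor passes to the scheduler is all placement will ever hear about, so the accelerator request has to be on the spec before select_destinations is called.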
> After shelve, the instance transitions to ‘shelve-offloaded’ automatically after the configured time interval. The resource class usage is 0. This part is good. But, after the unshelve, one would think the usage would be bumped up automatically.
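(The automatic offload interval you mention is nova's [DEFAULT]/shelved_offload_time option: 0 offloads immediately after the shelve, a positive value waits that many seconds, and -1 disables the automatic offload.)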
--
Thanks,
Matt