On Thu, 21 Sept 2023 at 17:27, Karl Kloppenborg <kkloppenborg@resetdata.com.au> wrote:
Hi Cyborg Team!
Karl from Helm Team.
When I create a VM with the correct flavor, the mdev gets created by the Cyborg agent and I can see it in the output of virsh nodedev-list --cap mdev.
However Nova then fails with:
nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
I don't remember exactly how Cyborg passes the devices to nova/libvirt, but this exception is raised because none of the available GPUs has either an existing mdev or the capability to create a new one. You should first check sysfs to confirm the state of your GPU devices and see how much vGPU capacity you still have.
-Sylvain
Once this has happened, the ARQ removes the mdev and cleans up.
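The sysfs check suggested in the reply above could be scripted roughly as follows. This is a minimal sketch, not something taken from Nova or Cyborg: it assumes the standard kernel mdev layout, /sys/class/mdev_bus/<pci address>/mdev_supported_types/<type>/available_instances, and the function name mdev_capacity is made up for illustration. The type directory names it prints depend on the vendor driver installed on the host.

```python
from pathlib import Path


def mdev_capacity(root="/sys/class/mdev_bus"):
    """Return {pci_address: {mdev_type: (type_name, available_instances)}}.

    Assumes the standard kernel mdev sysfs layout; returns an empty dict
    if the directory does not exist (for example, no mdev-capable driver).
    """
    capacity = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return capacity
    for dev in sorted(root_path.iterdir()):
        types_dir = dev / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        types = {}
        for mdev_type in sorted(types_dir.iterdir()):
            avail_file = mdev_type / "available_instances"
            if not avail_file.is_file():
                continue
            # The human-readable "name" file is optional in this sketch.
            name_file = mdev_type / "name"
            name = name_file.read_text().strip() if name_file.is_file() else ""
            types[mdev_type.name] = (name, int(avail_file.read_text().strip()))
        capacity[dev.name] = types
    return capacity


if __name__ == "__main__":
    # Print one line per device and mdev type, with the remaining capacity.
    for address, types in mdev_capacity().items():
        for mdev_type, (name, available) in types.items():
            print(f"{address} {mdev_type} ({name}): {available} available")
```

Run on the compute node, this lists the remaining capacity per device and type. If every device reports 0 available instances for the type you need, that would be consistent with the "No free space found" message in the Nova log.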
I’ve got Cyborg 2023.2 running and have a device profile like so:
karl@Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
+-------------+---------------------------------------------------------------------------+
| Field       | Value                                                                     |
+-------------+---------------------------------------------------------------------------+
| created_at  | 2023-09-21 13:30:05+00:00                                                 |
| updated_at  | None                                                                      |
| uuid        | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae                                      |
| name        | VGPU_A40-Q48                                                              |
| groups      | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
| description | None                                                                      |
+-------------+---------------------------------------------------------------------------+
karl@Karls-Air ~ %
I can see the allocation candidate:
karl@Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
| 41 | VGPU=1 | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1 | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
karl@Karls-Air ~ %
Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?
Any help from the Cyborg or Nova teams would be fantastic.
Thanks, Karl.