Cyborg nova reports mdev-capable resource is not available

Karl Kloppenborg kkloppenborg at resetdata.com.au
Thu Sep 21 16:43:13 UTC 2023


Hi Sylvain,

Thanks for getting back to me.
The vGPU is available, and Cyborg is allocating it through ARQ binding.
You can see Nova receives this request:

2023-09-21 16:38:51.889 1901814 DEBUG nova.compute.manager [None req-97062e9c-0c44-480e-9918-4a5a810175b2 78e83e5a446e4071ae43e823135dcb3c 21eb701c2a1f48b38dab8f34c0a20902 - - default default] ARQs for spec:{'2d60c353-0419-4b67-8cb7-913fc6f5cef9': {'uuid': '2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'state': 'Bound', 'device_profile_name': 'VGPU_A40-Q48', 'device_profile_group_id': 0, 'hostname': 'gpu-c-01', 'device_rp_uuid': '229bf15f-5689-3d2c-b37b-5c8439ea6a71', 'instance_uuid': '1b090007-791b-4997-af89-0feb886cf11d', 'project_id': None, 'attach_handle_type': 'MDEV', 'attach_handle_uuid': '866bd6a5-b156-4251-a969-64fefb32f16f', 'attach_handle_info': {'asked_type': 'nvidia-566', 'bus': 'ca', 'device': '01', 'domain': '0000', 'function': '1', 'vgpu_mark': 'nvidia-566_0'}, 'links': [{'href': 'http://cyborg-api.openstack.svc.cluster.local:6666/accelerator/v2/accelerator_requests/2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'rel': 'self'}], 'created_at': '2023-09-21T16:38:42+00:00', 'updated_at': '2023-09-21T16:38:42+00:00'}}, ARQs for network:{} _build_resources /var/lib/openstack/lib/python3.10/site-packages/nova/compute/manager.py:2680

So the mdev is then allocated in the resource providers at that point.
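
To rule out a PCI address mismatch, the ARQ's attach_handle_info can be turned into the pci_<domain>_<bus>_<device>_<function> form that the Nova log uses for device names (for example pci_0000_4b_03_1) and then compared against the device list in the Nova error. This is only a minimal sketch: the helper name nodedev_name is hypothetical, and the naming pattern is inferred from the log output in this thread rather than taken from the Nova or Cyborg code.

```python
def nodedev_name(attach_handle_info):
    """Build a libvirt-style node device name from an ARQ attach_handle_info dict.

    Assumption: the name follows the pci_<domain>_<bus>_<device>_<function>
    pattern seen in the Nova log; extra keys in the dict are ignored.
    """
    return "pci_{domain}_{bus}_{device}_{function}".format(**attach_handle_info)


# attach_handle_info copied from the ARQ in the Nova debug log above
info = {
    "asked_type": "nvidia-566",
    "bus": "ca",
    "device": "01",
    "domain": "0000",
    "function": "1",
    "vgpu_mark": "nvidia-566_0",
}

print(nodedev_name(info))  # pci_0000_ca_01_1
```

If the resulting name does not show up among the devices Nova lists when it fails to create the mdev, that would suggest the two services are looking at different PCI functions.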

Is there some Cyborg/Nova patching code I am missing?




From: Sylvain Bauza <sbauza at redhat.com>
Date: Friday, 22 September 2023 at 1:49 am
To: Karl Kloppenborg <kkloppenborg at resetdata.com.au>
Cc: openstack-discuss at lists.openstack.org <openstack-discuss at lists.openstack.org>
Subject: Re: Cyborg nova reports mdev-capable resource is not available


On Thu, 21 Sep 2023 at 17:27, Karl Kloppenborg <kkloppenborg at resetdata.com.au> wrote:
Hi Cyborg Team!
Karl from Helm Team.

When I create a VM with the correct flavor, the Cyborg agent creates the mdev, and I can see it in the output of nodedev-list --cap mdev.
However, Nova then fails with:
nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.


I don't remember exactly how Cyborg passes the devices to nova/libvirt, but this exception is raised because none of the available GPUs has either an existing unassigned mdev or the remaining capacity to create a new one.
You should first check sysfs to verify the state of your GPU devices and see how much vGPU capacity you still have.
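
As a rough illustration of that sysfs check, the sketch below reads the available_instances counter that the kernel mdev framework exposes for each supported type under /sys/class/mdev_bus. The function name vgpu_capacity is hypothetical, and the script assumes it runs on the compute host itself (gpu-c-01 in this thread) with the vendor vGPU driver loaded.

```python
from pathlib import Path


def vgpu_capacity(mdev_bus="/sys/class/mdev_bus"):
    """Return {(pci_address, mdev_type): available_instances} read from sysfs.

    Returns an empty dict when the mdev_bus directory does not exist,
    e.g. when no mdev-capable driver is loaded.
    """
    capacity = {}
    root = Path(mdev_bus)
    if not root.is_dir():
        return capacity
    pattern = "*/mdev_supported_types/*/available_instances"
    for counter in sorted(root.glob(pattern)):
        mdev_type = counter.parent.name                  # e.g. nvidia-566
        pci_address = counter.parent.parent.parent.name  # e.g. 0000:ca:01.1
        try:
            capacity[(pci_address, mdev_type)] = int(counter.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that cannot be read or parsed.
            continue
    return capacity


if __name__ == "__main__":
    for (pci_address, mdev_type), free in vgpu_capacity().items():
        print(pci_address, mdev_type, free)
```

A value of 0 for the requested type on every device would be consistent with the "No free space found" message, for instance if the mdev that Cyborg already created has used up the only available instance.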

-Sylvain

Once this happens, ARQ removes the mdev and cleans up.

I’ve got Cyborg 2023.2 running and have a device profile like so:
karl at Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
+-------------+---------------------------------------------------------------------------+
| Field       | Value                                                                     |
+-------------+---------------------------------------------------------------------------+
| created_at  | 2023-09-21 13:30:05+00:00                                                 |
| updated_at  | None                                                                      |
| uuid        | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae                                      |
| name        | VGPU_A40-Q48                                                              |
| groups      | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
| description | None                                                                      |
+-------------+---------------------------------------------------------------------------+
karl at Karls-Air ~ %

I can see the allocation candidate:
karl at Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
|  41 | VGPU=1     | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1                | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
karl at Karls-Air ~ %
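
As a quick sanity check that the device profile and the allocation candidate agree, the sketch below compares the traits marked 'required' in the profile group with the traits listed on the resource provider. The helper name missing_traits is hypothetical, and the inputs are simply copied from the two command outputs above; it does not query Placement or Cyborg.

```python
def missing_traits(profile_group, provider_traits):
    """Return the required traits of a device profile group that the provider lacks."""
    required = {
        key.split(":", 1)[1]
        for key, value in profile_group.items()
        if key.startswith("trait:") and value == "required"
    }
    return required - set(provider_traits)


# Group from the device profile and traits from the allocation candidate above
group = {"resources:VGPU": "1", "trait:CUSTOM_NVIDIA_2235_A40_48Q": "required"}
traits = "OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q".split(",")

print(missing_traits(group, traits))  # set()
```

An empty set here only says that the trait names match; it does not show whether the underlying GPU still has room for another mdev.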


Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?

Any help from the Cyborg or Nova teams would be fantastic.

Thanks,
Karl.
