Cyborg nova reports mdev-capable resource is not available

Karl Kloppenborg kkloppenborg at resetdata.com.au
Fri Sep 22 02:53:26 UTC 2023


Ah thank you for pointing me towards that Alex.

I guess I should probably look at the MIG pathway.
I wonder if it’s possible to use vGPU profiles in a MIG configuration.

Have you any experience with this?

Thanks,
Karl.

From: Alex Song (宋文平) <songwenping at inspur.com>
Date: Friday, 22 September 2023 at 12:17 pm
To: Karl Kloppenborg <kkloppenborg at resetdata.com.au>, sbauza at redhat.com <sbauza at redhat.com>
Cc: openstack-discuss at lists.openstack.org <openstack-discuss at lists.openstack.org>
Subject: Re: Cyborg nova reports mdev-capable resource is not available

Hi Karl,
         Your problem is similar to this bug: https://bugs.launchpad.net/nova/+bug/2015892
         I guess you have not split the card into MIG instances, if you are using an A-series card.

From: Karl Kloppenborg <kkloppenborg at resetdata.com.au>
Sent: 22 September 2023 0:43
To: Sylvain Bauza <sbauza at redhat.com>
Cc: openstack-discuss at lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available

Hi Sylvain,

Thanks for getting back to me.
So the vGPU is available and Cyborg is allocating it using ARQ binding.
You can see Nova receives this request:

2023-09-21 16:38:51.889 1901814 DEBUG nova.compute.manager [None req-97062e9c-0c44-480e-9918-4a5a810175b2 78e83e5a446e4071ae43e823135dcb3c 21eb701c2a1f48b38dab8f34c0a20902 - - default default] ARQs for spec:{'2d60c353-0419-4b67-8cb7-913fc6f5cef9': {'uuid': '2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'state': 'Bound', 'device_profile_name': 'VGPU_A40-Q48', 'device_profile_group_id': 0, 'hostname': 'gpu-c-01', 'device_rp_uuid': '229bf15f-5689-3d2c-b37b-5c8439ea6a71', 'instance_uuid': '1b090007-791b-4997-af89-0feb886cf11d', 'project_id': None, 'attach_handle_type': 'MDEV', 'attach_handle_uuid': '866bd6a5-b156-4251-a969-64fefb32f16f', 'attach_handle_info': {'asked_type': 'nvidia-566', 'bus': 'ca', 'device': '01', 'domain': '0000', 'function': '1', 'vgpu_mark': 'nvidia-566_0'}, 'links': [{'href': 'http://cyborg-api.openstack.svc.cluster.local:6666/accelerator/v2/accelerator_requests/2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'rel': 'self'}], 'created_at': '2023-09-21T16:38:42+00:00', 'updated_at': '2023-09-21T16:38:42+00:00'}}, ARQs for network:{} _build_resources /var/lib/openstack/lib/python3.10/site-packages/nova/compute/manager.py:2680

So the mdev is then allocated in the resource providers at that point.
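
To rule out a wrong PCI address, the attach_handle_info in the ARQ above can be turned into a PCI address and a libvirt node device name, then compared with the device list in Nova's "No free space found" message. This is only a minimal sketch: the helper name attach_handle_to_pci is hypothetical, and it assumes the pci_<domain>_<bus>_<device>_<function> naming seen in the Nova log.

```python
def attach_handle_to_pci(info):
    """Build a PCI address and a libvirt-style node device name
    from an ARQ attach_handle_info dict (hypothetical helper)."""
    # Extra keys such as 'asked_type' are simply ignored by format().
    address = "{domain}:{bus}:{device}.{function}".format(**info)
    nodedev = "pci_{domain}_{bus}_{device}_{function}".format(**info)
    return address, nodedev


# Values copied from the ARQ in the log line above.
info = {
    'asked_type': 'nvidia-566',
    'bus': 'ca',
    'device': '01',
    'domain': '0000',
    'function': '1',
    'vgpu_mark': 'nvidia-566_0',
}

address, nodedev = attach_handle_to_pci(info)
print(address)  # 0000:ca:01.1
print(nodedev)  # pci_0000_ca_01_1
```

If the resulting node device name does not appear in the list Nova searched, the two services are looking at different devices.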

Is there some cyborg nova patching code I am missing?




From: Sylvain Bauza <sbauza at redhat.com>
Date: Friday, 22 September 2023 at 1:49 am
To: Karl Kloppenborg <kkloppenborg at resetdata.com.au>
Cc: openstack-discuss at lists.openstack.org <openstack-discuss at lists.openstack.org>
Subject: Re: Cyborg nova reports mdev-capable resource is not available


On Thu, 21 Sept 2023 at 17:27, Karl Kloppenborg <kkloppenborg at resetdata.com.au> wrote:
Hi Cyborg Team!
Karl from Helm Team.

When creating a VM with the correct flavor, the mdev gets created by the Cyborg agent and I can see it in the output of virsh nodedev-list --cap mdev.
However Nova then fails with:
nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.


I don't remember exactly how Cyborg passes the devices to nova/libvirt, but this exception is raised because none of the available GPUs has either an existing mdev or the capacity to create a new one.
You should first check sysfs to double-check the state of your GPU devices, in order to understand how much vGPU capacity you still have.
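
One way to do that sysfs check is sketched below. It assumes the standard kernel mdev layout, /sys/class/mdev_bus/<pci address>/mdev_supported_types/<type>/available_instances; the function name mdev_capacity is made up for illustration, and the script has to run on the compute host itself.

```python
from pathlib import Path


def mdev_capacity(root="/sys/class/mdev_bus"):
    """Return {pci_address: {mdev_type: available_instances}}.

    Returns an empty dict when the mdev bus directory does not exist,
    for example when no mdev-capable driver is loaded.
    """
    report = {}
    base = Path(root)
    if not base.is_dir():
        return report
    for dev in sorted(base.iterdir()):
        types_dir = dev / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        types = {}
        for mdev_type in sorted(types_dir.iterdir()):
            avail = mdev_type / "available_instances"
            if avail.is_file():
                # The file holds a single integer: how many more mdevs
                # of this type can still be created on this device.
                types[mdev_type.name] = int(avail.read_text().strip())
        report[dev.name] = types
    return report


for addr, types in mdev_capacity().items():
    for name, free in types.items():
        print(addr, name, free)
```

A device whose requested type (nvidia-566 in this thread) shows 0 available instances cannot host another mdev of that type.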

-Sylvain

Once this happens, the ARQ removes the mdev and cleans up.

I’ve got Cyborg 2023.2 running and have a device profile like so:
karl at Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
+-------------+---------------------------------------------------------------------------+
| Field       | Value                                                                     |
+-------------+---------------------------------------------------------------------------+
| created_at  | 2023-09-21 13:30:05+00:00                                                 |
| updated_at  | None                                                                      |
| uuid        | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae                                      |
| name        | VGPU_A40-Q48                                                              |
| groups      | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
| description | None                                                                      |
+-------------+---------------------------------------------------------------------------+
karl at Karls-Air ~ %

I can see the allocation candidate:
karl at Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
|  41 | VGPU=1     | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1                | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
karl at Karls-Air ~ %
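
For reference, the relationship between the device profile group and the candidate row above can be spelled out in a few lines. This is a deliberately simplified sketch, not Placement's actual algorithm: the function candidate_matches is hypothetical, it ignores reserved inventory and allocation ratios, and it reads VGPU=0/1 as used/capacity.

```python
def candidate_matches(group, inventory, traits):
    """Check one device-profile group against one resource provider.

    group:     dict such as {'resources:VGPU': '1', 'trait:X': 'required'}
    inventory: dict mapping resource class to a (used, capacity) tuple
    traits:    set of trait names on the provider
    """
    for key, value in group.items():
        kind, _, name = key.partition(":")
        if kind == "resources":
            used, capacity = inventory.get(name, (0, 0))
            if capacity - used < int(value):
                return False
        elif kind == "trait":
            if value == "required" and name not in traits:
                return False
            if value == "forbidden" and name in traits:
                return False
    return True


# Values copied from the device profile and the candidate row above.
group = {'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}
inventory = {"VGPU": (0, 1)}
traits = {"OWNER_CYBORG", "CUSTOM_NVIDIA_2235_A40_48Q"}

print(candidate_matches(group, inventory, traits))  # True
```

So on the Placement side the request looks satisfiable, which is consistent with the failure only appearing later, in the libvirt driver.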


Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?

Any help from the Cyborg or Nova teams would be fantastic.

Thanks,
Karl.
