Cyborg nova reports mdev-capable resource is not available
Hi Cyborg Team!
Karl from Helm Team.
When creating a VM with the correct flavor, the mdev gets created by the Cyborg agent and I can see it in the output of nodedev-list --cap mdev.
However, Nova then fails with:
nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
Once this happens, the ARQ removes the mdev and cleans up.
I’ve got Cyborg 2023.2 running and have a device profile like so:
karl@Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
+-------------+---------------------------------------------------------------------------+
| Field       | Value                                                                     |
+-------------+---------------------------------------------------------------------------+
| created_at  | 2023-09-21 13:30:05+00:00                                                 |
| updated_at  | None                                                                      |
| uuid        | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae                                      |
| name        | VGPU_A40-Q48                                                              |
| groups      | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
| description | None                                                                      |
+-------------+---------------------------------------------------------------------------+
karl@Karls-Air ~ %
I can see the allocation candidate:
karl@Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
| 41 | VGPU=1 | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1 | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
karl@Karls-Air ~ %
Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?
Any help from the Cyborg or Nova teams would be fantastic.
Thanks, Karl.
On Thu, 21 Sep 2023 at 17:27, Karl Kloppenborg <kkloppenborg@resetdata.com.au> wrote:
Hi Cyborg Team!
Karl from Helm Team.
When creating a VM with the correct flavor, the mdev gets created by the Cyborg agent and I can see it in the output of nodedev-list --cap mdev.
However Nova then fails with:
nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
I don't exactly remember how Cyborg passes the devices to nova/libvirt, but this exception is raised because none of the available GPUs has either existing mdevs or the capability to create new ones.
You should first check sysfs to double-check the state of your GPU devices, in order to understand how much vGPU capacity you still have.
-Sylvain
Once this happens, the ARQ removes the mdev and cleans up.
I’ve got Cyborg 2023.2 running and have a device profile like so:
karl@Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
+-------------+---------------------------------------------------------------------------+
| Field | Value |
+-------------+---------------------------------------------------------------------------+
| created_at | 2023-09-21 13:30:05+00:00 |
| updated_at | None |
| uuid | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae |
| name | VGPU_A40-Q48 |
| groups | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
| description | None |
+-------------+---------------------------------------------------------------------------+
karl@Karls-Air ~ %
I can see the allocation candidate:
karl@Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
| 41 | VGPU=1 | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1 | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
karl@Karls-Air ~ %
Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?
Any help from the Cyborg or Nova teams would be fantastic.
Thanks, Karl.
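Sylvain's suggestion to check sysfs could be sketched roughly as follows. This is only an illustration, not code from Nova or Cyborg: it assumes the standard Linux mediated-device layout under /sys/class/mdev_bus, where each parent device exposes an available_instances file per supported type. The function name mdev_capacity and its parameter are made up for this example.

```python
from pathlib import Path


def mdev_capacity(mdev_bus_root="/sys/class/mdev_bus"):
    """Return {parent_device: {mdev_type: available_instances}} read from sysfs.

    Hypothetical helper: the default path is the usual kernel location, but
    it is passed in as a parameter so the function can be pointed elsewhere.
    """
    capacity = {}
    root = Path(mdev_bus_root)
    if not root.is_dir():
        # No mdev-capable parent devices are registered on this host.
        return capacity
    for device in sorted(root.iterdir()):
        types_dir = device / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        per_type = {}
        for mdev_type in sorted(types_dir.iterdir()):
            counter = mdev_type / "available_instances"
            if counter.is_file():
                # The file holds a single integer: how many more mdevs of
                # this type can still be created on this parent device.
                per_type[mdev_type.name] = int(counter.read_text().strip())
        capacity[device.name] = per_type
    return capacity


if __name__ == "__main__":
    for address, types in mdev_capacity().items():
        for type_name, free in types.items():
            print(f"{address} {type_name}: {free} free")
```

Run on the compute node, a type that reports 0 free on every parent device would be consistent with the "No free space found" message in the log, for example if the mdev created by the Cyborg agent already occupies the only slot.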
Hi Sylvain,
Thanks for getting back to me.
So the vGPU is available and Cyborg is allocating it using ARQ binding.
You can see Nova receives this request:
2023-09-21 16:38:51.889 1901814 DEBUG nova.compute.manager [None req-97062e9c-0c44-480e-9918-4a5a810175b2 78e83e5a446e4071ae43e823135dcb3c 21eb701c2a1f48b38dab8f34c0a20902 - - default default] ARQs for spec:{'2d60c353-0419-4b67-8cb7-913fc6f5cef9': {'uuid': '2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'state': 'Bound', 'device_profile_name': 'VGPU_A40-Q48', 'device_profile_group_id': 0, 'hostname': 'gpu-c-01', 'device_rp_uuid': '229bf15f-5689-3d2c-b37b-5c8439ea6a71', 'instance_uuid': '1b090007-791b-4997-af89-0feb886cf11d', 'project_id': None, 'attach_handle_type': 'MDEV', 'attach_handle_uuid': '866bd6a5-b156-4251-a969-64fefb32f16f', 'attach_handle_info': {'asked_type': 'nvidia-566', 'bus': 'ca', 'device': '01', 'domain': '0000', 'function': '1', 'vgpu_mark': 'nvidia-566_0'}, 'links': [{'href': 'http://cyborg-api.openstack.svc.cluster.local:6666/accelerator/v2/accelerato...', 'rel': 'self'}], 'created_at': '2023-09-21T16:38:42+00:00', 'updated_at': '2023-09-21T16:38:42+00:00'}}, ARQs for network:{} _build_resources /var/lib/openstack/lib/python3.10/site-packages/nova/compute/manager.py:2680
So the mdev is then allocated in the resource providers at that point.
Is there some Cyborg/Nova patching code I am missing?
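The attach handle in the ARQ gives the device as separate domain, bus, device and function fields, while Nova's log names devices in libvirt node-device form (for example pci_0000_4b_03_1). A minimal sketch, not taken from Cyborg or Nova, of putting both into the same form so they can be compared; the helper names are hypothetical, and because the device list in the original log is truncated, only its first entry is used here:

```python
def pci_address(info):
    """Format a Cyborg attach_handle_info dict as domain:bus:device.function."""
    # Extra keys such as 'asked_type' and 'vgpu_mark' are ignored by format().
    return "{domain}:{bus}:{device}.{function}".format(**info)


def libvirt_nodedev_name(info):
    """Format the same fields the way libvirt names PCI node devices."""
    return "pci_{domain}_{bus}_{device}_{function}".format(**info)


# attach_handle_info copied from the ARQ in the Nova debug log.
handle = {
    "asked_type": "nvidia-566",
    "bus": "ca",
    "device": "01",
    "domain": "0000",
    "function": "1",
    "vgpu_mark": "nvidia-566_0",
}

# Only the first entry survives in the truncated log line.
devices_nova_tried = ["pci_0000_4b_03_1"]

name = libvirt_nodedev_name(handle)
print(pci_address(handle))         # 0000:ca:01.1
print(name)                        # pci_0000_ca_01_1
print(name in devices_nova_tried)  # False for the one visible entry
```

Since the list in the log is truncated, a False here proves nothing by itself; the comparison only becomes meaningful against the full list of devices that Nova reports.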
Hi Karl,
Your problem looks similar to this bug: https://bugs.launchpad.net/nova/+bug/2015892
I guess you have not split the MIG, if you are using an A-series card.
229bf15f-5689-3d2c-b37b-5c8439ea6a71 = | = VGPU=3D0/1 &nb= sp; | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q = |</span><span lang=3DEN-AU = style=3D'font-size:11.0pt'><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>karl@Karls-Air ~ = %</span><span lang=3DEN-AU = style=3D'font-size:11.0pt'><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><¯span = lang=3DEN-AU style=3D'font-size:11.0pt'><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU style=3D'font-size:11.0pt'><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>Am I missing = something critical here? 
Because I cannot seem to figure this out=A1=AD = have I got a PCI address wrong, or something?</span><span lang=3DEN-AU = style=3D'font-size:11.0pt'><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU style=3D'font-size:11.0pt'><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>Any help from the = Cyborg or Nova teams would be fantastic.</span><span lang=3DEN-AU = style=3D'font-size:11.0pt'><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU style=3D'font-size:11.0pt'><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt;color:#1D1D1D'>Thanks,<br>Karl.</span><span = lang=3DEN-AU style=3D'font-size:11.0pt'><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt'> <o:p></o:p></span></p></div></div></div>= </blockquote></div></div></div></div></div></body></html> ------=_NextPart_000_0292_01D9ED3D.C9B1F010-- í0¢0 cÊR,S8vMÂä:ÐVê0 *H÷ 0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA0 170109092830Z 270109093829Z0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA0"0 *H÷ 0 «ä5ïc$Œ©æ'µ¯Þ6>úUKÛdÔ²Áe9Î~{BîÒLgD÷*wvVÊŠ/DýUj_xá\m/ óž=kзéGÙœQ€ýx~Wùgk ÛÜøãÔ7É6NçÏ*?n°Ê²mhùè{ïôÌÆ 7üF-Î<@ÃÓͬWçÅyåLZrF 6~føÈ×T~$0d¡ýL|zšøW=ötÚ%ýq,¥Ã~Ÿ" ÀýŸÑö2T,QÕÔ,dºÂÅ^§ÈôïåJ)ëVvp Ó£f0d0 +7CA0U0Uÿ0ÿ0U^YŠŽLX`Nöµ¥9Š2Á5j0 +70 *H÷ JÄß»íu+¶ Wù«áMÈL¯i8y1ü'áRæL€uK·u`K=9@žnA0׊qÓÂéÑVxÓf Ý»Åü<ÔUrpŠz¶Ï7vèÑ¥`oñ÷7Ä°ù¬?Þ`PMKŠžp»9 ìcwg UhÁþuu|6~ŒÁècŸŸkl!êkMÔS3B±Æù¯:«ççA^5"ÆX¯ èË&á)PNäŒØkäA-7fGR,¿äŸœÞ}ÖfeKéK¥[ 
H¢WžÓ{oÊù ,F_5°ÎZË>u0C0+ ~+mçbí++0 *H÷ 0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA0 191128064731Z 241126064731Z010 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U浪朮信æ¯10U å®æå¹³1%0# *H÷ songwenping@inspur.com0"0 *H÷ 0 ã4ÒEd$i÷18#,åH~èyµþJRáLêg¡$@6vÔ<sâelPO4ýooXZç1¢iéÁ IòKjÚO§h;Rfùê7Ÿp¬ŒæäÝPòüÿ~æ E}é{có» (ù'+zÞ s ,!K£;û¥¶DEu1+ðÏ»õØÛähJjŸj']a¥Î·7¡õòhõßHŒ8QzÆ{ËÍ73>, ªí|gœ0y€Xv'ÚÊ ¬ÀO#)³°øÕM^œ¢VñóŸ¯ìFñ;µ5HŽù£Ã0¿0= +700.&+7ò©×z©=÷Ø\Jý&§Md\0)U%"0 ++ +7 0U 05 +7 (0&0 +0 +0 +7 0D *H÷ 7050*H÷ 0*H÷ 0+0 *H÷ 0UW 0²^qûìÇUã¡b[]s0U#0^YŠŽLX`Nöµ¥9Š2Á5j0U00ÿ ü ùºldap:///CN=INSPUR-CA,CN=JTCA2012,CN=CDP,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=home,DC=langchao,DC=com?certificateRevocationList?base?objectClass=cRLDistributionPoint:http://JTCA2012.home.langchao.com/CertEnroll/INSPUR-CA.crl0)+00±+0€ldap:///CN=INSPUR-CA,CN=AIA,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=home,DC=langchao,DC=com?cACertificate?base?objectClass=certificationAuthority0a+0Uhttp://JTCA2012.home.langchao.com/CertEnroll/JTCA2012.home.langchao.com_INSPUR-CA.crt0IUB0@ & +7 songwenping@inspur.comsongwenping@inspur.com0 *H÷ §öãý+·>³â¿Xíe-÷¯[ÂΩÉ\ÃAæMà.U'ù||Wø$Šn ' `9ݵ¯éÉM^mz0Äð -AÒ¯×Tï+£§¡RbfzÆX#ùkY<ØDþxólõA%ÑøN¬×grÛÚ·ãF!GþóüVœÍ±Ð[ÏL®GÅ3€$ /DžÃiú_h|cI}DôþyªÉñ Ç}³W$_Äáy]ò±DÄ1(tr)6gêäÑæÄLd$!t{x±Z±·þÏeqmkþÒÒ÷r³ßxõ%lUÉÒ4d100p0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA~+mçbí++0 + ø0 *H÷ 1 *H÷ 0 *H÷ 1 230922021601Z0# *H÷ 1ÊÔF:œæ ùáQ3¯Rþå0 +71r0p0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA~+mçbí++0*H÷ 1r p0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA~+mçbí++0 *H÷ 1 00 `He*0 `He0 *H÷ 0 `He0*H÷ 0 *H÷ @0+0 `He0 `He0 `He0 *H÷ yÑWÑŸ0!à C_ ¬4Ë Úý¥W¬?C s}ö9nðéÒU'gžò©€ÞÚŒÔ<Õºëò:@{*LwÖ£õúÔÄñ>õaÄ1&FÿCTŒ á#òxâÚŽb×ËsÇCÏ}0ÔPÙ; Ü]94Ä¡bÔãù#¿÷f7P®ÄŒ84Ò'Î0{t²S6DìBÀ2`/Õûp@š«.>§\ßû6äSVw RvÁh×{»§;<¬rÁ_dx( c²þÅ(12®/45·î:ÒW1þMcRmÒ7ŒU³
Ah, thank you for pointing me towards that, Alex.

I guess I should probably look at the MIG pathway.
I wonder if it’s possible to do vGPU profiles in a MIG configuration.

Have you any experience with this?

Thanks,
Karl.

From: Alex Song (宋文平) <songwenping@inspur.com>
Date: Friday, 22 September 2023 at 12:17 pm
To: Karl Kloppenborg <kkloppenborg@resetdata.com.au>, sbauza@redhat.com <sbauza@redhat.com>
Cc: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: Cyborg nova reports mdev-capable resource is not available

Hi Karl,

Your problem is similar to this bug: https://bugs.launchpad.net/nova/+bug/2015892
I guess you have not split the card into MIG instances, if you are using an A-series card.

From: Karl Kloppenborg [mailto:kkloppenborg@resetdata.com.au]
Sent: 22 September 2023 0:43
To: Sylvain Bauza <sbauza@redhat.com>
Cc: openstack-discuss@lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available

Hi Sylvain,

Thanks for getting back to me.
So the vGPU is available and Cyborg is allocating it using ARQ binding.
You can see Nova receives this request:

2023-09-21 16:38:51.889 1901814 DEBUG nova.compute.manager [None req-97062e9c-0c44-480e-9918-4a5a810175b2 78e83e5a446e4071ae43e823135dcb3c 21eb701c2a1f48b38dab8f34c0a20902 - - default default] ARQs for spec:{'2d60c353-0419-4b67-8cb7-913fc6f5cef9': {'uuid': '2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'state': 'Bound', 'device_profile_name': 'VGPU_A40-Q48', 'device_profile_group_id': 0, 'hostname': 'gpu-c-01', 'device_rp_uuid': '229bf15f-5689-3d2c-b37b-5c8439ea6a71', 'instance_uuid': '1b090007-791b-4997-af89-0feb886cf11d', 'project_id': None, 'attach_handle_type': 'MDEV', 'attach_handle_uuid': '866bd6a5-b156-4251-a969-64fefb32f16f', 'attach_handle_info': {'asked_type': 'nvidia-566', 'bus': 'ca', 'device': '01', 'domain': '0000', 'function': '1', 'vgpu_mark': 'nvidia-566_0'}, 'links': [{'href': 'http://cyborg-api.openstack.svc.cluster.local:6666/accelerator/v2/accelerato...', 'rel': 'self'}], 'created_at': '2023-09-21T16:38:42+00:00', 'updated_at': '2023-09-21T16:38:42+00:00'}}, ARQs for network:{} _build_resources /var/lib/openstack/lib/python3.10/site-packages/nova/compute/manager.py:2680

So the mdev is then allocated in the resource providers at that point.

Is there some Cyborg/Nova patching code I am missing?

From: Sylvain Bauza <sbauza@redhat.com>
Date: Friday, 22 September 2023 at 1:49 am
To: Karl Kloppenborg <kkloppenborg@resetdata.com.au>
Cc: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: Cyborg nova reports mdev-capable resource is not available

On Thu, 21 Sept 2023 at 17:27, Karl Kloppenborg <kkloppenborg@resetdata.com.au> wrote:

> Hi Cyborg Team!
> Karl from Helm Team. When creating a VM with the correct flavor, the mdev gets created by cyborg agent and I can see it in the nodedev-list --cap mdev.
> However Nova then fails with:
>
> nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
> 2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
> 2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
> 2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
> 2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
> 2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.

I don't exactly remember how Cyborg passes the devices to nova/libvirt, but this exception is raised because none of the available GPUs has either an existing mdev or the capacity to create a new one.
You should first check sysfs to verify the state of your GPU devices and see how much vGPU capacity you still have.

-Sylvain

> Once this happened, ARQ removes the mdev and cleans up.
>
> I’ve got Cyborg 2023.2 running and have a device profile like so:
>
> karl@Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
> +-------------+---------------------------------------------------------------------------+
> | Field       | Value                                                                     |
> +-------------+---------------------------------------------------------------------------+
> | created_at  | 2023-09-21 13:30:05+00:00                                                 |
> | updated_at  | None                                                                      |
> | uuid        | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae                                      |
> | name        | VGPU_A40-Q48                                                              |
> | groups      | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
> | description | None                                                                      |
> +-------------+---------------------------------------------------------------------------+
> karl@Karls-Air ~ %
>
> I can see the allocation candidate:
>
> karl@Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
> | 41 | VGPU=1 | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1 | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
> karl@Karls-Air ~ %
>
> Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?
>
> Any help from the Cyborg or Nova teams would be fantastic.
>
> Thanks,
> Karl.
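
Sylvain's suggestion to check sysfs can be sketched roughly as below. This is only an illustration, not something taken from Nova or Cyborg: it assumes the compute host exposes mediated-device parents under /sys/class/mdev_bus with the usual mdev_supported_types/<type>/available_instances attributes, and the function names mdev_capacity and read_text are made up for this example.

```python
from pathlib import Path


def read_text(path: Path) -> str:
    """Return the stripped contents of a sysfs attribute, or '' if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def mdev_capacity(root: Path = Path("/sys/class/mdev_bus")) -> list[dict]:
    """Collect available_instances for every mdev type of every parent device.

    Assumption: `root` follows the kernel mdev sysfs layout
    <root>/<parent>/mdev_supported_types/<type>/{name,available_instances}.
    """
    rows = []
    if not root.is_dir():
        # No mdev-capable parent devices are registered on this host.
        return rows
    for parent in sorted(root.iterdir()):
        types_dir = parent / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        for mdev_type in sorted(types_dir.iterdir()):
            available = read_text(mdev_type / "available_instances")
            rows.append({
                "parent": parent.name,
                "type": mdev_type.name,
                "name": read_text(mdev_type / "name"),
                "available": int(available) if available.isdigit() else None,
            })
    return rows


if __name__ == "__main__":
    for row in mdev_capacity():
        print(f"{row['parent']}  {row['type']:<12} {row['name']:<24} "
              f"available={row['available']}")
```

Run on the compute host (gpu-c-01 in the ARQ log), a parent that reports available=0 for the requested type (nvidia-566 in this thread) cannot host another mdev of that type, which would be consistent with Nova's "No free space found" message.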
Please refer to the official NVIDIA documentation: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#introduction

From: Karl Kloppenborg [mailto:kkloppenborg@resetdata.com.au]
Sent: 22 September 2023 10:53
To: Alex Song (宋文平) <songwenping@inspur.com>; sbauza@redhat.com
Cc: openstack-discuss@lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available

Ah, thank you for pointing me towards that, Alex.

I guess I should probably look at the MIG pathway.
I wonder if it’s possible to do vGPU profiles in a MIG configuration.

Have you any experience with this?

Thanks,
Karl.
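
On the "have I got a PCI address wrong" question, one quick cross-check is to turn the attach_handle_info from the ARQ log into the libvirt-style node device name that Nova prints and compare the two. The sketch below is only an illustration: nodedev_name is a hypothetical helper, the dictionary values are copied from the ARQ log in this thread, and only the first entry of Nova's device list is visible because the rest was truncated in the original post.

```python
def nodedev_name(info: dict) -> str:
    """Build a libvirt-style node device name (pci_<domain>_<bus>_<slot>_<function>)
    from the fields of a Cyborg ARQ attach_handle_info dictionary."""
    return "pci_{domain}_{bus}_{device}_{function}".format(**info)


# attach_handle_info values copied from the ARQ log quoted in this thread.
arq_info = {
    "asked_type": "nvidia-566",
    "bus": "ca",
    "device": "01",
    "domain": "0000",
    "function": "1",
    "vgpu_mark": "nvidia-566_0",
}

# Only the first device of Nova's list survives in the thread; the remainder
# was truncated by the poster, so a miss here is a hint, not proof.
nova_devices = ["pci_0000_4b_03_1"]

name = nodedev_name(arq_info)
print(name, "in Nova's visible list:", name in nova_devices)
```

With the values from the thread this yields pci_0000_ca_01_1, which is not the one device visible in Nova's list. Checking the full, untruncated list in the nova-compute log would show whether Nova is searching the same parent device that Cyborg bound.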
= 'device_profile_name': 'VGPU_A40-Q48', 'device_profile_group_id': 0, = 'hostname': 'gpu-c-01', 'device_rp_uuid': = '229bf15f-5689-3d2c-b37b-5c8439ea6a71', 'instance_uuid': = '1b090007-791b-4997-af89-0feb886cf11d', 'project_id': None, = 'attach_handle_type': 'MDEV', 'attach_handle_uuid': = '866bd6a5-b156-4251-a969-64fefb32f16f', 'attach_handle_info': = {'asked_type': 'nvidia-566', 'bus': 'ca', 'device': '01', 'domain': = '0000', 'function': '1', 'vgpu_mark': 'nvidia-566_0'}, 'links': = [{'href': = 'http://cyborg-api.openstack.svc.cluster.local:6666/accelerator/v2/accele= rator_requests/2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'rel': 'self'}], = 'created_at': '2023-09-21T16:38:42+00:00', 'updated_at': = '2023-09-21T16:38:42+00:00'}}, ARQs for network:{} _build_resources = /var/lib/openstack/lib/python3.10/site-packages/nova/compute/manager.py:2= 680</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal><span lang=3DEN-AU = style=3D'font-size:11.0pt;mso-fareast-language:EN-US'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt;mso-fareast-language:EN-US'>So = the mdev is then allocated in the resource providers at that = point.</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal><span lang=3DEN-AU = style=3D'font-size:11.0pt;mso-fareast-language:EN-US'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt;mso-fareast-language:EN-US'>Is = there some cyborg nova patching code I am missing?</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal><span = lang=3DEN-AU = style=3D'font-size:11.0pt;mso-fareast-language:EN-US'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal><span = lang=3DEN-AU = style=3D'font-size:11.0pt;mso-fareast-language:EN-US'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal><span = lang=3DEN-AU = 
style=3D'font-size:11.0pt;mso-fareast-language:EN-US'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal><span = lang=3DEN-AU = style=3D'font-size:11.0pt;mso-fareast-language:EN-US'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><div = id=3Dmail-editor-reference-message-container><div><div = style=3D'border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm = 0cm 0cm'><p class=3DMsoNormal style=3D'margin-bottom:12.0pt'><b><span = lang=3DEN-AU style=3D'font-size:12.0pt;color:black'>From: = </span></b><span lang=3DEN-AU = style=3D'font-size:12.0pt;color:black'>Sylvain Bauza <</span><span = lang=3DEN-US><a href=3D"mailto:sbauza@redhat.com"><span lang=3DEN-GB = style=3D'font-size:12.0pt'>sbauza@redhat.com</span></a></span><span = lang=3DEN-AU style=3D'font-size:12.0pt;color:black'>><br><b>Date: = </b>Friday, 22 September 2023 at 1:49 am<br><b>To: </b>Karl Kloppenborg = <</span><span lang=3DEN-US><a = href=3D"mailto:kkloppenborg@resetdata.com.au"><span lang=3DEN-GB = style=3D'font-size:12.0pt'>kkloppenborg@resetdata.com.au</span></a></span= style=3D'font-size:12.0pt;color:black'>><br><b>Cc: </b></span><span = lang=3DEN-US><a = href=3D"mailto:openstack-discuss@lists.openstack.org"><span lang=3DEN-GB = style=3D'font-size:12.0pt'>openstack-discuss@lists.openstack.org</span></= a></span><span lang=3DEN-AU style=3D'font-size:12.0pt;color:black'> = <</span><span lang=3DEN-US><a = href=3D"mailto:openstack-discuss@lists.openstack.org"><span lang=3DEN-GB = style=3D'font-size:12.0pt'>openstack-discuss@lists.openstack.org</span></= a></span><span lang=3DEN-AU = style=3D'font-size:12.0pt;color:black'>><br><b>Subject: </b>Re: = Cyborg nova reports mdev-capable resource is not available</span><span = lang=3DEN-AU><o:p></o:p></span></p></div><div><div><p = class=3DMsoNormal><span lang=3DEN-AU = style=3D'font-size:11.0pt'> </span><span = lang=3DEN-AU><o:p></o:p></span></p></div><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt'> 
</span><span = lang=3DEN-AU><o:p></o:p></span></p><div><div><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt'>Le jeu. 21 sept. 2023 = =A8=A4 17:27, Karl Kloppenborg <</span><span lang=3DEN-US><a = href=3D"mailto:kkloppenborg@resetdata.com.au"><span lang=3DEN-GB = style=3D'font-size:11.0pt'>kkloppenborg@resetdata.com.au</span></a></span= style=3D'border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm = 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0cm;margin-bottom:5= .0pt'><div><div><div><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>Hi Cyborg = Team!</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>Karl from Helm = Team.</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>When creating a VM = with the correct flavor, the mdev gets created by cyborg agent and I can = see it in the nodedev-list --cap mdev.</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>However Nova then = fails with:</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt;color:#1D1D1D'>nova.virt.libvirt.driver = [<removed>- - default default] Searching for available mdevs... 
= _get_existing_mdevs_not_assigned = /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.= py</span><span lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>:8357</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>2023-09-21 = 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - = default default] Available mdevs at: set().</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>2023-09-21 = 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - = default default] No available mdevs where found. Creating an new one... = _allocate_mdevs = /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driv</s= pan><span lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt;color:#1D1D1D'>er.py:8496</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>2023-09-21 = 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - = default default] Attempting to create new mdev... 
= _create_new_mediated_device = /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.= py:8385</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>2023-09-21 = 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - = default default] Failed to create mdev. No free space found among the = following devices: ['pci_0000_4b_03_1', =A1=AD <truncated = list>].</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>2023-09-21 = 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - = default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] = Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: = Insufficient compute resources: mdev-capable resource is not = available.</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p></div></div></div></blockquote><div><p= class=3DMsoNormal><span lang=3DEN-AU = style=3D'font-size:11.0pt'> </span><span = lang=3DEN-AU><o:p></o:p></span></p></div><div><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt'>I don't exactly remember how = Cyborg passes the devices to nova/libvirt but this exception is because = none of the available GPUs have either existing mdevs or capability for = creating mdevs.</span><span = lang=3DEN-AU><o:p></o:p></span></p></div><div><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt'>You should first check sysfs to = double-check the state of our GPU devices in order to understand how = much of vGPU capacity you still have. 
</span><span = lang=3DEN-AU><o:p></o:p></span></p></div><div><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt'> </span><span = lang=3DEN-AU><o:p></o:p></span></p></div><div><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt'>-Sylvain</span><span = lang=3DEN-AU><o:p></o:p></span></p></div><div><p class=3DMsoNormal><span = lang=3DEN-AU style=3D'font-size:11.0pt'> </span><span = lang=3DEN-AU><o:p></o:p></span></p></div><blockquote = style=3D'border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm = 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0cm;margin-bottom:5= .0pt'><div><div><div><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>Once this = happened, ARQ removes the mdev and cleans up.</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>I=A1=AFve got = Cyborg 2023.2 running and have a device profile like so:</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>karl@Karls-Air ~ % = openstack accelerator device profile show = e2b07e11-fe69-4f33-83fc-0f9e38adb7ae</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt;color:#1D1D1D'>+-------------+-----------------= ----------------------------------------------------------+</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = 
style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>| = Field | = Value &n= bsp; &nb= sp; &nbs= p;  = ; = |</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt;color:#1D1D1D'>+-------------+-----------------= ----------------------------------------------------------+</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>| created_at = | 2023-09-21 = 13:30:05+00:00  = ; = &= nbsp; &n= bsp; |</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>| updated_at = | = None &nb= sp; &nbs= p;  = ; = &= nbsp; |</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>| = uuid | = e2b07e11-fe69-4f33-83fc-0f9e38adb7ae &= nbsp; &n= bsp; &nb= sp; |</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>| = name | = VGPU_A40-Q48 &= nbsp; &n= bsp; &nb= sp; &nbs= p;  = ; |</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>| = groups | [{'resources:VGPU': '1', = 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU 
style=3D'font-size:11.0pt;color:#1D1D1D'>| description | = None &nb= sp; &nbs= p;  = ; = &= nbsp; |</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt;color:#1D1D1D'>+-------------+-----------------= ----------------------------------------------------------+</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>karl@Karls-Air ~ = %</span><span lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>I can see the = allocation candidate:</span><span lang=3DEN-AU><o:p></o:p></span></p><p = class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>karl@Karls-Air ~ % = openstack allocation candidate list --resource VGPU=3D1 | grep = A40</span><span lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>| 41 | = VGPU=3D1 | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 = | = VGPU=3D0/1 &nb= sp; | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q = |</span><span lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>karl@Karls-Air ~ = %</span><span lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = 
lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal É= style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>Am I missing = something critical here? Because I cannot seem to figure this out=A1=AD = have I got a PCI address wrong, or something?</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'>Any help from the = Cyborg or Nova teams would be fantastic.</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt;color:#1D1D1D'> </span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU = style=3D'font-size:11.0pt;color:#1D1D1D'>Thanks,<br>Karl.</span><span = lang=3DEN-AU><o:p></o:p></span></p><p class=3DMsoNormal = style=3D'mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span = lang=3DEN-AU style=3D'font-size:11.0pt'> </span><span = lang=3DEN-AU><o:p></o:p></span></p></div></div></div></blockquote></div><= /div></div></div></div></div></div></body></html> ------=_NextPart_000_03AC_01D9ED5B.A9ADB240-- í0¢0 cÊR,S8vMÂä:ÐVê0 *H÷ 0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA0 170109092830Z 270109093829Z0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA0"0 *H÷ 0 
«ä5ïc$Œ©æ'µ¯Þ6>úUKÛdÔ²Áe9Î~{BîÒLgD÷*wvVÊŠ/DýUj_xá\m/ óž=kзéGÙœQ€ýx~Wùgk ÛÜøãÔ7É6NçÏ*?n°Ê²mhùè{ïôÌÆ 7üF-Î<@ÃÓͬWçÅyåLZrF 6~føÈ×T~$0d¡ýL|zšøW=ötÚ%ýq,¥Ã~Ÿ" ÀýŸÑö2T,QÕÔ,dºÂÅ^§ÈôïåJ)ëVvp Ó£f0d0 +7CA0U0Uÿ0ÿ0U^YŠŽLX`Nöµ¥9Š2Á5j0 +70 *H÷ JÄß»íu+¶ Wù«áMÈL¯i8y1ü'áRæL€uK·u`K=9@žnA0׊qÓÂéÑVxÓf Ý»Åü<ÔUrpŠz¶Ï7vèÑ¥`oñ÷7Ä°ù¬?Þ`PMKŠžp»9 ìcwg UhÁþuu|6~ŒÁècŸŸkl!êkMÔS3B±Æù¯:«ççA^5"ÆX¯ èË&á)PNäŒØkäA-7fGR,¿äŸœÞ}ÖfeKéK¥[ H¢WžÓ{oÊù ,F_5°ÎZË>u0C0+ ~+mçbí++0 *H÷ 0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA0 191128064731Z 241126064731Z010 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U浪朮信æ¯10U å®æå¹³1%0# *H÷ songwenping@inspur.com0"0 *H÷ 0 ã4ÒEd$i÷18#,åH~èyµþJRáLêg¡$@6vÔ<sâelPO4ýooXZç1¢iéÁ IòKjÚO§h;Rfùê7Ÿp¬ŒæäÝPòüÿ~æ E}é{có» (ù'+zÞ s ,!K£;û¥¶DEu1+ðÏ»õØÛähJjŸj']a¥Î·7¡õòhõßHŒ8QzÆ{ËÍ73>, ªí|gœ0y€Xv'ÚÊ ¬ÀO#)³°øÕM^œ¢VñóŸ¯ìFñ;µ5HŽù£Ã0¿0= +700.&+7ò©×z©=÷Ø\Jý&§Md\0)U%"0 ++ +7 0U 05 +7 (0&0 +0 +0 +7 0D *H÷ 7050*H÷ 0*H÷ 0+0 *H÷ 0UW 0²^qûìÇUã¡b[]s0U#0^YŠŽLX`Nöµ¥9Š2Á5j0U00ÿ ü ùºldap:///CN=INSPUR-CA,CN=JTCA2012,CN=CDP,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=home,DC=langchao,DC=com?certificateRevocationList?base?objectClass=cRLDistributionPoint:http://JTCA2012.home.langchao.com/CertEnroll/INSPUR-CA.crl0)+00±+0€ldap:///CN=INSPUR-CA,CN=AIA,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=home,DC=langchao,DC=com?cACertificate?base?objectClass=certificationAuthority0a+0Uhttp://JTCA2012.home.langchao.com/CertEnroll/JTCA2012.home.langchao.com_INSPUR-CA.crt0IUB0@ & +7 songwenping@inspur.comsongwenping@inspur.com0 *H÷ §öãý+·>³â¿Xíe-÷¯[ÂΩÉ\ÃAæMà.U'ù||Wø$Šn ' `9ݵ¯éÉM^mz0Äð -AÒ¯×Tï+£§¡RbfzÆX#ùkY<ØDþxólõA%ÑøN¬×grÛÚ·ãF!GþóüVœÍ±Ð[ÏL®GÅ3€$ /DžÃiú_h|cI}DôþyªÉñ Ç}³W$_Äáy]ò±DÄ1(tr)6gêäÑæÄLd$!t{x±Z±·þÏeqmkþÒÒ÷r³ßxõ%lUÉÒ4d100p0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA~+mçbí++0 + ø0 *H÷ 1 *H÷ 0 *H÷ 1 230922054952Z0# *H÷ 1;ý¯:RŒ¥ØwÒ&ŒIq~»0 +71r0p0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA~+mçbí++0*H÷ 1r p0Y10 &ò,dcom10 &ò,dlangchao10 &ò,dhome10U INSPUR-CA~+mçbí++0 *H÷ 1 00 `He*0 `He0 *H÷ 
0 `He0*H÷ 0 *H÷ @0+0 `He0 `He0 `He0 *H÷ ÎyKT[º{CÚ§ÿrØ|·ü2Îá.Ên²Á3f."+ïCGQ&ÙBäZɿκáÑú6Ë£Á·J¢Š±4ŽúŽBC?N|äjNPQË*šCÆ`8ów'[³68Œ0ª=žo £ÙÚ1M© L ÷mcGûeëö ÔEVeÀ[šKP1áæäY6no5ó÷a©Ö#Q-Ì»¯U1ÆÜp^*AÅ>ø7«±è[V8BmÁÌ?]ñä}@®^Ù~ÚÙ.Œ9ÚâG±LÃiºXö ÞÿA÷vH¬Ô
Hi,

sorry for stepping in so late. We have seen an issue very similar to https://bugs.launchpad.net/nova/+bug/2015892 and to the one described here. Our case is maybe a bit different: we have a box with 4 A100 cards, and we wanted to set it up with MIG enabled on all of them. What we saw was that more mediated devices were created than were actually available, and that both Nova and Cyborg created too many resource providers in placement. Then, depending on which resource provider was selected, the VM failed to spawn because the selected device was not available.

We now have a patch for Nova which seems to work for our use case. The basic idea of the patch is to check how many instances are available for each of the physical GPUs (from the description field of one of the devices, e.g. /sys/class/mdev_bus/0000\:81\:00.4/mdev_supported_types/nvidia-476/description) and to ensure that only this number of resource providers is created in placement. A side effect is that it also works if the GPUs in the hypervisor have different MIG setups (with the restriction that each card must be set up with a single MIG type only, e.g. 2g.10gb,2g.10gb,2g.10gb for GPU0 and 3g.20gb,3g.20gb for GPU1 etc).

It's not perfect, and in particular I have not tested whether this breaks the regular VGPU setup for time-based slicing of the GPUs, nor how to do this with Cyborg. The VGPU type selection is done via a trait and a custom resource. If you're interested, feel free to get back to me.

Cheers, Ulrich

On 22/09/2023 07:49, Alex Song (宋文平) wrote:
Please refer to the official NVIDIA documentation: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#introduction
*From:* Karl Kloppenborg [mailto:kkloppenborg@resetdata.com.au] *Sent:* 22 September 2023 10:53 *To:* Alex Song (宋文平) <songwenping@inspur.com>; sbauza@redhat.com *Cc:* openstack-discuss@lists.openstack.org *Subject:* Re: Cyborg nova reports mdev-capable resource is not available
Ah, thank you for pointing me towards that, Alex.
I guess I should probably look at the MIG pathway.
I wonder if it’s possible to do vGPU profiles in a MIG configuration.
Have you any experience with this?
Thanks, Karl.
*From: *Alex Song (宋文平) <songwenping@inspur.com> *Date: *Friday, 22 September 2023 at 12:17 pm *To: *Karl Kloppenborg <kkloppenborg@resetdata.com.au>, sbauza@redhat.com <sbauza@redhat.com> *Cc: *openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org> *Subject: *Re: Cyborg nova reports mdev-capable resource is not available
Hi Karl,
Your problem is similar to this bug: https://bugs.launchpad.net/nova/+bug/2015892
My guess is that you have not split the GPU into MIG instances, assuming you are using an A-series card.
*From:* Karl Kloppenborg [mailto:kkloppenborg@resetdata.com.au] *Sent:* 22 September 2023 0:43 *To:* Sylvain Bauza <sbauza@redhat.com> *Cc:* openstack-discuss@lists.openstack.org *Subject:* Re: Cyborg nova reports mdev-capable resource is not available
Hi Sylvain,
Thanks for getting back to me.
So the vGPU is available and cyborg is allocating it using ARQ binding.
You can see Nova receives this request:
2023-09-21 16:38:51.889 1901814 DEBUG nova.compute.manager [None req-97062e9c-0c44-480e-9918-4a5a810175b2 78e83e5a446e4071ae43e823135dcb3c 21eb701c2a1f48b38dab8f34c0a20902 - - default default] ARQs for spec:{'2d60c353-0419-4b67-8cb7-913fc6f5cef9': {'uuid': '2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'state': 'Bound', 'device_profile_name': 'VGPU_A40-Q48', 'device_profile_group_id': 0, 'hostname': 'gpu-c-01', 'device_rp_uuid': '229bf15f-5689-3d2c-b37b-5c8439ea6a71', 'instance_uuid': '1b090007-791b-4997-af89-0feb886cf11d', 'project_id': None, 'attach_handle_type': 'MDEV', 'attach_handle_uuid': '866bd6a5-b156-4251-a969-64fefb32f16f', 'attach_handle_info': {'asked_type': 'nvidia-566', 'bus': 'ca', 'device': '01', 'domain': '0000', 'function': '1', 'vgpu_mark': 'nvidia-566_0'}, 'links': [{'href': 'http://cyborg-api.openstack.svc.cluster.local:6666/accelerator/v2/accelerato...', 'rel': 'self'}], 'created_at': '2023-09-21T16:38:42+00:00', 'updated_at': '2023-09-21T16:38:42+00:00'}}, ARQs for network:{} _build_resources /var/lib/openstack/lib/python3.10/site-packages/nova/compute/manager.py:2680
So the mdev is then allocated in the resource providers at that point.
Is there some cyborg nova patching code I am missing?
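To compare the device Cyborg bound with the devices Nova reports, the `attach_handle_info` fields from the ARQ above can be turned into a PCI address and a libvirt-style node device name. This is only a minimal sketch: `arq_to_pci_names` is a hypothetical helper, not part of Nova or Cyborg, and the `pci_<domain>_<bus>_<device>_<function>` naming is assumed from the `pci_0000_4b_03_1` entry in Nova's log.

```python
def arq_to_pci_names(info):
    """Build a PCI address and a libvirt-style nodedev name from ARQ info.

    `info` is the 'attach_handle_info' dict from the ARQ; only the
    'domain', 'bus', 'device' and 'function' keys are used.
    This helper is hypothetical and exists only for illustration.
    """
    address = "{domain}:{bus}:{device}.{function}".format(**info)
    # Assumed naming convention, inferred from 'pci_0000_4b_03_1' in the log.
    nodedev = "pci_" + address.replace(":", "_").replace(".", "_")
    return address, nodedev


# Values copied from the ARQ in the log line above.
attach_handle_info = {
    "asked_type": "nvidia-566",
    "bus": "ca",
    "device": "01",
    "domain": "0000",
    "function": "1",
    "vgpu_mark": "nvidia-566_0",
}

address, nodedev = arq_to_pci_names(attach_handle_info)
print(address, nodedev)
```

If the resulting name does not appear in the device list of Nova's "No free space found" message, that would suggest Nova and Cyborg are not looking at the same device; the list is truncated in the log above, so this still needs checking on the host.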
*From: *Sylvain Bauza <sbauza@redhat.com> *Date: *Friday, 22 September 2023 at 1:49 am *To: *Karl Kloppenborg <kkloppenborg@resetdata.com.au> *Cc: *openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org> *Subject: *Re: Cyborg nova reports mdev-capable resource is not available
On Thu, 21 Sept 2023 at 17:27, Karl Kloppenborg <kkloppenborg@resetdata.com.au> wrote:
Hi Cyborg Team!
Karl from Helm Team.
When creating a VM with the correct flavor, the mdev gets created by cyborg agent and I can see it in the nodedev-list --cap mdev.
However Nova then fails with:
nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
I don't exactly remember how Cyborg passes the devices to nova/libvirt, but this exception is raised because none of the available GPUs has either an existing mdev or the capacity to create a new one.
You should first check sysfs to double-check the state of your GPU devices, in order to understand how much vGPU capacity you still have.
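A minimal sketch of such a sysfs check, assuming the standard kernel mdev layout in which each type directory under `/sys/class/mdev_bus/<pci address>/mdev_supported_types/` exposes an `available_instances` file. The helper name `mdev_capacity` is made up for this illustration and is not a Nova or Cyborg function.

```python
from pathlib import Path


def mdev_capacity(root="/sys/class/mdev_bus"):
    """Return {(pci_address, mdev_type): available_instances}.

    Walks <root>/<pci_address>/mdev_supported_types/<mdev_type>/ and reads
    the 'available_instances' file of every supported type.
    """
    capacity = {}
    pattern = "*/mdev_supported_types/*/available_instances"
    for path in sorted(Path(root).glob(pattern)):
        mdev_type = path.parent.name
        pci_address = path.parent.parent.parent.name
        try:
            capacity[(pci_address, mdev_type)] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or parsed.
            continue
    return capacity


# Prints nothing on a host without mdev-capable devices.
for (pci_address, mdev_type), free in mdev_capacity().items():
    print(pci_address, mdev_type, free)
```

A value of 0 for every address and type would be consistent with the "No free space found" message, but it does not by itself say why the capacity is exhausted.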
-Sylvain
Once this happened, ARQ removes the mdev and cleans up.
I’ve got Cyborg 2023.2 running and have a device profile like so:
karl@Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
+-------------+---------------------------------------------------------------------------+
| Field       | Value                                                                     |
+-------------+---------------------------------------------------------------------------+
| created_at  | 2023-09-21 13:30:05+00:00                                                 |
| updated_at  | None                                                                      |
| uuid        | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae                                      |
| name        | VGPU_A40-Q48                                                              |
| groups      | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
| description | None                                                                      |
+-------------+---------------------------------------------------------------------------+
karl@Karls-Air ~ %
I can see the allocation candidate:
karl@Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
| 41 | VGPU=1 | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1 | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
karl@Karls-Air ~ %
Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?
Any help from the Cyborg or Nova teams would be fantastic.
Thanks, Karl.
Hi Sylvain,

thanks for the link. I wasn't aware of https://review.opendev.org/q/topic:%22bug/2041519%22 and will take a look. In the meantime, I have put our patch (against Yoga) at https://cernbox.cern.ch/s/IgDExL99wyNjOfc

- Ulrich

On 07/12/2023 14:18, Ulrich Schwickerath wrote:
Hi,
sorry for stepping in so late. We have seen an issue very similar to https://bugs.launchpad.net/nova/+bug/2015892 and to the one described here. Our case is maybe a bit different: we have a box with 4 A100 cards, and we wanted to set it up with MIG enabled on all of them. What we saw was that more mediated devices were created than were actually available, and that both Nova and Cyborg created too many resource providers in placement. Then, depending on which resource provider was selected, the VM failed to spawn because the selected device was not available.
We now have a patch for Nova which seems to work for our use case. The basic idea of the patch is to check how many instances are available for each of the physical GPUs (from the description field of one of the devices, e.g. /sys/class/mdev_bus/0000\:81\:00.4/mdev_supported_types/nvidia-476/description) and ensure that only this number of resource providers is created in placement. A side effect is that it also works if the GPUs in the hypervisor have different MIG setups (with the restriction that each card must be set up with a single MIG type only, e.g. 2g.10gb,2g.10gb,2g.10gb for GPU0 and 3g.20gb,3g.20gb for GPU1, etc.).
It's not perfect; in particular, I have not tested whether this breaks the regular VGPU setup for time-based slicing of the GPUs, nor how to do this with Cyborg. The VGPU type selection is done via a trait and a custom resource. If you're interested, feel free to get back to me.
Cheers, Ulrich
On 22/09/2023 07:49, Alex Song (宋文平) wrote:
Please refer to the official NVIDIA documentation: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#introduction
From: Karl Kloppenborg <kkloppenborg@resetdata.com.au>
Sent: 22 September 2023 10:53
To: Alex Song (宋文平) <songwenping@inspur.com>; sbauza@redhat.com
Cc: openstack-discuss@lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available
Ah, thank you for pointing me towards that, Alex.
I guess I should probably look at the MIG pathway.
I wonder if it’s possible to do vGPU profiles in MIG configuration.
Have you any experience with this?
Thanks, Karl.
From: Alex Song (宋文平) <songwenping@inspur.com>
Date: Friday, 22 September 2023 at 12:17 pm
To: Karl Kloppenborg <kkloppenborg@resetdata.com.au>, sbauza@redhat.com
Cc: openstack-discuss@lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available
Hi Karl,
Your problem is similar to this bug: https://bugs.launchpad.net/nova/+bug/2015892
I guess you did not split the card into MIG instances, if you are using an A-series card.
From: Karl Kloppenborg <kkloppenborg@resetdata.com.au>
Sent: 22 September 2023 0:43
To: Sylvain Bauza <sbauza@redhat.com>
Cc: openstack-discuss@lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available
Hi Sylvain,
Thanks for getting back to me.
So the vGPU is available and cyborg is allocating it using ARQ binding.
You can see Nova receives this request:
2023-09-21 16:38:51.889 1901814 DEBUG nova.compute.manager [None req-97062e9c-0c44-480e-9918-4a5a810175b2 78e83e5a446e4071ae43e823135dcb3c 21eb701c2a1f48b38dab8f34c0a20902 - - default default] ARQs for spec:{'2d60c353-0419-4b67-8cb7-913fc6f5cef9': {'uuid': '2d60c353-0419-4b67-8cb7-913fc6f5cef9', 'state': 'Bound', 'device_profile_name': 'VGPU_A40-Q48', 'device_profile_group_id': 0, 'hostname': 'gpu-c-01', 'device_rp_uuid': '229bf15f-5689-3d2c-b37b-5c8439ea6a71', 'instance_uuid': '1b090007-791b-4997-af89-0feb886cf11d', 'project_id': None, 'attach_handle_type': 'MDEV', 'attach_handle_uuid': '866bd6a5-b156-4251-a969-64fefb32f16f', 'attach_handle_info': {'asked_type': 'nvidia-566', 'bus': 'ca', 'device': '01', 'domain': '0000', 'function': '1', 'vgpu_mark': 'nvidia-566_0'}, 'links': [{'href': 'http://cyborg-api.openstack.svc.cluster.local:6666/accelerator/v2/accelerato...', 'rel': 'self'}], 'created_at': '2023-09-21T16:38:42+00:00', 'updated_at': '2023-09-21T16:38:42+00:00'}}, ARQs for network:{} _build_resources /var/lib/openstack/lib/python3.10/site-packages/nova/compute/manager.py:2680
So the mdev is then allocated in the resource providers at that point.
Is there some cyborg nova patching code I am missing?
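Since a wrong PCI address is one of the suspicions in this thread, the ARQ above can be compared with the device list in Nova's "No free space found" message. The sketch below rebuilds both spellings of the address from `attach_handle_info`. It assumes that Cyborg's `domain`, `bus`, `device` and `function` fields map directly onto the PCI address components, and the two helper names are made up for illustration. The `pci_..._.._.._.` form follows the device name shown in the Nova log ('pci_0000_4b_03_1').

```python
def pci_address(info):
    """Build a sysfs-style PCI address (dddd:bb:dd.f) from attach_handle_info."""
    return f"{info['domain']}:{info['bus']}:{info['device']}.{info['function']}"


def libvirt_nodedev_name(info):
    """Build the libvirt node device spelling, e.g. pci_0000_4b_03_1."""
    return "pci_{domain}_{bus}_{device}_{function}".format(**info)


# attach_handle_info copied from the ARQ in the log line above.
info = {
    "asked_type": "nvidia-566",
    "bus": "ca",
    "device": "01",
    "domain": "0000",
    "function": "1",
    "vgpu_mark": "nvidia-566_0",
}

print(pci_address(info))           # 0000:ca:01.1
print(libvirt_nodedev_name(info))  # pci_0000_ca_01_1
```

If `pci_0000_ca_01_1` does not appear in the list of devices that Nova reports having searched, Nova and Cyborg are probably not looking at the same device, which would be worth checking before anything else.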
From: Sylvain Bauza <sbauza@redhat.com>
Date: Friday, 22 September 2023 at 1:49 am
To: Karl Kloppenborg <kkloppenborg@resetdata.com.au>
Cc: openstack-discuss@lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available
On Thu, 21 Sept 2023 at 17:27, Karl Kloppenborg <kkloppenborg@resetdata.com.au> wrote:
Hi Cyborg Team!
Karl from Helm Team.
When creating a VM with the correct flavor, the mdev gets created by cyborg agent and I can see it in the nodedev-list --cap mdev.
However Nova then fails with:
nova.virt.libvirt.driver [<removed>- - default default] Searching for available mdevs... _get_existing_mdevs_not_assigned /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8357
2023-09-21 14:34:47.808 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Available mdevs at: set().
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] No available mdevs where found. Creating an new one... _allocate_mdevs /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8496
2023-09-21 14:34:47.809 1901814 DEBUG nova.virt.libvirt.driver [<removed> - - default default] Attempting to create new mdev... _create_new_mediated_device /var/lib/openstack/lib/python3.10/site-packages/nova/virt/libvirt/driver.py:8385
2023-09-21 14:34:48.455 1901814 INFO nova.virt.libvirt.driver [<removed> - - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_4b_03_1', … <truncated list>].
2023-09-21 14:34:48.456 1901814 ERROR nova.compute.manager [<removed> - - default default] [instance: 2026e2a2-b17a-43ab-adcb-62a907f58b51] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
I don't exactly remember how Cyborg passes the devices to nova/libvirt, but this exception is raised because none of the available GPUs has either an existing mdev or the capability to create one.
You should first check sysfs to double-check the state of your GPU devices, in order to understand how much vGPU capacity you still have.
-Sylvain
Once this happened, ARQ removes the mdev and cleans up.
I’ve got Cyborg 2023.2 running and have a device profile like so:
karl@Karls-Air ~ % openstack accelerator device profile show e2b07e11-fe69-4f33-83fc-0f9e38adb7ae
+-------------+---------------------------------------------------------------------------+
| Field | Value |
+-------------+---------------------------------------------------------------------------+
| created_at | 2023-09-21 13:30:05+00:00 |
| updated_at | None |
| uuid | e2b07e11-fe69-4f33-83fc-0f9e38adb7ae |
| name | VGPU_A40-Q48 |
| groups | [{'resources:VGPU': '1', 'trait:CUSTOM_NVIDIA_2235_A40_48Q': 'required'}] |
| description | None |
+-------------+---------------------------------------------------------------------------+
karl@Karls-Air ~ %
I can see the allocation candidate:
karl@Karls-Air ~ % openstack allocation candidate list --resource VGPU=1 | grep A40
| 41 | VGPU=1 | 229bf15f-5689-3d2c-b37b-5c8439ea6a71 | VGPU=0/1 | OWNER_CYBORG,CUSTOM_NVIDIA_2235_A40_48Q |
karl@Karls-Air ~ %
Am I missing something critical here? Because I cannot seem to figure this out… have I got a PCI address wrong, or something?
Any help from the Cyborg or Nova teams would be fantastic.
Thanks, Karl.
On Thu, 7 Dec 2023 at 17:20, Ulrich Schwickerath <Ulrich.Schwickerath@cern.ch> wrote:
Hi, Sylvain,
thanks for the link, I wasn't aware of https://review.opendev.org/q/topic:%22bug/2041519%22, will take a look. In the meantime I put our patch in https://cernbox.cern.ch/s/IgDExL99wyNjOfc (against Yoga)
- Ulrich
Please be aware that the description field you read to get the maximum number of instances can't really be relied on in a generic manner, as it is a proprietary blob provided by NVIDIA and not formally defined. My strategy is a bit different from yours: I ask the operator to set the max instances directly. This requires a service restart, which will clean up the useless RPs. With your patch, the RP is deleted when the next periodic run happens, which could be slightly later than the creation of the last mdev.
FWIW, the workaround (until we merge the bugfix) that works with any post-Train release is to pre-create all of the mediated devices and not let Nova create any mdev. This way, no useless RP is created.
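The pre-creation workaround can be sketched as below. This is a minimal sketch, not Nova code, under these assumptions: the kernel mdev sysfs interface is used directly (writing a UUID to the `create` file of a supported type creates one mdev), the script runs as root on the compute node, and the function name `precreate_mdevs` is made up for illustration. Mdevs created this way through sysfs do not survive a reboot, so in practice the script would have to run at boot, before nova-compute starts.

```python
import uuid
from pathlib import Path


def precreate_mdevs(pci_address, mdev_type, mdev_bus="/sys/class/mdev_bus"):
    """Create as many mdevs of `mdev_type` as the device currently allows.

    Hypothetical helper: reads available_instances once, then writes one
    fresh UUID per free slot to the type's `create` file. Returns the
    list of UUIDs that were written.
    """
    type_dir = Path(mdev_bus) / pci_address / "mdev_supported_types" / mdev_type
    free = int((type_dir / "available_instances").read_text().strip())
    created = []
    for _ in range(free):
        mdev_uuid = str(uuid.uuid4())
        # On real sysfs this write asks the vendor driver to create the mdev.
        (type_dir / "create").write_text(mdev_uuid)
        created.append(mdev_uuid)
    return created
```

For example, `precreate_mdevs("0000:81:00.4", "nvidia-476")` would fill the device and type used as an example elsewhere in this thread. The address and type have to be replaced with the ones reported on the actual host.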
Hi, Ulrich,
The PoC code on the Cyborg side to manage A-series cards is: https://review.opendev.org/c/openstack/cyborg/+/855986 . As one VF can only be passed through to one VM, we first check whether each VF is in use, then check the unused VFs via available_instances in the /sys/class/mdev_bus/0000\:81\:00.4/mdev_supported_types/nvidia-476/ directory, and report the VFs whose available_instances is 1 to the DB for scheduling.
For your reference. Best regards
participants (4)
- Alex Song (宋文平)
- Karl Kloppenborg
- Sylvain Bauza
- Ulrich Schwickerath