Hi, Ulrich,
The PoC code on the Cyborg side to manage A-series cards is: https://review.opendev.org/c/openstack/cyborg/+/855986 .
Since one VF can only be passed through to one VM, we first check whether each VF is in use, then check the unused VFs via available_instances in the /sys/class/mdev_bus/0000\:81\:00.4/mdev_supported_types/nvidia-476/ directory, and report the VFs whose available_instances is 1 to the DB for scheduling.
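A minimal sketch of the available_instances check described above. It is not the code from the review linked above: the function name free_vfs, its parameters, and the used_vfs set are hypothetical, and only the sysfs layout is taken from the path quoted in this mail.

```python
from pathlib import Path


def free_vfs(mdev_bus_root, mdev_type, used_vfs=frozenset()):
    """Return the VF addresses that can still host one mdev of `mdev_type`.

    mdev_bus_root: directory laid out like /sys/class/mdev_bus
    mdev_type:     mdev type name, e.g. "nvidia-476"
    used_vfs:      addresses already known to be in use (hypothetical input)
    """
    free = []
    for vf_dir in sorted(Path(mdev_bus_root).iterdir()):
        if vf_dir.name in used_vfs:
            continue  # step 1: skip VFs already marked as used
        counter = (vf_dir / "mdev_supported_types" / mdev_type
                   / "available_instances")
        try:
            available = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # type not offered by this VF, or unreadable value
        if available == 1:
            free.append(vf_dir.name)  # step 2: VF is still free
    return free
```

On a host this would be called as free_vfs("/sys/class/mdev_bus", "nvidia-476"); the returned addresses are the ones that would be reported for scheduling.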
This is for your reference. Best regards,
From: Ulrich Schwickerath [mailto:ulrich.schwickerath@cern.ch]
Sent: 7 December 2023 21:18
To: Alex Song (宋文平) <songwenping@inspur.com>; kkloppenborg@resetdata.com.au; sbauza@redhat.com
Cc: openstack-discuss@lists.openstack.org
Subject: Re: RE: Cyborg nova reports mdev-capable resource is not available
Hi,
sorry for stepping in so late. We have seen an issue very similar to https://bugs.launchpad.net/nova/+bug/2015892 and to the one described here. Our case is maybe a bit different: we have a box with 4 A100 cards, and we wanted to set it up with MIG enabled on all of them. What we saw was that more mediated devices were created than were actually available, and that both Nova and Cyborg created too many resource providers in placement. Then, depending on which resource provider was selected, the VM failed to spawn because the selected device was not available.
We now have a patch for Nova which seems to work for our use case. The basic idea of the patch is to check how many instances are available for each of the physical GPUs (from the description field of one of the devices, e.g. /sys/class/mdev_bus/0000\:81\:00.4/mdev_supported_types/nvidia-476/description) and to ensure that only this number of resource providers is created in placement. A side effect is that it also works if the GPUs in the hypervisor have different MIG setups (with the restriction that each card must be set up with only a single MIG type, e.g. 2g.10gb,2g.10gb,2g.10gb for GPU0 and 3g.20gb,3g.20gb for GPU1, etc.).
It's not perfect; in particular, I have tested neither whether this breaks the regular vGPU setup for time-based slicing of the GPUs nor how to do this with Cyborg. The vGPU type selection is done via a trait and a custom resource. If you're interested, feel free to get back to me.
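A rough sketch of the capping idea, not the Nova patch itself. It assumes the description file contains a max_instance=N field (as in the key=value lines the NVIDIA vGPU driver writes); the helper names parse_max_instance and cap_providers are hypothetical, and how candidate devices are grouped per physical GPU is left out.

```python
import re


def parse_max_instance(description):
    """Extract N from a 'max_instance=N' field, or None if it is absent.

    The field name is an assumption about the description format.
    """
    match = re.search(r"\bmax_instance=(\d+)", description)
    return int(match.group(1)) if match else None


def cap_providers(candidate_devices, description):
    """Keep at most max_instance candidates for one physical GPU.

    candidate_devices: device addresses belonging to a single physical GPU
    description:       contents of .../mdev_supported_types/<type>/description
    """
    limit = parse_max_instance(description)
    if limit is None:
        return list(candidate_devices)  # no limit found: leave list unchanged
    return list(candidate_devices)[:limit]
```

With a description ending in max_instance=3 and four candidate devices, only the first three would be turned into resource providers.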
Cheers, Ulrich
On 22/09/2023 07:49, Alex Song (宋文平) wrote:
Please refer to the official NVIDIA documentation: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#introduction
From: Karl Kloppenborg [mailto:kkloppenborg@resetdata.com.au]
Sent: 22 September 2023 10:53
To: Alex Song (宋文平) <songwenping@inspur.com>; sbauza@redhat.com
Cc: openstack-discuss@lists.openstack.org
Subject: Re: Cyborg nova reports mdev-capable resource is not available
Ah, thank you for pointing me towards that, Alex.
I guess I should probably look at the MIG pathway.
I wonder if it’s possible to do vGPU profiles in a MIG configuration.
Have you any experience with this?
Thanks,
Karl.