From: melanie witt <melwittt@gmail.com>
Date: Saturday, 9 November 2024 at 6:20 am
To: Mickael Razzouk <mickael.razzouk@infomaniak.com>, openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [nova] mdev management for vgpu
On 11/8/24 02:04, Mickael Razzouk wrote:
> Hello, I am trying a few things with vGPU on Nova and I am struggling with the mdev management Nova does.
>
> My environment:
> - Nvidia A2 pGPU
> - OpenStack Caracal
>
> What I did:
> - installed the NVIDIA GRID host driver on the compute
> - enabled the virtual function of the pGPU
> - added in nova.conf the mdev type wanted and the pci address of the virtual functions
> - restarted nova
> - created a flavor with the property resources:VGPU=1
>
> (Basically, I followed this documentation:
> https://docs.openstack.org/nova/latest/admin/virtual-gpu.html)
>
> The only difference is that I did not put the pGPU PCI address in nova.conf but the addresses of its virtual functions, so in my case 0000:41:00.4 and not 0000:41:00.0 (the resource providers (RPs) were not detected otherwise).
>
> From my understanding of this spec (https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/add-support-for-vgpu.html)
> and some experimentation,
> when a vGPU is requested by the end user:
> - Nova looks at the mdevs that already exist and uses one of them if it is available.
> - If no mdev is available, Nova looks at the RP tree to find mdev-capable PCI devices that have the specific mdev type available.
> - If such a PCI device is available, Nova creates the mdev with a UUID and creates the domain XML with this UUID inside.
> - If no such device is available, Nova returns a "no host available" error.
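
The four steps quoted above can be sketched in a few lines of Python. This is only an illustration of the flow as described in the message, not Nova's actual code: the names MdevCapableDevice, pick_mdev and NoValidHost, and the mdev type string used in the usage note, are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


class NoValidHost(Exception):
    """Hypothetical error for 'nothing to reuse and nothing to create'."""


@dataclass
class MdevCapableDevice:
    # PCI address of an mdev-capable device, e.g. a virtual function.
    address: str
    # Remaining creatable instances per mdev type (hypothetical bookkeeping).
    available: dict = field(default_factory=dict)


def pick_mdev(free_mdevs, devices, mdev_type):
    """Return (mdev_uuid, created) for one vGPU request."""
    # Step 1: reuse an existing mdev that no guest is using.
    if free_mdevs:
        return free_mdevs.pop(0), False
    # Step 2: otherwise look for a device that can still create this type.
    for dev in devices:
        if dev.available.get(mdev_type, 0) > 0:
            # Step 3: "create" a new mdev with a fresh UUID on that device.
            dev.available[mdev_type] -= 1
            return str(uuid.uuid4()), True
    # Step 4: nothing reusable and nothing creatable.
    raise NoValidHost(f"no device can provide mdev type {mdev_type}")
```

For example, with a single device MdevCapableDevice("0000:41:00.4", {"some-type": 1}) and no free mdevs, the first call creates a new UUID and the second call raises NoValidHost. The UUID returned in step 3 is what ends up in the domain XML, which is why an mdev recreated with a different UUID after a reboot does not match.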
>
> My questions are:
>
> - Mdevs are not persistent across reboots, and when a reboot happens, Nova crashes at startup because the mdevs are missing. Just recreating the mdevs does not fix the issue, as they need to have the same UUIDs.
> One could fix the issue by making the mdevs persistent using a tool like mdevctl, but since Nova creates the mdev the first time, it seems to me that Nova tries to manage mdevs. Is it normal for it to require manual intervention afterward?
> Or am I missing something?
I can't answer both of your questions but I can say that for 2024.1
(Caracal), no you are not missing something -- mdevs are not persisted
across reboot. See https://bugs.launchpad.net/nova/+bug/1900800 for details.
However in 2024.2 (Dalmatian) we added support for persistent mdevs
(note this requires libvirt >= 7.3.0):
https://docs.openstack.org/releasenotes/nova/2024.2.html
HTH,
-melwitt
> - In the Nova configuration, is it normal to diverge from the documentation and enter the VF PCI addresses so that the RPs are detected?
>
> Thank you in advance for your help.