Experience with VGPUs

Mahendra Paipuri mahendra.paipuri at cnrs.fr
Thu Jun 22 08:43:25 UTC 2023


Hello all,

Thanks, @Ulrich, for sharing the presentation. Very informative!

One question: if I understood correctly, *time-sliced* vGPUs 
*absolutely need* GRID drivers and licensed clients for the vGPUs to 
work in the guests, whereas for MIG partitioning there is *no need* to 
install GRID drivers in the guests, nor to have licensed clients. 
Could you confirm whether this is actually the case?

Cheers.

Regards

Mahendra

On 21/06/2023 16:10, Ulrich Schwickerath wrote:
>
> Hi, all,
>
> Sylvain explained quite well how to do it technically. We have a PoC 
> running; however, we still have some stability issues, as mentioned at 
> the summit. We're running the NVIDIA virtualisation drivers on the 
> hypervisors and the guests, which requires a license from NVIDIA. Our 
> configuration is still quite limited in the sense that we have to 
> configure all cards in the same hypervisor in the same way, that is, 
> with the same MIG partitioning. Also, it is not possible to attach 
> more than one device to a single VM.
>
> As mentioned in the presentation, we are a bit behind with Nova and in 
> the process of fixing this as we speak. Because of that, we had to do 
> a couple of backports in Nova to make it work, which we hope to be 
> able to get rid of through the ongoing upgrades.
>
> Let me see if I can make the slides available here.
>
> Cheers, Ulrich
>
> On 20/06/2023 19:07, Oliver Weinmann wrote:
>> Hi everyone,
>>
>> Jumping into this topic again. Unfortunately I haven’t had time yet 
>> to test NVIDIA vGPU in OpenStack, but I have in VMware vSphere. What 
>> our users complain most about is the inflexibility, since you have to 
>> use the same profile on all VMs that use the GPU. One user suggested 
>> trying SLURM. I know there is no official OpenStack project for 
>> SLURM, but I wonder if anyone else has tried this approach? If I 
>> understood correctly, this would also not require any NVIDIA 
>> subscription, since you pass the GPU through to a single instance and 
>> use neither vGPU nor MIG.
>>
>> Cheers,
>> Oliver
>>
>> Sent from my iPhone
>>
>>> On 20.06.2023 at 17:34, Sylvain Bauza <sbauza at redhat.com> wrote:
>>>
>>>
>>>
>>>
>>> On Tue, 20 Jun 2023 at 16:31, Mahendra Paipuri 
>>> <mahendra.paipuri at cnrs.fr> wrote:
>>>
>>>     Thanks Sylvain for the pointers.
>>>
>>>     One of the questions we have is: can we create MIG profiles on
>>>     the host and then attach one or more profiles to each VM?
>>>     This bug [1] reports that once we attach one profile to a VM,
>>>     the rest of the MIG profiles become unavailable. From what you
>>>     have said about using SR-IOV and VFs, I guess this should be
>>>     possible.
>>>
>>>
>>> Correct, what you need is to first create the VFs using sriov-manage 
>>> and then create the MIG instances.
>>> Once you create the MIG instances using the profiles you want, you 
>>> will see that available_instances for the related nvidia mdev type 
>>> (visible in sysfs) reports that you can have a single vGPU for this 
>>> profile.
>>> Then, you can use that mdev type with Nova via nova.conf.
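>>>
>>> For illustration, the whole sequence might look like the sketch 
>>> below (untested; the PCI addresses, the MIG profile ID 19 and the 
>>> mdev type nvidia-699 are placeholders that vary per card and driver 
>>> version):
>>>
>>>     # enable the SR-IOV VFs on the physical GPU (the script ships
>>>     # with the NVIDIA vGPU host driver)
>>>     /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
>>>
>>>     # enable MIG mode, then create a GPU instance (plus its
>>>     # compute instance) from the chosen profile
>>>     nvidia-smi -i 0 -mig 1
>>>     nvidia-smi mig -cgi 19 -C
>>>
>>>     # a VF now exposes the matching mdev type with exactly one
>>>     # available instance
>>>     cat /sys/bus/pci/devices/0000:41:00.4/mdev_supported_types/nvidia-699/available_instances
>>>
>>>     # finally, whitelist that mdev type in nova.conf
>>>     [devices]
>>>     enabled_mdev_types = nvidia-699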
>>>
>>> That being said, while the above is simple, the talk below went into 
>>> more detail about how to use the GPU correctly on the host, so 
>>> please wait :-)
>>>
>>>     I think you are talking about the "vGPUs with OpenStack Nova"
>>>     talk on the OpenInfra stage. I will look into it once the
>>>     videos are online.
>>>
>>>
>>> Indeed.
>>> -S
>>>
>>>     [1] https://bugs.launchpad.net/nova/+bug/2008883
>>>
>>>     Thanks
>>>
>>>     Regards
>>>
>>>     Mahendra
>>>
>>>     On 20/06/2023 15:47, Sylvain Bauza wrote:
>>>>
>>>>
>>>>     On Tue, 20 Jun 2023 at 15:12, PAIPURI Mahendra
>>>>     <mahendra.paipuri at cnrs.fr> wrote:
>>>>
>>>>         Hello Ulrich,
>>>>
>>>>
>>>>         I am relaunching this discussion as I noticed that you gave
>>>>         a talk about this topic at the OpenInfra Summit in
>>>>         Vancouver. Is it possible to share the presentation here? I
>>>>         hope the talks will be uploaded to YouTube soon.
>>>>
>>>>
>>>>         We are mainly interested in using MIG instances in an
>>>>         OpenStack cloud, and I could not really find much
>>>>         information by googling. If you could share your
>>>>         experiences, that would be great.
>>>>
>>>>
>>>>
>>>>     Due to scheduling conflicts, I wasn't able to attend Ulrich's
>>>>     session, but I am very interested in his feedback.
>>>>
>>>>     FWIW, there was also a short session about how to enable MIG
>>>>     and play with Nova at the OpenInfra stage (that one I was able
>>>>     to attend), and it was quite seamless. What exact information
>>>>     are you looking for?
>>>>     The idea with MIG is that you need to create SR-IOV VFs above
>>>>     the MIG instances using the sriov-manage script provided by
>>>>     NVIDIA, so that the mediated devices will use those VFs as the
>>>>     base PCI devices for Nova.
>>>>
>>>>         Cheers.
>>>>
>>>>
>>>>         Regards
>>>>
>>>>         Mahendra
>>>>
>>>>         ------------------------------------------------------------------------
>>>>         *From:* Ulrich Schwickerath <Ulrich.Schwickerath at cern.ch>
>>>>         *Sent:* Monday, 16 January 2023 11:38:08
>>>>         *To:* openstack-discuss at lists.openstack.org
>>>>         *Subject:* Re: Re: Experience with VGPUs
>>>>
>>>>         Hi, all,
>>>>
>>>>         just to add to the discussion: at CERN we have recently
>>>>         deployed a bunch of A100 GPUs in PCI passthrough mode, and
>>>>         are now looking into improving their usage by using MIG.
>>>>         From the Nova point of view things seem to work OK: we can
>>>>         schedule VMs requesting a vGPU, the client starts up and
>>>>         gets a license token from our NVIDIA license server
>>>>         (distributing license keys in our private cloud is
>>>>         relatively easy in our case). It's a PoC only for the time
>>>>         being, and we're not ready to put it forward yet, as we're
>>>>         facing issues with CUDA on the client (it fails immediately
>>>>         on memory operations with 'not supported'; we are still
>>>>         investigating why this happens).
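>>>>
>>>>         For context, the passthrough setup amounts to a few lines
>>>>         of configuration; a minimal sketch (the product ID 20f1 and
>>>>         the alias name are placeholders, check lspci -nn for the
>>>>         real IDs):
>>>>
>>>>             # nova.conf on the compute node
>>>>             [pci]
>>>>             passthrough_whitelist = { "vendor_id": "10de", "product_id": "20f1" }
>>>>             alias = { "vendor_id": "10de", "product_id": "20f1", "device_type": "type-PCI", "name": "a100" }
>>>>             # (the alias must also be set in the nova-api host's nova.conf)
>>>>
>>>>             # flavor that requests one full GPU
>>>>             openstack flavor set gpu.large --property "pci_passthrough:alias"="a100:1"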
>>>>
>>>>         Once we get that working, it would be nice to have more
>>>>         fine-grained scheduling so that people can ask for MIG
>>>>         devices of different sizes; see the sketch below. The other
>>>>         challenge is how to set limits on GPU resources. Once the
>>>>         above issues have been sorted out we may want to look into
>>>>         Cyborg as well, so we are quite interested in first
>>>>         experiences with it.
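>>>>
>>>>         On the fine-grained scheduling point: recent Nova releases
>>>>         can map each enabled mdev type to its own resource class,
>>>>         so flavors can request a specific MIG size. A sketch (the
>>>>         type names, PCI addresses and class names are
>>>>         placeholders):
>>>>
>>>>             # nova.conf on the hypervisor
>>>>             [devices]
>>>>             enabled_mdev_types = nvidia-699, nvidia-700
>>>>
>>>>             [mdev_nvidia-699]
>>>>             device_addresses = 0000:41:00.4
>>>>             mdev_class = CUSTOM_VGPU_1G_5GB
>>>>
>>>>             [mdev_nvidia-700]
>>>>             device_addresses = 0000:41:00.5
>>>>             mdev_class = CUSTOM_VGPU_2G_10GB
>>>>
>>>>             # a flavor then asks for one device of a given size
>>>>             openstack flavor set gpu.small --property "resources:CUSTOM_VGPU_1G_5GB=1"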
>>>>
>>>>         Kind regards,
>>>>
>>>>         Ulrich
>>>>
>>>>         On 13.01.23 21:06, Dmitriy Rabotyagov wrote:
>>>>>         That said, the deb/rpm packages they provide don't help
>>>>>         much, as:
>>>>>         * there is no repo for them, so you need to download them
>>>>>         manually from the enterprise portal
>>>>>         * they can't be upgraded anyway, as the driver version is
>>>>>         part of the package name, and each package conflicts with
>>>>>         every other one. So you need to explicitly remove the old
>>>>>         package and only then install the new one. And yes, you
>>>>>         must stop all VMs before upgrading the driver, and no, you
>>>>>         can't live migrate GPU mdev devices, as that is not yet
>>>>>         implemented in QEMU. So deb/rpm/generic driver doesn't
>>>>>         matter in the end, tbh.
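>>>>>
>>>>>         In practice the upgrade cycle looks roughly like the
>>>>>         sketch below (the package names are hypothetical; the real
>>>>>         ones embed the driver version):
>>>>>
>>>>>             # stop every VM using an mdev on this hypervisor;
>>>>>             # they cannot be live-migrated away
>>>>>             openstack server stop <server-id>
>>>>>
>>>>>             # remove the old package, then install the new one;
>>>>>             # a normal package upgrade is impossible since the
>>>>>             # two package names conflict
>>>>>             dpkg -r nvidia-vgpu-kvm-525
>>>>>             dpkg -i nvidia-vgpu-kvm-535_amd64.deb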
>>>>>
>>>>>
>>>>>         Fri, 13 Jan 2023, 20:56 Cedric <yipikai7 at gmail.com>:
>>>>>
>>>>>
>>>>>             I ended up with the very same conclusions as Dmitriy
>>>>>             regarding the use of NVIDIA vGRID for the vGPU use
>>>>>             case with Nova. It works pretty well, but:
>>>>>
>>>>>             - the licensing model must be respected as an
>>>>>             operational constraint; note that guests need to reach
>>>>>             a license server in order to get a token (either via
>>>>>             the NVIDIA SaaS service or on-prem)
>>>>>             - drivers for both guest and hypervisor are not easy
>>>>>             to deploy and maintain at large scale. A year ago,
>>>>>             hypervisor drivers were not packaged for Debian/Ubuntu
>>>>>             but built through a bash script, thus requiring
>>>>>             additional automation work and careful attention
>>>>>             regarding kernel updates/reboots of Nova hypervisors.
>>>>>
>>>>>             Cheers
>>>>>
>>>>>
>>>>>             On Fri, Jan 13, 2023 at 4:21 PM Dmitriy Rabotyagov
>>>>>             <noonedeadpunk at gmail.com> wrote:
>>>>>             >
>>>>>             > You are saying that as if NVIDIA GRID drivers were
>>>>>             > open source, while in fact they're very far from
>>>>>             > being that. In order to download drivers not only
>>>>>             > for hypervisors but also for guest VMs, you need to
>>>>>             > have an account in their Enterprise Portal. It took
>>>>>             > me roughly 6 weeks of discussions with hardware
>>>>>             > vendors and NVIDIA support to get a proper account
>>>>>             > there, and that happened only after applying for
>>>>>             > their Partner Network (NPN).
>>>>>             > That still doesn't solve the issue of how to
>>>>>             > provide drivers to guests, except pre-building a
>>>>>             > series of images with these drivers pre-installed
>>>>>             > (we ended up making a DIB element for that [1]).
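>>>>>             >
>>>>>             > A minimal sketch of building such an image with
>>>>>             > diskimage-builder (the element name nvgrid comes
>>>>>             > from the repo in [1]; exact options may differ):
>>>>>             >
>>>>>             >     pip install diskimage-builder
>>>>>             >     git clone https://github.com/citynetwork/dib-elements
>>>>>             >     export ELEMENTS_PATH=$PWD/dib-elements
>>>>>             >     disk-image-create -o ubuntu-nvgrid ubuntu vm nvgrid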
>>>>>             > Not to mention the need to distribute license
>>>>>             > tokens to guests, and the whole mess with
>>>>>             > compatibility between hypervisor and guest drivers
>>>>>             > (the guest driver can't be newer than the host one,
>>>>>             > and the host can't be too new either).
>>>>>             >
>>>>>             > It's not that I'm defending AMD, I'm just saying
>>>>>             > that NVIDIA is not that straightforward either, and
>>>>>             > at least on paper AMD vGPUs look easier both for
>>>>>             > operators and end-users.
>>>>>             >
>>>>>             > [1] https://github.com/citynetwork/dib-elements/tree/main/nvgrid
>>>>>             >
>>>>>             > >
>>>>>             > > As for AMD cards, AMD stated that some of their
>>>>>             > > MI series cards support SR-IOV for vGPUs. However,
>>>>>             > > those drivers have never been open source, nor
>>>>>             > > provided as closed source to the public; only
>>>>>             > > large cloud providers are able to get them. So I
>>>>>             > > don't really recommend getting AMD cards for vGPU
>>>>>             > > unless you are able to get support from them.
>>>>>             > >
>>>>>             >
>>>>>