Experience with VGPUs
Mahendra Paipuri
mahendra.paipuri at cnrs.fr
Thu Jun 22 08:43:25 UTC 2023
Hello all,
Thanks @Ulrich for sharing the presentation. Very informative!!
One question: if I understood correctly, *time-sliced* vGPUs
*absolutely need* GRID drivers and licensed clients in the guests in
order to work, whereas for MIG partitioning there is *no need* to
install GRID drivers in the guest and *no need* to have licensed
clients. Could you confirm whether this is actually the case?
Cheers.
Regards
Mahendra
On 21/06/2023 16:10, Ulrich Schwickerath wrote:
>
> Hi, all,
>
> Sylvain explained quite well how to do it technically. We have a PoC
> running; however, we still have some stability issues, as mentioned at
> the summit. We're running the NVIDIA virtualisation drivers on the
> hypervisors and the guests, which requires a license from NVIDIA. In
> our configuration we are still quite limited in the sense that we have
> to configure all cards on the same hypervisor in the same way, that
> is, with the same MIG partitioning. Also, it is not possible to attach
> more than one device to a single VM.
>
> As mentioned in the presentation we are a bit behind with Nova, and in
> the process of fixing this as we speak. Because of that we had to do a
> couple of backports in Nova to make it work, which we hope to be able
> to get rid of through the ongoing upgrades.
>
> Let me see if I can make the slides available here.
>
> Cheers, Ulrich
>
> On 20/06/2023 19:07, Oliver Weinmann wrote:
>> Hi everyone,
>>
>> Jumping into this topic again. Unfortunately I haven't had time yet
>> to test Nvidia vGPU in OpenStack, only in VMware vSphere. What our
>> users complain most about is the inflexibility, since you have to use
>> the same profile on all VMs that use the GPU. One user suggested
>> trying SLURM. I know there is no official OpenStack project for
>> SLURM, but I wonder if anyone else has tried this approach? If I
>> understood correctly this would also not require any Nvidia
>> subscription, since you pass the GPU through to a single instance and
>> use neither vGPU nor MIG.
>>
>> Cheers,
>> Oliver
>>
>> Sent from my iPhone
>>
>>> On 20.06.2023 at 17:34, Sylvain Bauza <sbauza at redhat.com> wrote:
>>>
>>>
>>>
>>>
>>> On Tue, 20 Jun 2023 at 16:31, Mahendra Paipuri
>>> <mahendra.paipuri at cnrs.fr> wrote:
>>>
>>> Thanks Sylvain for the pointers.
>>>
>>> One of the questions we have is: can we create MIG profiles on
>>> the host and then attach one or more profiles to each VM?
>>> This bug [1] reports that once we attach one profile to a VM, the
>>> rest of the MIG profiles become unavailable. From what you have
>>> said about using SR-IOV and VFs, I guess this should be possible.
>>>
>>>
>>> Correct, what you need is to first create the VFs using sriov-manage,
>>> and then you can create the MIG instances.
>>> Once you create the MIG instances with the profiles you want, you
>>> will see (by looking at sysfs) that available_instances for the
>>> related nvidia mdev type reports that you can have a single vGPU
>>> for this profile.
>>> Then, you can use that mdev type with Nova via nova.conf.
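>>>
>>> As a rough sketch (not from the talk; the VF PCI addresses and the
>>> nvidia mdev type id below are just placeholders, so check what your
>>> card actually exposes), the sysfs check and the nova.conf part look
>>> roughly like this on recent Nova releases:
>>>
>>>     # list the mdev types exposed by one of the VFs and check that
>>>     # the MIG-backed profile has capacity for one vGPU
>>>     ls /sys/class/mdev_bus/0000:41:00.4/mdev_supported_types/
>>>     cat /sys/class/mdev_bus/0000:41:00.4/mdev_supported_types/nvidia-699/available_instances
>>>
>>>     # nova.conf on the compute node (restart nova-compute afterwards)
>>>     [devices]
>>>     enabled_mdev_types = nvidia-699
>>>
>>>     [mdev_nvidia-699]
>>>     device_addresses = 0000:41:00.4,0000:41:00.5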
>>>
>>> That being said, while the above is simple, the talk below went into
>>> more detail about how to correctly use the GPU on the host, so please
>>> wait :-)
>>>
>>> I think you are talking about the "vGPUs with OpenStack Nova" talk
>>> on the OpenInfra stage. I will look into it once the videos are
>>> online.
>>>
>>>
>>> Indeed.
>>> -S
>>>
>>> [1] https://bugs.launchpad.net/nova/+bug/2008883
>>>
>>> Thanks
>>>
>>> Regards
>>>
>>> Mahendra
>>>
>>> On 20/06/2023 15:47, Sylvain Bauza wrote:
>>>>
>>>>
>>>> On Tue, 20 Jun 2023 at 15:12, PAIPURI Mahendra
>>>> <mahendra.paipuri at cnrs.fr> wrote:
>>>>
>>>> Hello Ulrich,
>>>>
>>>>
>>>> I am relaunching this discussion as I noticed that you gave
>>>> a talk about this topic at the OpenInfra Summit in Vancouver.
>>>> Is it possible to share the presentation here? I hope the
>>>> talks will be uploaded to YouTube soon.
>>>>
>>>>
>>>> We are mainly interested in using MIG instances in an
>>>> OpenStack cloud and I could not really find a lot of
>>>> information by googling. If you could share your
>>>> experiences, that would be great.
>>>>
>>>>
>>>>
>>>> Due to scheduling conflicts, I wasn't able to attend Ulrich's
>>>> session, but I will pay close attention to his feedback.
>>>>
>>>> FWIW, there was also a short session about how to enable MIG
>>>> and play with Nova at the OpenInfra stage (and that one I was
>>>> able to attend), and it was quite seamless. What exact
>>>> information are you looking for?
>>>> The idea with MIG is that you need to create SR-IOV VFs on top
>>>> of the MIG instances using the sriov-manage script provided by
>>>> NVIDIA, so that the mediated devices use those VFs as the base
>>>> PCI devices for Nova.
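>>>>
>>>> For reference, on the hypervisor that boils down to something
>>>> like the following (a sketch only, not from the session; the PCI
>>>> address and the MIG profile IDs are placeholders for whatever
>>>> your card reports):
>>>>
>>>>     # enable the SR-IOV VFs on the physical GPU
>>>>     /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
>>>>
>>>>     # turn on MIG mode on GPU 0 if not already enabled
>>>>     nvidia-smi -i 0 -mig 1
>>>>
>>>>     # carve the card into MIG GPU instances, e.g. two 3g.20gb slices
>>>>     nvidia-smi mig -i 0 -cgi 9,9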
>>>>
>>>> Cheers.
>>>>
>>>>
>>>> Regards
>>>>
>>>> Mahendra
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Ulrich Schwickerath <Ulrich.Schwickerath at cern.ch>
>>>> *Sent:* Monday, 16 January 2023 11:38:08
>>>> *To:* openstack-discuss at lists.openstack.org
>>>> *Subject:* Re: 答复: Experience with VGPUs
>>>>
>>>> Hi, all,
>>>>
>>>> just to add to the discussion, at CERN we have recently
>>>> deployed a bunch of A100 GPUs in PCI passthrough mode, and
>>>> are now looking into improving their usage by using MIG.
>>>> From the Nova point of view things seem to work OK: we can
>>>> schedule VMs requesting a vGPU, the client starts up and
>>>> gets a license token from our NVIDIA license server
>>>> (distributing license keys in our private cloud is
>>>> relatively easy in our case). It's a PoC only for the time
>>>> being, and we're not ready to put that forward as we're
>>>> facing issues with CUDA on the client (it fails immediately
>>>> on memory operations with 'not supported'; we are still
>>>> investigating why this happens).
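>>>>
>>>> To make the "VMs requesting a vGPU" part concrete for anyone
>>>> following along (this is just the standard Nova flavor approach,
>>>> nothing CERN-specific; the flavor name and sizes are arbitrary):
>>>>
>>>>     openstack flavor create --vcpus 4 --ram 16384 --disk 40 gpu.small
>>>>     openstack flavor set gpu.small --property "resources:VGPU=1"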
>>>>
>>>> Once we get that working it would be nice to have more
>>>> fine-grained scheduling so that people can ask for MIG
>>>> devices of different sizes. The other challenge is how to
>>>> set limits on GPU resources. Once the above issues have
>>>> been sorted out we may want to look into Cyborg as well,
>>>> so we are quite interested in first experiences with it.
>>>>
>>>> Kind regards,
>>>>
>>>> Ulrich
>>>>
>>>> On 13.01.23 21:06, Dmitriy Rabotyagov wrote:
>>>>> That said, the deb/rpm packages they are providing don't
>>>>> help much, as:
>>>>> * There is no repo for them, so you need to download them
>>>>> manually from the enterprise portal.
>>>>> * They can't be upgraded anyway, as the driver version is
>>>>> part of the package name, and each package conflicts with
>>>>> any other one. So you need to explicitly remove the old
>>>>> package and only then install the new one. And yes, you
>>>>> must stop all VMs before upgrading the driver, and no, you
>>>>> can't live migrate GPU mdev devices, as that is not
>>>>> implemented in QEMU. So deb/rpm/generic driver doesn't
>>>>> matter in the end, tbh.
>>>>>
>>>>>
>>>>> Fri, 13 Jan 2023, 20:56, Cedric <yipikai7 at gmail.com>:
>>>>>
>>>>>
>>>>> Ended up with the very same conclusions as Dmitriy
>>>>> regarding the use of Nvidia vGRID for the vGPU use
>>>>> case with Nova; it works pretty well, but:
>>>>>
>>>>> - you have to respect the licensing model as an
>>>>> operational constraint; note that guests need to reach
>>>>> a license server in order to get a token (either via
>>>>> the Nvidia SaaS service or on-prem)
>>>>> - drivers for both guest and hypervisor are not easy
>>>>> to deploy and maintain at scale. A year ago,
>>>>> hypervisor drivers were not packaged for
>>>>> Debian/Ubuntu, but built through a bash script, thus
>>>>> requiring additional automation work and careful
>>>>> attention regarding kernel updates/reboots of Nova
>>>>> hypervisors.
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> On Fri, Jan 13, 2023 at 4:21 PM Dmitriy Rabotyagov
>>>>> <noonedeadpunk at gmail.com> wrote:
>>>>> >
>>>>> > You are saying that as if Nvidia GRID drivers were
>>>>> > open-sourced, while in fact they're very far from being
>>>>> > that. In order to download drivers, not only for
>>>>> > hypervisors but also for guest VMs, you need to have an
>>>>> > account in their Enterprise Portal. It took me roughly 6
>>>>> > weeks of discussions with hardware vendors and Nvidia
>>>>> > support to get a proper account there. And that happened
>>>>> > only after applying for their Partner Network (NPN).
>>>>> > That still doesn't solve the issue of how to provide
>>>>> > drivers to guests, except to pre-build a series of images
>>>>> > with these drivers pre-installed (we ended up making a
>>>>> > DIB element for that [1]).
>>>>> > Not to mention the need to distribute license tokens to
>>>>> > guests and the whole mess with compatibility between
>>>>> > hypervisor and guest drivers (as the guest driver can't
>>>>> > be newer than the host one, and HVs can't be too new
>>>>> > either).
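>>>>> >
>>>>> > For anyone curious, building such a guest image with that
>>>>> > DIB element would look roughly like this (a sketch only;
>>>>> > the element name comes from the linked repo, and the exact
>>>>> > base elements and environment variables it expects should
>>>>> > be checked against its README):
>>>>> >
>>>>> >     git clone https://github.com/citynetwork/dib-elements
>>>>> >     export ELEMENTS_PATH=$PWD/dib-elements
>>>>> >     disk-image-create -o ubuntu-grid-guest ubuntu vm nvgrid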
>>>>> >
>>>>> > It's not that I'm defending AMD, I'm just saying that
>>>>> > Nvidia is not that straightforward either, and at least
>>>>> > on paper AMD vGPUs look easier both for operators and
>>>>> > end-users.
>>>>> >
>>>>> > [1] https://github.com/citynetwork/dib-elements/tree/main/nvgrid
>>>>> >
>>>>> > >
>>>>> > > As for AMD cards, AMD stated that some of their MI
>>>>> > > series cards support SR-IOV for vGPUs. However, those
>>>>> > > drivers are never open source, nor provided as closed
>>>>> > > source to the public; only large cloud providers are
>>>>> > > able to get them. So I don't really recommend getting
>>>>> > > AMD cards for vGPU unless you are able to get support
>>>>> > > from them.
>>>>> > >
>>>>> >
>>>>>