Re: 答复: Experience with VGPUs

Sylvain Bauza sbauza at redhat.com
Tue Jun 20 13:47:33 UTC 2023


On Tue, 20 Jun 2023 at 15:12, PAIPURI Mahendra <mahendra.paipuri at cnrs.fr>
wrote:

> Hello Ulrich,
>
>
> I am reviving this discussion as I noticed that you gave a talk about
> this topic at the OpenInfra Summit in Vancouver. Is it possible to share the
> presentation here? I hope the talks will be uploaded to YouTube soon.
>
>
> We are mainly interested in using MIG instances in an OpenStack cloud and I
> could not really find a lot of information by googling. If you could share
> your experiences, that would be great.
>
>
>
Due to scheduling conflicts, I wasn't able to attend Ulrich's session, but
I'll be listening closely to his feedback.

FWIW, there was also a short session about how to enable MIG and play with
Nova at the OpenInfra stage (that one I was able to attend), and it was
quite seamless. What exact information are you looking for?
The idea with MIG is that you need to create SR-IOV VFs on top of the MIG
instances using the sriov-manage script provided by NVIDIA, so that the
mediated devices use those VFs as the base PCI devices for Nova.
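
For illustration, the hypervisor-side sequence looks roughly like this (the
PCI addresses, MIG profile IDs and sriov-manage path below are placeholders,
and the exact ordering can vary between driver releases):

    # enable MIG mode on the first GPU (may require a GPU reset)
    nvidia-smi -i 0 -mig 1

    # create the SR-IOV VFs for that GPU with NVIDIA's sriov-manage script
    /usr/lib/nvidia/sriov-manage -e 0000:41:00.0

    # carve the GPU into MIG instances, e.g. two 3g.20gb slices on an A100
    nvidia-smi mig -i 0 -cgi 9,9 -C

    # the VFs now show up as PCI devices exposing the mdev types Nova can use
    ls /sys/bus/pci/devices/0000:41:00.4/mdev_supported_types/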

Cheers.
>
>
> Regards
>
> Mahendra
> ------------------------------
> *From:* Ulrich Schwickerath <Ulrich.Schwickerath at cern.ch>
> *Sent:* Monday, 16 January 2023 11:38:08
> *To:* openstack-discuss at lists.openstack.org
> *Subject:* Re: 答复: Experience with VGPUs
>
>
> Hi, all,
>
> just to add to the discussion, at CERN we have recently deployed a bunch
> of A100 GPUs in PCI passthrough mode, and we are now looking into improving
> their usage by using MIG. From the Nova point of view things seem to work
> OK: we can schedule VMs requesting a VGPU, the client starts up and gets a
> license token from our NVIDIA license server (distributing license keys in
> our private cloud is relatively easy in our case). It's a PoC only for the
> time being, and we're not ready to put that forward, as we're facing issues
> with CUDA on the client (it fails immediately in memory operations with
> 'not supported'; we are still investigating why this happens).
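>
> As a minimal sketch of the flavor side of that workflow (names and sizes
> below are examples only):
>
>     openstack flavor create --vcpus 8 --ram 16384 --disk 40 vgpu.small
>     openstack flavor set vgpu.small --property resources:VGPU=1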
>
> Once we get that working it would be nice to have more fine-grained
> scheduling, so that people can ask for MIG devices of different sizes. The
> other challenge is how to set limits on GPU resources. Once the above
> issues have been sorted out we may want to look into Cyborg as well, so we
> are quite interested in first experiences with this.
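>
> For the different MIG sizes, Nova can expose several mdev types with
> distinct resource classes; a sketch of what we would expect that to look
> like in nova-compute configuration (type names, resource classes and PCI
> addresses are examples only):
>
>     [devices]
>     enabled_mdev_types = nvidia-700, nvidia-699
>
>     [mdev_nvidia-700]
>     device_addresses = 0000:84:00.4
>     mdev_class = CUSTOM_VGPU_3G20GB
>
>     [mdev_nvidia-699]
>     device_addresses = 0000:84:00.5
>     mdev_class = CUSTOM_VGPU_2G10GB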
>
> Kind regards,
>
> Ulrich
> On 13.01.23 21:06, Dmitriy Rabotyagov wrote:
>
> That said, the deb/rpm packages they are providing don't help much,
> as:
> * There is no repo for them, so you need to download them manually from
> the enterprise portal.
> * They can't be upgraded anyway, as the driver version is part of the
> package name, and each package conflicts with every other one. So you need
> to explicitly remove the old package and only then install the new one (see
> the sketch below). And yes, you must stop all VMs before upgrading the
> driver, and no, you can't live-migrate GPU mdev devices, as that is not
> implemented in QEMU. So deb/rpm vs. the generic driver doesn't really
> matter in the end, tbh.
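>
> Just to illustrate the upgrade dance (package names below are placeholders,
> not the real ones):
>
>     # stop or migrate away every guest using the GPU first; mdev live
>     # migration is not available, so they cannot stay running
>     openstack server stop <instance>
>
>     # each driver package conflicts with every other version, so the old
>     # one has to be removed before the new one is installed
>     dpkg -r nvidia-vgpu-kvm-525            # hypothetical old package
>     dpkg -i nvidia-vgpu-kvm-535_amd64.deb  # hypothetical new package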
>
>
> Fri, 13 Jan 2023, 20:56 Cedric <yipikai7 at gmail.com>:
>
>>
>> Ended up with the very same conclusions as Dmitriy regarding the use of
>> NVIDIA vGPU (GRID) for the vGPU use case with Nova: it works pretty well, but:
>>
>> - the licensing model has to be respected as an operational constraint;
>> note that guests need to reach a license server in order to get a token
>> (either via the NVIDIA SaaS service or on-prem) -- see the sketch below
>> - drivers for both guest and hypervisor are not easy to deploy and
>> maintain at large scale. A year ago, hypervisor drivers were not packaged
>> for Debian/Ubuntu but built through a bash script, thus requiring
>> additional automation work and careful attention regarding kernel
>> updates/reboots of Nova hypervisors.
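>>
>> A sketch of the guest-side licensing configuration (server address and
>> values are placeholders; newer NLS-based licensing uses a client
>> configuration token instead):
>>
>>     # /etc/nvidia/gridd.conf inside the guest
>>     ServerAddress=licserver.example.org
>>     ServerPort=7070
>>     # FeatureType 1 = NVIDIA vGPU license
>>     FeatureType=1
>>
>>     # with NLS licensing, a token file is dropped into:
>>     #   /etc/nvidia/ClientConfigToken/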
>>
>> Cheers
>>
>>
>> On Fri, Jan 13, 2023 at 4:21 PM Dmitriy Rabotyagov <
>> noonedeadpunk at gmail.com> wrote:
>> >
>> > You are saying that as if NVIDIA GRID drivers were open source, while
>> > in fact they're very far from being that. In order to download the
>> > drivers, not only for hypervisors but also for guest VMs, you need to
>> > have an account in their Enterprise Portal. It took me roughly 6 weeks
>> > of discussions with hardware vendors and NVIDIA support to get a
>> > proper account there. And that happened only after applying for their
>> > Partner Network (NPN).
>> > That still doesn't solve the issue of how to provide drivers to
>> > guests, except to pre-build a series of images with these drivers
>> > pre-installed (we ended up making a DIB element for that [1]).
>> > Not to mention the need to distribute license tokens to guests and
>> > the whole mess with compatibility between hypervisor and guest drivers
>> > (the guest driver can't be newer than the host one, and the hypervisor
>> > driver can't be too new either).
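>> >
>> > For reference, building such a guest image with diskimage-builder looks
>> > roughly like this (paths and the element list are illustrative, the
>> > element name is taken from [1]):
>> >
>> >     export ELEMENTS_PATH=/path/to/dib-elements
>> >     disk-image-create -o ubuntu-nvgrid ubuntu-minimal vm nvgrid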
>> >
>> > It's not that I'm defending AMD, but just saying that NVIDIA is not
>> > that straightforward either, and at least on paper AMD vGPUs look
>> > easier for both operators and end-users.
>> >
>> > [1] https://github.com/citynetwork/dib-elements/tree/main/nvgrid
>> >
>> > >
>> > > As for AMD cards, AMD stated that some of their MI series cards
>> > > support SR-IOV for vGPUs. However, those drivers were never open-sourced
>> > > or provided as closed source to the public; only large cloud providers
>> > > are able to get them. So I don't really recommend getting AMD cards for
>> > > vGPU unless you are able to get support from them.
>> > >
>> >
>>
>

