Re: 答复: Experience with VGPUs

Mahendra Paipuri mahendra.paipuri at cnrs.fr
Tue Jun 20 14:11:09 UTC 2023


Thanks Sylvain for the pointers.

One of the questions we have is: can we create MIG profiles on the host 
and then attach each one or more profile(s) to VMs? This bug [1] reports 
that once we attach one profile to a VM, rest of MIG profiles become 
unavailable. From what you have said about using SR-IOV and VFs, I guess 
this should be possible.

I think you are talking about "vGPUs with OpenStack Nova" talk on 
OpenInfra stage. I will look into it once the videos will be online.

[1] https://bugs.launchpad.net/nova/+bug/2008883

Thanks

Regards

Mahendra

On 20/06/2023 15:47, Sylvain Bauza wrote:
>
>
> Le mar. 20 juin 2023 à 15:12, PAIPURI Mahendra 
> <mahendra.paipuri at cnrs.fr> a écrit :
>
>     Hello Ulrich,
>
>
>     I am relaunching this discussion as I noticed that you gave a talk
>     about this topic at OpenInfra Summit in Vancouver. Is it possible
>     to share the presentation here? I hope the talks will be uploaded
>     soon in YouTube.
>
>
>     We are mainly interested in using MIG instances in Openstack cloud
>     and I could not really find a lot of information by googling. If
>     you could share your experiences, that would be great.
>
>
>
> Due to scheduling conflicts, I wasn't able to attend Ulrich's session 
> but his feedback will be greatly listened to by me.
>
> FWIW, there was also a short session about how to enable MIG and play 
> with Nova at the OpenInfra stage (and that one I was able to attend 
> it), and it was quite seamless. What exact information are you looking 
> for ?
> The idea with MIG is that you need to create SRIOV VFs above the MIG 
> instances using sriov-manage script provided by nvidia so that the 
> mediated devices will use those VFs as the base PCI devices to be used 
> for Nova.
>
>     Cheers.
>
>
>     Regards
>
>     Mahendra
>
>     ------------------------------------------------------------------------
>     *De :* Ulrich Schwickerath <Ulrich.Schwickerath at cern.ch>
>     *Envoyé :* lundi 16 janvier 2023 11:38:08
>     *À :* openstack-discuss at lists.openstack.org
>     *Objet :* Re: 答复: Experience with VGPUs
>
>     Hi, all,
>
>     just to add to the discussion, at CERN we have recently deployed a
>     bunch of A100 GPUs in PCI passthrough mode, and are now looking
>     into improving their usage by using MIG. From the NOVA point of
>     view things seem to work OK, we can schedule VMs requesting a
>     VGPU, the client starts up and gets a license token from our
>     NVIDIA license server (distributing license keys is our private
>     cloud is relatively easy in our case). It's a PoC only for the
>     time being, and we're not ready to put that forward as we're
>     facing issues with CUDA on the client (it fails immediately in
>     memory operations with 'not supported', still investigating why
>     this happens).
>
>     Once we get that working it would be nice to be able to have a
>     more fine grained scheduling so that people can ask for MIG
>     devices of different size. The other challenge is how to set
>     limits on GPU resources. Once the above issues have been sorted
>     out we may want to look into cyborg as well thus we are quite
>     interested in first experiences with this.
>
>     Kind regards,
>
>     Ulrich
>
>     On 13.01.23 21:06, Dmitriy Rabotyagov wrote:
>>     To have that said, deb/rpm packages they are providing doesn't
>>     help much, as:
>>     * There is no repo for them, so you need to download them
>>     manually from enterprise portal
>>     * They can't be upgraded anyway, as driver version is part of the
>>     package name. And each package conflicts with any another one. So
>>     you need to explicitly remove old package and only then install
>>     new one. And yes, you must stop all VMs before upgrading driver
>>     and no, you can't live migrate GPU mdev devices due to that now
>>     being implemented in qemu. So deb/rpm/generic driver doesn't
>>     matter at the end tbh.
>>
>>
>>     пт, 13 янв. 2023 г., 20:56 Cedric <yipikai7 at gmail.com>:
>>
>>
>>         Ended up with the very same conclusions than Dimitry
>>         regarding the use of Nvidia Vgrid for the VGPU use case with
>>         Nova, it works pretty well but:
>>
>>         - respecting the licensing model as operationnal constraints,
>>         note that guests need to reach a license server in order to
>>         get a token (could be via the Nvidia SaaS service or on-prem)
>>         - drivers for both guest and hypervisor are not easy to
>>         implement and maintain on large scale. A year ago,
>>         hypervisors drivers were not packaged to Debian/Ubuntu, but
>>         builded though a bash script, thus requiering additional
>>         automatisation work and careful attention regarding kernel
>>         update/reboot of Nova hypervisors.
>>
>>         Cheers
>>
>>
>>         On Fri, Jan 13, 2023 at 4:21 PM Dmitriy Rabotyagov
>>         <noonedeadpunk at gmail.com> wrote:
>>         >
>>         > You are saying that, like Nvidia GRID drivers are
>>         open-sourced while
>>         > in fact they're super far from being that. In order to download
>>         > drivers not only for hypervisors, but also for guest VMs
>>         you need to
>>         > have an account in their Enterprise Portal. It took me
>>         roughly 6 weeks
>>         > of discussions with hardware vendors and Nvidia support to
>>         get a
>>         > proper account there. And that happened only after applying
>>         for their
>>         > Partner Network (NPN).
>>         > That still doesn't solve the issue of how to provide drivers to
>>         > guests, except pre-build a series of images with these drivers
>>         > pre-installed (we ended up with making a DIB element for
>>         that [1]).
>>         > Not saying about the need to distribute license tokens for
>>         guests and
>>         > the whole mess with compatibility between hypervisor and
>>         guest drivers
>>         > (as guest driver can't be newer then host one, and HVs
>>         can't be too
>>         > new either).
>>         >
>>         > It's not that I'm protecting AMD, but just saying that
>>         Nvidia is not
>>         > that straightforward either, and at least on paper AMD
>>         vGPUs look
>>         > easier both for operators and end-users.
>>         >
>>         > [1]
>>         https://github.com/citynetwork/dib-elements/tree/main/nvgrid
>>         >
>>         > >
>>         > > As for AMD cards, AMD stated that some of their MI series
>>         card supports SR-IOV for vGPUs. However, those drivers are
>>         never open source or provided closed source to public, only
>>         large cloud providers are able to get them. So I don't really
>>         recommend getting AMD cards for vGPU unless you are able to
>>         get support from them.
>>         > >
>>         >
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.openstack.org/pipermail/openstack-discuss/attachments/20230620/5c7872e7/attachment-0001.htm>


More information about the openstack-discuss mailing list