Re: Experience with VGPUs
Mahendra Paipuri
mahendra.paipuri at cnrs.fr
Tue Jun 20 14:11:09 UTC 2023
Thanks Sylvain for the pointers.
One of the questions we have is: can we create MIG profiles on the host
and then attach one or more of those profiles to VMs? This bug [1]
reports that once we attach one profile to a VM, the rest of the MIG
profiles become unavailable. From what you have said about using SR-IOV
and VFs, I guess this should be possible.
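For reference, this is roughly what we mean by creating MIG profiles
on the host (just a sketch; the GPU index and profile IDs below are
illustrative A100 examples, not from our setup):

    # List the MIG profiles the GPU supports
    nvidia-smi mig -lgip
    # Enable MIG mode on GPU 0, then carve out three 1g.5gb instances
    sudo nvidia-smi -i 0 -mig 1
    sudo nvidia-smi mig -cgi 19,19,19 -C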
I think you are referring to the "vGPUs with OpenStack Nova" talk on
the OpenInfra stage. I will look into it once the videos are online.
[1] https://bugs.launchpad.net/nova/+bug/2008883
Thanks
Regards
Mahendra
On 20/06/2023 15:47, Sylvain Bauza wrote:
>
>
> On Tue, 20 Jun 2023 at 15:12, PAIPURI Mahendra
> <mahendra.paipuri at cnrs.fr> wrote:
>
> Hello Ulrich,
>
>
> I am relaunching this discussion as I noticed that you gave a talk
> about this topic at the OpenInfra Summit in Vancouver. Is it
> possible to share the presentation here? I hope the talks will be
> uploaded to YouTube soon.
>
>
> We are mainly interested in using MIG instances in an OpenStack
> cloud, and I could not really find much information by googling.
> If you could share your experiences, that would be great.
>
>
>
> Due to scheduling conflicts I wasn't able to attend Ulrich's
> session, but I will be paying close attention to his feedback.
>
> FWIW, there was also a short session about how to enable MIG and
> play with Nova at the OpenInfra stage (that one I was able to
> attend), and it looked quite seamless. What exact information are
> you looking for?
> The idea with MIG is that you need to create SR-IOV VFs on top of
> the MIG instances using the sriov-manage script provided by NVIDIA,
> so that the mediated devices will use those VFs as the base PCI
> devices for Nova.
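>
> As a rough illustration of that flow (the PCI addresses and paths
> are placeholders; check the NVIDIA vGPU documentation for your
> driver version):
>
>     # Enable the SR-IOV VFs on the physical GPU
>     sudo /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
>     # Each VF then exposes the MIG-backed mdev types that Nova can
>     # consume as mediated devices
>     ls /sys/class/mdev_bus/0000:41:00.4/mdev_supported_types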
>
> Cheers.
>
>
> Regards
>
> Mahendra
>
> ------------------------------------------------------------------------
> *From:* Ulrich Schwickerath <Ulrich.Schwickerath at cern.ch>
> *Sent:* Monday, 16 January 2023 11:38:08
> *To:* openstack-discuss at lists.openstack.org
> *Subject:* Re: Experience with VGPUs
>
> Hi, all,
>
> just to add to the discussion: at CERN we have recently deployed a
> bunch of A100 GPUs in PCI passthrough mode, and we are now looking
> into improving their usage with MIG. From the Nova point of view
> things seem to work OK: we can schedule VMs requesting a vGPU, and
> the client starts up and gets a license token from our NVIDIA
> license server (distributing license keys in our private cloud is
> relatively easy in our case). It's a PoC only for the time being,
> and we're not ready to put it forward yet, as we're facing issues
> with CUDA on the client (it fails immediately on memory operations
> with 'not supported'; we are still investigating why this happens).
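>
> In case it helps anyone reproducing this, a quick way to confirm
> from inside the guest that the token was actually acquired
> (assuming the standard guest driver tooling) is something like:
>
>     # Inside the VM: the license status should report 'Licensed'
>     nvidia-smi -q | grep -i license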
>
> Once we get that working, it would be nice to have more
> fine-grained scheduling so that people can ask for MIG devices of
> different sizes. The other challenge is how to set limits on GPU
> resources. Once the above issues have been sorted out we may want
> to look into Cyborg as well, so we are quite interested in first
> experiences with this.
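>
> For what it's worth, a sketch of how different MIG sizes could be
> exposed as separate mdev types in nova.conf on a recent Nova (the
> type names and PCI addresses are placeholders, not from our PoC):
>
>     [devices]
>     enabled_mdev_types = nvidia-699, nvidia-700
>
>     [mdev_nvidia-699]
>     device_addresses = 0000:41:00.4,0000:41:00.5
>
>     [mdev_nvidia-700]
>     device_addresses = 0000:41:00.6
>
> Each mdev type (i.e. each MIG geometry) is pinned to its own set
> of VFs, so the different sizes become separately schedulable.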
>
> Kind regards,
>
> Ulrich
>
> On 13.01.23 21:06, Dmitriy Rabotyagov wrote:
>> That said, the deb/rpm packages they provide don't help much, as:
>> * There is no repo for them, so you need to download them manually
>> from the enterprise portal.
>> * They can't be upgraded anyway, as the driver version is part of
>> the package name and each package conflicts with every other one,
>> so you need to explicitly remove the old package and only then
>> install the new one. And yes, you must stop all VMs before
>> upgrading the driver; and no, you can't live-migrate GPU mdev
>> devices, as that is not yet implemented in QEMU. So
>> deb/rpm/generic driver doesn't really matter in the end, tbh.
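>>
>> To make the upgrade dance concrete, it looks roughly like this on
>> an RPM host (the package names and versions are made up for the
>> example):
>>
>>     # First stop every VM with a vGPU on this hypervisor; mdev
>>     # live migration is not an option
>>     sudo rpm -e NVIDIA-vGPU-rhel-8.6-525.105.14
>>     sudo rpm -i NVIDIA-vGPU-rhel-8.6-535.54.03.x86_64.rpm
>>     sudo reboot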
>>
>>
>> Fri, 13 Jan 2023, 20:56 Cedric <yipikai7 at gmail.com>:
>>
>>
>> Ended up with the very same conclusions as Dmitriy regarding the
>> use of NVIDIA GRID for the vGPU use case with Nova; it works
>> pretty well, but:
>>
>> - the licensing model must be respected as an operational
>> constraint; note that guests need to reach a license server in
>> order to get a token (either the NVIDIA SaaS service or on-prem)
>> - drivers for both guest and hypervisor are not easy to deploy
>> and maintain at large scale. A year ago, hypervisor drivers were
>> not packaged for Debian/Ubuntu but built through a bash script,
>> thus requiring additional automation work and careful attention
>> to kernel updates/reboots of Nova hypervisors.
>>
>> Cheers
>>
>>
>> On Fri, Jan 13, 2023 at 4:21 PM Dmitriy Rabotyagov
>> <noonedeadpunk at gmail.com> wrote:
>> >
>> > You are saying that as if Nvidia GRID drivers were
>> > open-sourced, while in fact they're super far from being that.
>> > In order to download drivers, not only for hypervisors but also
>> > for guest VMs, you need to have an account in their Enterprise
>> > Portal. It took me roughly 6 weeks of discussions with hardware
>> > vendors and Nvidia support to get a proper account there. And
>> > that happened only after applying for their Partner Network
>> > (NPN).
>> > That still doesn't solve the issue of how to provide drivers to
>> > guests, except to pre-build a series of images with these
>> > drivers pre-installed (we ended up making a DIB element for
>> > that [1]).
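>> >
>> > For illustration, building such an image with diskimage-builder
>> > looks roughly like this (the path is a placeholder and the
>> > element name is the one from the repo in [1]):
>> >
>> >     export ELEMENTS_PATH=/path/to/dib-elements
>> >     disk-image-create ubuntu vm nvgrid -o ubuntu-vgpu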
>> > Not to mention the need to distribute license tokens to guests,
>> > and the whole mess with compatibility between hypervisor and
>> > guest drivers (the guest driver can't be newer than the host
>> > one, and the hypervisor driver can't be too new either).
>> >
>> > It's not that I'm defending AMD; I'm just saying that Nvidia is
>> > not that straightforward either, and at least on paper AMD
>> > vGPUs look easier for both operators and end-users.
>> >
>> > [1] https://github.com/citynetwork/dib-elements/tree/main/nvgrid
>> >
>> > >
>> > > As for AMD cards, AMD stated that some of their MI-series
>> > > cards support SR-IOV for vGPUs. However, those drivers are
>> > > never open source, nor provided as closed source to the
>> > > public; only large cloud providers are able to get them. So I
>> > > don't really recommend getting AMD cards for vGPU unless you
>> > > are able to get support from them.
>> > >
>> >
>>