Open Stack

Mon Jan 16 15:36:45 UTC 2023

Hi Ulrich,

I believe this is a perfect use case for Cyborg which provides
state-of-the-art heterogeneous hardware management and is easy to use.

cc: Brin Zhang

Thank you
Regards
Li Liu

On Mon, Jan 16, 2023 at 5:39 AM Ulrich Schwickerath <
Ulrich.Schwickerath at cern.ch> wrote:

> Hi, all,
>
> just to add to the discussion, at CERN we have recently deployed a bunch
> of A100 GPUs in PCI passthrough mode, and are now looking into improving
> their usage by using MIG. From the Nova point of view things seem to work
> OK: we can schedule VMs requesting a VGPU, and the client starts up and gets a
> license token from our NVIDIA license server (distributing license keys in
> our private cloud is relatively easy in our case). It's a PoC only for the
> time being, and we're not ready to put that forward as we're facing issues
> with CUDA on the client (it fails immediately in memory operations with
> 'not supported', still investigating why this happens).
>
> Once we get that working it would be nice to have more fine-grained
> scheduling, so that people can ask for MIG devices of different
> sizes. The other challenge is how to set limits on GPU resources. Once the
> above issues have been sorted out we may want to look into Cyborg as well,
> so we are quite interested in first experiences with it.
>
> Kind regards,
>
> Ulrich
> On 13.01.23 21:06, Dmitriy Rabotyagov wrote:
>
> That said, the deb/rpm packages they provide don't help much,
> as:
> * There is no repo for them, so you need to download them manually from
> the enterprise portal
> * They can't be upgraded anyway, as the driver version is part of the package
> name, and each package conflicts with every other one. So you need to
> explicitly remove the old package and only then install the new one. And yes, you
> must stop all VMs before upgrading the driver, and no, you can't live migrate
> GPU mdev devices, due to that not being implemented in qemu. So
> deb/rpm/generic driver doesn't matter in the end tbh.
>
>
> On Fri, 13 Jan 2023, 20:56 Cedric <yipikai7 at gmail.com> wrote:
>
>>
>> Ended up with the very same conclusions as Dmitriy regarding the use of
>> Nvidia Vgrid for the VGPU use case with Nova. It works pretty well, but:
>>
>> - the licensing model has to be respected as an operational constraint; note that
>> guests need to reach a license server in order to get a token (could be via
>> the Nvidia SaaS service or on-prem)
>> - drivers for both guest and hypervisor are not easy to deploy and
>> maintain at large scale. A year ago, hypervisor drivers were not packaged
>> for Debian/Ubuntu, but built through a bash script, thus requiring
>> additional automation work and careful attention regarding kernel
>> updates/reboots of Nova hypervisors.
>>
>> Cheers
>>
>>
>> On Fri, Jan 13, 2023 at 4:21 PM Dmitriy Rabotyagov <
>> noonedeadpunk at gmail.com> wrote:
>> >
>> > You are saying that as if Nvidia GRID drivers were open source, while
>> > in fact they're super far from being that. In order to download
>> > drivers not only for hypervisors, but also for guest VMs you need to
>> > have an account in their Enterprise Portal. It took me roughly 6 weeks
>> > of discussions with hardware vendors and Nvidia support to get a
>> > proper account there. And that happened only after applying for their
>> > Partner Network (NPN).
>> > That still doesn't solve the issue of how to provide drivers to
>> > guests, except pre-build a series of images with these drivers
>> > pre-installed (we ended up with making a DIB element for that [1]).
>> > Not to mention the need to distribute license tokens to guests and
>> > the whole mess with compatibility between hypervisor and guest drivers
>> > (as the guest driver can't be newer than the host one, and HVs can't be too
>> > new either).
>> >
>> > It's not that I'm defending AMD, but just saying that Nvidia is not
>> > that straightforward either, and at least on paper AMD vGPUs look
>> > easier both for operators and end-users.
>> >
>> > [1] https://github.com/citynetwork/dib-elements/tree/main/nvgrid
>> >
>> > >
>> > > As for AMD cards, AMD stated that some of their MI series cards
>> > > support SR-IOV for vGPUs. However, those drivers have never been
>> > > open-sourced or released to the public as closed source; only large
>> > > cloud providers are able to get them. So I don't really recommend
>> > > getting AMD cards for vGPU unless you are able to get support from them.
>> > >
>> >
>>
>

-- 
Thank you

Regards

Li
Re: Reply: Experience with VGPUs
