Re: Experience with VGPUs

Sean Mooney smooney at redhat.com
Fri Jan 13 14:16:15 UTC 2023


On Fri, 2023-01-13 at 13:07 +0000, Gene Kuo wrote:
> Hi Oliver,
> 
> I have some experience using Nvidia vGPUs (Tesla P4) in my own OpenStack cluster. The setup is pretty simple: follow the guides from Nvidia to install the Linux KVM drivers[1] and the OpenStack documentation[2] for attaching vGPU mdevs to your instances. Licensing is on the client (VM) side, not the server (hypervisor) side. The cards you mentioned you are using (RTX 3050/3060) don't support vGPU; Nvidia publishes a list of supported cards[3].
> 
> I have no experience with newer cards using MIG, but I would expect the overall procedure to be similar.
The main difference for MIG mode is that the mdevs are created on top of SR-IOV VFs,
so from a Nova perspective, instead of listing the address of the PF you need to enable the VFs in the config.
Other than that it is more or less the same on the Nova side.

Obviously there is a little more work to configure the VFs etc. on the host for MIG mode, but it is mostly transparent to Nova:
all that changes is which PCI device (the PF or VF) provides the inventories of mdevs which Nova will attach to the VM.
In the MIG case each VF exposes at most 1 mdev instance of a specific type; without MIG the PF exposes multiple instances of a single mdev type.
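For reference, a minimal nova.conf sketch of what that looks like; the mdev type name and PCI addresses below are placeholders only, so check the types actually supported on your host (e.g. via mdevctl types or sysfs) first:

  [devices]
  enabled_mdev_types = nvidia-700

  [mdev_nvidia-700]
  # without MIG: the PF address(es) of the card
  # with MIG: list the addresses of the SR-IOV VFs here instead
  device_addresses = 0000:84:00.0

and the flavor then just requests a VGPU resource:

  openstack flavor set vgpu_flavor --property "resources:VGPU=1"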
> 
> As for AMD cards, AMD has stated that some of their MI series cards support SR-IOV for vGPUs. However, those drivers were never open sourced or provided as closed source to the public; only large cloud providers are able to get them. So I don't really recommend getting AMD cards for vGPU unless you are able to get support from them.
Yeah, so on the AMD side, if you happen to have those drivers, then instead of using the Nova vGPU feature you just use normal PCI passthrough.
PCI passthrough in Nova, contrary to what some assume, was not originally added for SR-IOV networking; it was added for Intel QAT devices and supports
PFs, VFs and non-SR-IOV-capable PCIe devices.

As long as the device is stateless you can use it with the generic PCI passthrough support via the PCI alias in the instance flavor.
So if you have the driver, you just need to create a PCI alias for the AMD GPU VFs and use them like any other accelerator that supports SR-IOV.
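Roughly something like this in nova.conf (the product id and alias name here are made up for illustration; use the real vendor/product ids from lspci -nn for your cards):

  [pci]
  # expose the VFs to nova (this option is called passthrough_whitelist on older releases)
  device_spec = { "vendor_id": "1002", "product_id": "abcd" }
  # alias that flavors can reference
  alias = { "vendor_id": "1002", "product_id": "abcd", "device_type": "type-VF", "name": "amd-gpu-vf" }

then request one VF per instance via the flavor:

  openstack flavor set amd_gpu_flavor --property "pci_passthrough:alias"="amd-gpu-vf:1"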
> 
> Regards,
> Gene Kuo
> 
> [1] https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#red-hat-el-kvm-install-configure-vgpu
> [2] https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
> [3] https://docs.nvidia.com/grid/gpus-supported-by-vgpu.html
> 
> ------- Original Message -------
> On Friday, January 13th, 2023 at 12:03 PM, Brin Zhang <zhangbailin at inspur.com> wrote:
> 
> 
> > -----Original Message-----
> > 
> > > From: Arne Wiebalck [mailto:Arne.Wiebalck at cern.ch]
> > > Sent: January 12, 2023 15:43
> > > To: Oliver Weinmann oliver.weinmann at me.com; openstack-discuss openstack-discuss at lists.openstack.org
> > > Subject: Re: Experience with VGPUs
> > > 
> > > Hi Oliver,
> > > 
> > > The presentation you linked was only at CERN, not from CERN (it was during an OpenStack Day we organised here). Sylvain and/or Mohammed may be available to answer the questions you have related to that deck, or also in general for the integration of GPUs.
> > 
> > > Now, at CERN we also have hypervisors with different GPUs in our fleet, and are also looking into various options for provisioning them efficiently:
> > > as bare metal, as vGPUs, using MIG support, ... and we have submitted a presentation proposal for the upcoming summit to share our experiences.
> > 
> > > If you have very specific questions, we can try to answer them here, but maybe there is interest and it would be more efficient to organize a session/call (e.g. as part of the OpenStack Operators activities or the Scientific SIG?) to exchange experiences on GPU integration and answer questions there?
> > 
> > > What do you and others think?
> > 
> > > Cheers,
> > > Arne
> > > 
> > > ________________________________________
> > > From: Oliver Weinmann oliver.weinmann at me.com
> > > Sent: Thursday, 12 January 2023 07:56
> > > To: openstack-discuss
> > > Subject: Experience with VGPUs
> > > 
> > > Dear All,
> > > 
> > > we are planning a PoC on vGPUs in our OpenStack cluster. Therefore I have a few questions, and generally wanted to ask how well vGPUs are supported in OpenStack. The docs, in particular:
> > > 
> > > https://docs.openstack.org/nova/zed/admin/virtual-gpu.html
> > > 
> > > explain quite well the general implementation.
> > > 
> > > But I am more interested in general experience with using vGPUs in OpenStack. We currently have a small Yoga cluster with a couple of compute nodes, planning to upgrade to Zed soon. Currently our users use consumer cards like the RTX 3050/3060 on their laptops, and the idea would be to provide vGPUs to these users. For this I
> > > would like to do a very small PoC where we first equip one compute node with an Nvidia GPU. A few tips on which card would be a good starting point are also highly appreciated. I know this heavily depends on the server hardware, but that is something I can figure out later. Also, do we need additional software
> > > licenses to run this? I saw this very nice presentation from CERN on vGPUs:
> > > 
> > > https://indico.cern.ch/event/776411/contributions/3345183/attachments/1851624/3039917/02_-_vGPUs_with_OpenStack_-_Accelerating_Science.pdf
> > 
> > > In the table they list Quadro vDWS licenses. I assume we need these in order to use the cards? Also, do we need something like Cyborg for this, or is vGPU fully implemented in Nova?
> > 
> > 
> > You can try using Cyborg to manage your GPU devices; it can also list/attach vGPUs for an instance. If you want to attach/detach a device from an instance you have to change your flavor, because the vGPU/GPU info currently needs to be added to the flavor (if you want to use this feature we may need to separate such GPU metadata from the flavor; we have discussed this in the Nova team before).
> > I work at Inspur; in our InCloud OS product we use Cyborg to manage GPU/vGPU, FPGA, QAT etc. devices, and have adapted the T4/T100 GPUs (which support vGPU) and the A100 (which supports MIG). I think Cyborg is better for managing local GPU devices; please refer to the Cyborg API docs: https://docs.openstack.org/api-ref/accelerator/
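Just to note, with Cyborg the flavor references a Cyborg device profile instead of a VGPU resource, e.g. (the profile name below is only an example; the profile itself is created via the Cyborg API linked above):

  openstack flavor set gpu_flavor --property "accel:device_profile=my-vgpu-profile"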
> > 
> > > Best Regards,
> > 
> > > Oliver
> 



