On Thu, 21 Apr 2022 at 12:26, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2022-04-20 at 16:42 +0000, Sigurd Kristian Brinch wrote:
Hi, As far as I can tell, libvirt/KVM supports multiple vGPUs per VM ( https://docs.nvidia.com/grid/14.0/grid-vgpu-release-notes-generic-linux-kvm/... ), but in OpenStack/Nova it is limited to one vGPU per VM ( https://docs.openstack.org/nova/latest/admin/virtual-gpu.html#configure-a-fl... ) Is there a reason for this limit?
yes, nvidia.
What would be needed to enable multiple vGPUs in Nova?
so you can technically do it today if you have 2 vGPUs from separate physical gpu cards, but nvidia does not support multiple vGPUs from the same card.
nova does not currently provide a way to force the gpu allocation to be from separate cards.
well, that's not quite true, you could.
you would have to use the named group syntax to request them, so instead of resources:VGPU=2 you would do
resources_first_gpu_group:VGPU=1 resources_second_gpu_group:VGPU=1 group_policy=isolate
the name after resources_ is an arbitrary group name, provided it conforms to this regex '([a-zA-Z0-9_-]{1,64})?'
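A quick way to sanity-check a group name against the regex quoted above is sketched below. It only uses Python's standard `re` module; the helper name `is_valid_group_name` is made up for illustration and is not a nova or placement function. Note that the trailing `?` in the pattern means an empty name also matches.

```python
import re

# Pattern quoted in the mail for the group name that follows "resources_".
# The trailing "?" makes the group optional, so an empty name also matches.
GROUP_NAME_RE = re.compile(r"([a-zA-Z0-9_-]{1,64})?")


def is_valid_group_name(name: str) -> bool:
    """Hypothetical helper: True if `name` fully matches the quoted pattern."""
    return GROUP_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("first_gpu_group", "gpu-2", "bad.name", "x" * 65):
        print(candidate[:20], is_valid_group_name(candidate))
```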
we strongly dislike this approach. first of all, using group_policy=isolate is a global thing, meaning that no two request groups can come from the same provider.
that means you cannot have two sriov VFs from the same physical nic as a result of setting it. if you don't set group_policy the default is none, which means you are no longer guaranteed that they will come from different providers.
so what you would need to do is extend placement to support isolating only specific named groups and then expose that in nova via flavor extra specs, which is not particularly good ux as it is rather complicated and means you need to understand how placement works in depth. placement should really be an implementation detail. i.e.
resources_first_gpu_group:VGPU=1 resources_second_gpu_group:VGPU=1 group_isolate=first_gpu_group,second_gpu_group;...
that fixes the conflict with sriov and all other usages of resource groups like bandwidth based qos
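To make the difference between the three behaviours concrete, here is a toy model in Python. It is not placement's actual algorithm: the function `candidates`, its `isolated` parameter (standing in for the proposed per-group isolation) and the provider names are all assumptions made for illustration. Each group asks for one unit of a resource class, and a candidate is rejected if two isolated groups land on the same provider.

```python
from itertools import product


def candidates(groups, inventory, isolated=frozenset()):
    """Toy enumeration of allocation candidates (not real placement logic).

    groups:    {group_name: resource_class}, one unit requested per group
    inventory: {provider_name: {resource_class: available_units}}
    isolated:  group names that must each use a different provider
    """
    names = sorted(groups)
    # For every group, list the providers that offer its resource class.
    options = [
        [p for p, inv in sorted(inventory.items()) if inv.get(groups[g], 0) >= 1]
        for g in names
    ]
    result = []
    for choice in product(*options):
        mapping = dict(zip(names, choice))
        # Capacity check: count units taken per (provider, resource class).
        used = {}
        for g, p in mapping.items():
            key = (p, groups[g])
            used[key] = used.get(key, 0) + 1
        if any(n > inventory[p][rc] for (p, rc), n in used.items()):
            continue
        # Isolation check: isolated groups must not share a provider.
        iso = [mapping[g] for g in names if g in isolated]
        if len(set(iso)) != len(iso):
            continue
        result.append(mapping)
    return result


# Hypothetical host: two physical GPUs and one NIC with four VFs.
inventory = {
    "pgpu0": {"VGPU": 2},
    "pgpu1": {"VGPU": 2},
    "nic0": {"SRIOV_NET_VF": 4},
}
groups = {
    "gpu_a": "VGPU",
    "gpu_b": "VGPU",
    "vf_a": "SRIOV_NET_VF",
    "vf_b": "SRIOV_NET_VF",
}

if __name__ == "__main__":
    print(len(candidates(groups, inventory)))                      # no isolation
    print(len(candidates(groups, inventory, isolated=set(groups))))  # global isolate
    print(len(candidates(groups, inventory, isolated={"gpu_a", "gpu_b"})))
```

In this toy setup, isolating every group leaves no candidate because both VFs can only come from the single NIC, while isolating only the two GPU groups keeps the VFs on the same NIC and still forces the vGPUs onto different cards.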
the slightly better approach would be to make this simpler to use by doing something like this
resources:VGPU=2 vgpu:gpu_selection_policy=isolate
we would still need the placement feature to isolate by group, but we can hide the detail from the end user with a pre filter in nova
https://github.com/openstack/nova/blob/eedbff38599addd4574084edac8b111c4e1f2... which will transform the resource request and split it up into groups automatically
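The kind of transformation such a pre filter would do can be sketched roughly as follows. This is plain Python on a dict of extra specs, not nova's real prefilter interface; the function name `split_vgpu_request`, the generated group names `vgpu_0`, `vgpu_1`, ... and the `group_isolate` key are assumptions that mirror the proposal above and do not exist in nova or placement today.

```python
def split_vgpu_request(extra_specs, policy_key="vgpu:gpu_selection_policy"):
    """Hypothetical rewrite of resources:VGPU=N into N isolated named groups.

    Returns a new dict; the input is left untouched. If the (proposed)
    policy key is absent or not "isolate", the specs are returned unchanged.
    """
    specs = dict(extra_specs)
    if specs.get(policy_key) != "isolate":
        return specs

    # Remove the simple user-facing request and the policy hint.
    count = int(specs.pop("resources:VGPU", 0))
    specs.pop(policy_key)

    # Emit one named group per requested vGPU.
    names = []
    for i in range(count):
        name = f"vgpu_{i}"  # assumed naming scheme
        specs[f"resources_{name}:VGPU"] = "1"
        names.append(name)

    # Proposed per-group isolation key (not implemented in placement).
    if names:
        specs["group_isolate"] = ",".join(names)
    return specs


if __name__ == "__main__":
    flavor = {"resources:VGPU": "2", "vgpu:gpu_selection_policy": "isolate"}
    print(split_vgpu_request(flavor))
```

The point of the design is that the user only ever writes the two simple keys, and the group syntax stays an internal detail.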
this is a long way to say that if it was not for limitations in the iommu on nvidia gpus, and the fact that they cannot map two vgpus from one physical gpu to a single vm, this would already work out of the box with just resources:VGPU=2. perhaps when intel launch their discrete datacenter gpus their vGPU implementation will not have this limitation. we do not prevent you from requesting 2 vgpus today, it will just fail when qemu tries to use them.
we also have not put the effort into working around the limitation in nvidia's hardware since their drivers also used to block this until the ampere generation, and there has not been a large demand from users to support multiple vgpus.
occasionally someone will ask about it, but in general people either do full gpu passthrough or use 1 vgpu instance.
Correct, that's why we have had this bug report open for a while, but we don't really want to fix it for only one vendor.
hopefully that will help. you can try the first approach today if you have more than one physical gpu per host, e.g.
resources_first_gpu_group:VGPU=1 resources_second_gpu_group:VGPU=1 group_policy=isolate
just be aware of the limitation of group_policy=isolate
Thanks Sean for explaining how to use a workaround.
regards, sean
BR Sigurd