Hello list,
I'm struggling to deploy Rocky with vGPU using the NVIDIA drivers.
Has anyone experienced issues loading the NVIDIA kernel modules?
I'm talking about the hypervisor part of the setup. NVIDIA provides two modules. One of them, nvidia.ko, loads correctly.
The other one, nvidia-vgpu-vfio.ko, does not.
When I try to load it, it seems that the 7.6 kernel is no longer compatible with it:
modprobe nvidia-vgpu-vfio
modprobe: ERROR: could not insert 'nvidia_vgpu_vfio': Invalid argument
dmesg shows this:
nvidia_vgpu_vfio: disagrees about version of symbol vfio_pin_pages
nvidia_vgpu_vfio: Unknown symbol vfio_pin_pages (err -22)
nvidia_vgpu_vfio: disagrees about version of symbol vfio_unpin_pages
nvidia_vgpu_vfio: Unknown symbol vfio_unpin_pages (err -22)
nvidia_vgpu_vfio: disagrees about version of symbol vfio_register_notifier
nvidia_vgpu_vfio: Unknown symbol vfio_register_notifier (err -22)
nvidia_vgpu_vfio: disagrees about version of symbol vfio_unregister_notifier
nvidia_vgpu_vfio: Unknown symbol vfio_unregister_notifier (err -22)
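
The "disagrees about version of symbol" lines mean that the symbol CRCs recorded in the module do not match the CRCs the running kernel exports (the modversions check). A minimal sketch for listing exactly which symbols differ is below. The helper name crc_mismatches is made up for illustration, and the sketch assumes that both inputs are text files whose lines start with "<crc> <symbol>", which is what I expect from "modprobe --dump-modversions" and from the kernel's symvers file. The /boot/symvers path in the comments is an assumption about this host.

```shell
#!/bin/sh
# Hypothetical helper: print symbols whose CRC in the module dump ($1)
# differs from the CRC in the kernel symvers file ($2).
# Both files are expected to have lines starting with "<crc> <symbol>".
crc_mismatches() {
    awk '
        # First file on the command line (kernel symvers): remember CRC per symbol.
        NR == FNR { kernel[$2] = $1; next }
        # Second file (module dump): report symbols the kernel exports with another CRC.
        ($2 in kernel) && kernel[$2] != $1 {
            print $2, "module=" $1, "kernel=" kernel[$2]
        }
    ' "$2" "$1"
}

# Possible use on the hypervisor (paths are assumptions, not run here):
#   modprobe --dump-modversions \
#       /lib/modules/$(uname -r)/weak-updates/nvidia-vgpu-vfio.ko > /tmp/module.crc
#   zcat /boot/symvers-$(uname -r).gz > /tmp/kernel.crc
#   crc_mismatches /tmp/module.crc /tmp/kernel.crc
```

If the four vfio symbols from dmesg show up in the output, the module was built against a vfio interface whose checksums differ from the ones in the installed kernel.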
modinfo nvidia-vgpu-vfio
filename:       /lib/modules/3.10.0-957.27.2.el7.x86_64/weak-updates/nvidia-vgpu-vfio.ko
version:        430.27
supported:      external
license:        MIT
rhelversion:    7.6
srcversion:     0A179A61A02AD500D05FB1A
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        nvidia,mdev,vfio
vermagic:       3.10.0-940.el7.x86_64 SMP mod_unload modversions 
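
Note that the filename above is under the 3.10.0-957.27.2 kernel while the vermagic says 3.10.0-940. A small sketch for comparing the two is below; the function names are made up for illustration. A difference is expected for a module linked in through weak-updates and is not an error by itself, so this only confirms which kernel the module was built for.

```shell
#!/bin/sh
# Hypothetical helper: extract the kernel release (first field) from a vermagic string.
vermagic_release() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

# Hypothetical helper: report whether the module's build kernel matches a given kernel.
# $1: vermagic string, $2: kernel release to compare against (e.g. from `uname -r`)
compare_vermagic() {
    built=$(vermagic_release "$1")
    if [ "$built" = "$2" ]; then
        echo "match: module built for $built"
    else
        echo "differs: module built for $built, running $2"
    fi
}

# Possible use on the hypervisor (not run here):
#   compare_vermagic "$(modinfo -F vermagic nvidia-vgpu-vfio)" "$(uname -r)"
```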
My guess is that somewhere along the RHEL/CentOS 7.6 lifecycle the vfio module changed and broke compatibility.
NVIDIA provides those modules built against the 7.6 BETA release and assumes that weak-modules will make them work on later kernels.
Somehow it does not.
Does anybody have suggestions on how to handle this? I'm working on it with NVIDIA enterprise support, but maybe one of you got there first?
Best regards,
-- 
Piotr Baranowski