Hey team,

Is anyone working with NVLink in their clouds? How are you handling passing through the right set of GPUs per NVLink 'group'?

I have servers with 4 sets of 2-way NVLinks (8 cards in pairs of 2). I'm able to pass the PCI devices through to the VM no problem, but there is no guarantee that both GPUs of an NVLink pair get passed through together. If a VM ends up with 1 GPU from one pair and 1 GPU from another pair, we run into all sorts of issues (as you'd expect).
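
To make the constraint concrete, here is a rough Python sketch of what I mean by keeping pairs together. The PCI addresses, the NVLINK_PAIRS table, and both function names are made up for illustration; this is not Nova code, just the check I'd want something in the scheduling path to perform.

```python
# Hypothetical PCI addresses for 8 GPUs wired as 4 NVLink pairs.
# Real addresses would have to be read from the host itself.
NVLINK_PAIRS = [
    ("0000:04:00.0", "0000:05:00.0"),
    ("0000:08:00.0", "0000:09:00.0"),
    ("0000:84:00.0", "0000:85:00.0"),
    ("0000:88:00.0", "0000:89:00.0"),
]


def broken_pairs(assigned, pairs=NVLINK_PAIRS):
    """Return the pairs that have exactly one GPU in `assigned`."""
    assigned = set(assigned)
    return [p for p in pairs if len(assigned.intersection(p)) == 1]


def pick_whole_pairs(free, count, pairs=NVLINK_PAIRS):
    """Pick `count` pairs whose GPUs are both free.

    Returns a flat list of PCI addresses, or None if fewer than
    `count` complete pairs are available.
    """
    free = set(free)
    whole = [p for p in pairs if free.issuperset(p)]
    if len(whole) < count:
        return None
    return [addr for p in whole[:count] for addr in p]
```

With this table, a VM that was given "0000:04:00.0" and "0000:08:00.0" would be flagged by broken_pairs because it holds half of two different pairs, which is exactly the situation I'm trying to avoid.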

So I guess I'm looking to understand how Nova can be made NVLink-aware. I'm struggling to find any conversation or material on the topic, but I assume it's been done before?

I did find this talk [1], which mentions this problem, but they wrote some sort of hack to sit in between Nova and QEMU. While that's quite a clever solution, it seems like there must be a better way to do this in 2022?

[1] - https://www.openstack.org/videos/summits/vancouver-2018/can-we-boost-more-hpc-performance-integrate-ibm-power-servers-with-gpus-to-openstack-environment

Cheers,

Cory