[nova] NVLink passthrough
Hey team,

Is anyone working with NVLink in their clouds? How are you handling passing through the right set of GPUs per NVLink 'group'?

I have servers with 4 sets of 2-way NVLinks (8 cards in pairs of 2). I'm able to pass the PCI devices through to the VM no problem, but there is no guarantee that the NVLink pair gets passed through together, and if we end up with one GPU from one pair and one GPU from another pair then we run into all sorts of issues (as you'd expect).

So I'm looking to understand how Nova can be NVLink-aware, I guess, but I'm struggling to find any conversation or material on the topic. I assume it's been done before?

I did find this talk [1] which mentioned this problem, but they wrote some sort of hack to sit in between Nova and QEMU. While quite a clever solution, it seems like there must be a better way to do this in 2022?

[1] - https://www.openstack.org/videos/summits/vancouver-2018/can-we-boost-more-hp...

Cheers,
Cory
On Sat, 2022-06-25 at 23:05 +0930, Cory Hawkvelt wrote:
> Hey team,
> Is anyone working with NVLink in their clouds? How are you handling passing through the right set of GPUs per NVLink 'group'?
In terms of Nova vGPU support we really only support passing a single vGPU to the guest. For PCI passthrough we don't currently have any kind of grouping by NVLink group, PF, etc., so that is not something that is really supported today.
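For reference, plain GPU passthrough today looks something like the sketch below (vendor/product IDs and names are illustrative). Nothing in it ties the two members of an NVLink pair together, so the scheduler is free to pick any two matching GPUs on the host:

    [pci]
    # expose all matching GPUs for passthrough
    passthrough_whitelist = { "vendor_id": "10de", "product_id": "20b0" }
    # one alias matching the same devices by vendor/product ID
    alias = { "name": "a100", "vendor_id": "10de", "product_id": "20b0", "device_type": "type-PCI" }

and a flavor that simply asks for any two of them:

    openstack flavor set gpu.a100.x2 --property "pci_passthrough:alias"="a100:2"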
> I have servers with 4 sets of 2-way NVLinks (8 cards in pairs of 2). I'm able to pass the PCI devices through to the VM no problem, but there is no guarantee that the NVLink pair gets passed through together, and if we end up with one GPU from one pair and one GPU from another pair then we run into all sorts of issues (as you'd expect).
> So I'm looking to understand how Nova can be NVLink-aware, I guess, but I'm struggling to find any conversation or material on the topic. I assume it's been done before?
Today it is not, but this could actually be supported indirectly with the PCI-in-placement work that is in flight in Zed. I say indirectly because that won't actually let you model the NVLink groups cleanly, but https://specs.openstack.org/openstack/nova-specs/specs/zed/approved/pci-devi... provides a way to list devices with traits or custom resource classes. That would allow you to tag each device in the whitelist with its NVLink group:

    device_spec = {
        "address": "<gpu address>",
        "resource_class": "CUSTOM_NVIDIA_A100",
        "traits": "CUSTOM_NVLINK_GROUP_1"
    }

The PCI alias will then be able to request the GPU using the resource class and trait, so you could create an alias per group. That unfortunately means we need a flavor per alias to use all the groups, so you would need 4 flavors.
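To make that concrete, here is a rough sketch of what a per-group configuration might look like once that work lands (the PCI addresses, alias and flavor names are made up, and the exact alias fields for requesting by resource class and trait could differ from what is finally merged):

    [pci]
    # both GPUs of NVLink pair 1, tagged with the same custom resource class and trait
    device_spec = { "address": "0000:3b:00.0", "resource_class": "CUSTOM_NVIDIA_A100", "traits": "CUSTOM_NVLINK_GROUP_1" }
    device_spec = { "address": "0000:3c:00.0", "resource_class": "CUSTOM_NVIDIA_A100", "traits": "CUSTOM_NVLINK_GROUP_1" }
    # one alias per NVLink group, selecting devices by resource class + trait
    alias = { "name": "a100-nvlink-pair-1", "resource_class": "CUSTOM_NVIDIA_A100", "traits": "CUSTOM_NVLINK_GROUP_1" }

and then one flavor per group that asks for both GPUs of that pair:

    openstack flavor set gpu.a100.pair1 --property "pci_passthrough:alias"="a100-nvlink-pair-1:2"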
> I did find this talk [1] which mentioned this problem, but they wrote some sort of hack to sit in between Nova and QEMU. While quite a clever solution, it seems like there must be a better way to do this in 2022?
You are the first person I'm aware of to actually ask for NVLink affinity upstream, so no, unfortunately no one has really looked at a better way. If we can discover the groups, then once we complete modelling PCI devices in Placement we could model the group structure there and perhaps extend the alias to allow that to be requested cleanly. Basically we would model the groups in the resource provider tree and use the same_subtree request parameter to get the pairs of devices (a sketch of that query follows below). This would be a similar mechanism to the one we would use for PF affinity or anti-affinity.

But no, unfortunately even with master there is not really anything you can do today to enable your use case in Nova.

The other related feature that comes up with GPU passthrough is that we do not support multifunction device passthrough today. E.g. on NVIDIA GPUs the audio and graphics endpoints are two different PCI subfunctions on a single device. The Windows driver expects that, and it won't work if you pass both through, because they are passed through to the guest as two separate devices.
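To illustrate the mechanism (Nova does not generate anything like this for NVLink today, so this is purely a sketch of the Placement side): if each NVLink pair were modelled as a parent resource provider with its two GPUs as child providers, a granular allocation-candidates query using same_subtree (Placement microversion 1.36+) could request two GPUs that share a pair:

    GET /allocation_candidates?
        resources_GPU1=CUSTOM_NVIDIA_A100:1&
        resources_GPU2=CUSTOM_NVIDIA_A100:1&
        group_policy=isolate&
        same_subtree=_GPU1,_GPU2

group_policy=isolate forces the two request groups onto different GPU providers, while same_subtree keeps both providers under the same NVLink-group parent, which is essentially the PF-affinity-style mechanism described above.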
> [1] - https://www.openstack.org/videos/summits/vancouver-2018/can-we-boost-more-hp...
>
> Cheers,
> Cory
participants (2)
- Cory Hawkvelt
- Sean Mooney