[nova] NVLink passthrough

Sean Mooney smooney at redhat.com
Mon Jun 27 12:56:12 UTC 2022

On Sat, 2022-06-25 at 23:05 +0930, Cory Hawkvelt wrote:
> Hey team,
> Is anyone working with NVLink in their clouds? How are you handling passing
> through the right set of GPU's per NVLink 'group'

in terms of nova vGPU support we really only support passing a single vGPU to the
guest. for PCI passthrough we don't currently have any kind of grouping by NVLink group
or PF etc., so that is not something that is really supported today.
> I have servers with 4 sets of 2 way NVLinks(8 cards in pairs of 2) and I'm
> able to passthrough the PCI devices to the VM no problem but there is no
> guarantee that the NVLink pair get passed through together and if we end up
> with 1 GPU from one pair and 1 GPU from another pair then we run into all
> sorts of issues(As you'd expect)
> So I'm looking to understand how Nova can be NVLink aware I guess but I'm
> struggling to find any conversation or material on the topic but I assume
> it's been done before?
today it is not, but this could actually be supported indirectly by the PCI-in-placement work
that is in flight for Zed. I say indirectly because that won't actually allow you to model the NVLink groups
cleanly, but https://specs.openstack.org/openstack/nova-specs/specs/zed/approved/pci-device-tracking-in-placement.html
provides a way to list devices with traits or custom resource classes.

that would allow you to tag each device in the whitelist with its NVLink group:

device_spec = {
    "address": "<gpu address>",
    "traits": "CUSTOM_NVLINK_GROUP_1",
}

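as a sketch, the same tagging expressed as nova.conf options might look like this (the PCI addresses and trait names below are invented for illustration; the device_spec option syntax is the one proposed in that spec):

```ini
[pci]
# hypothetical addresses: tag both GPUs of each NVLink pair with the
# same custom trait so they can be requested together later
device_spec = { "address": "0000:3b:00.0", "traits": "CUSTOM_NVLINK_GROUP_1" }
device_spec = { "address": "0000:3c:00.0", "traits": "CUSTOM_NVLINK_GROUP_1" }
device_spec = { "address": "0000:5e:00.0", "traits": "CUSTOM_NVLINK_GROUP_2" }
device_spec = { "address": "0000:5f:00.0", "traits": "CUSTOM_NVLINK_GROUP_2" }
```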
the PCI alias will then be able to request the GPU using the resource class and trait.

so you could create an alias per group.
that unfortunately means we need to have a flavor per alias to use all the groups,
so you would need 4 flavors.
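for example, per-group aliases and a flavor requesting a full pair might look roughly like this (the alias names are invented, and trait support in the alias is assumed to arrive with the in-flight spec rather than being available today):

```ini
[pci]
# hypothetical: one alias per NVLink group, each matching the
# custom trait tagged on that group's GPUs in the device spec
alias = { "name": "gpu-nvlink-group1", "device_type": "type-PCI", "traits": "CUSTOM_NVLINK_GROUP_1" }
alias = { "name": "gpu-nvlink-group2", "device_type": "type-PCI", "traits": "CUSTOM_NVLINK_GROUP_2" }
```

a flavor could then ask for both GPUs of one group with the existing extra spec, e.g. `openstack flavor set gpu-pair-1 --property pci_passthrough:alias=gpu-nvlink-group1:2`.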

> I did find this talk [1] which mentioned this problem but they write some
> sort of hack to sit in between nova and qemu, while quite a clever solution
> it seems like there must be a better way to do this in 2022?

you are the first person I'm aware of to actually ask for NVLink affinity upstream,
so no, unfortunately no one has really looked at a better way.

if we can discover the groups, then once we complete modelling PCI devices in placement we could model the group structure there and
perhaps extend the alias to allow that to be requested cleanly.

basically we would model the groups in the resource provider tree and use the same_subtree request parameter to get the pairs of devices.
this would be a similar mechanism to the one we would use for PF affinity or anti-affinity.
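for illustration only, such a placement query using same_subtree (available since placement API microversion 1.36) could look something like this, assuming each NVLink pair sits under its own resource provider and the GPUs are exposed via an invented custom resource class:

```text
GET /allocation_candidates?resources_GPU1=CUSTOM_NVLINK_GPU:1
    &resources_GPU2=CUSTOM_NVLINK_GPU:1
    &same_subtree=_GPU1,_GPU2
```

same_subtree constrains the two request groups to be satisfied by providers under a common subtree, which is what would keep both GPUs inside one NVLink group.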

but no, unfortunately even with master there is not really anything you can do today to enable your use case in nova.

the other related feature that comes up with GPU passthrough is that we do not support multifunction device passthrough today.
e.g. on NVIDIA GPUs the audio and graphics endpoints are two different PCI functions on a single device. the Windows
driver expects that layout and won't work if you pass through both, as they are passed through to the guest as two separate devices.
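as a concrete illustration (the bus addresses here are made up), lspci typically shows such a GPU as two functions of the same device:

```text
$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation ...
01:00.1 Audio device: NVIDIA Corporation ...
```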

> [1] -
> https://www.openstack.org/videos/summits/vancouver-2018/can-we-boost-more-hpc-performance-integrate-ibm-power-servers-with-gpus-to-openstack-environment
> Cheers,
> Cory

More information about the openstack-discuss mailing list