Hi all,


This is a continuation of a thread from 3 years ago: https://lists.openstack.org/pipermail/openstack-discuss/2022-June/029273.html


We've recently faced the same issue regarding NVIDIA NVLink / NVSwitch, and I wanted to share how we are solving it.


Given:


- a server with 8x NVIDIA H200 pGPUs intended to be consumed via PCI passthrough
- individual pGPUs can be connected over NVLink, but not in arbitrary combinations
- there are only 4 valid combinations when pairing 2 GPUs: gpu1+gpu3, gpu2+gpu4, gpu5+gpu7, gpu6+gpu8
- and there are only 2 possible combinations when grouping 4 GPUs: gpu{1..4} and gpu{5..8}


The requirement is to have a single flavor that asks for 2 GPUs, with scheduling picking only an appropriate pair between which NVLink is possible, and the compute host enabling/disabling NVLink for those GPUs as appropriate (and the same for a 4-GPU flavor).


The compute part is solved by a libvirt hook provided by NVIDIA: https://github.com/NVIDIA/libvirt-hooks/. Provided the correct devices are chosen by Nova to be passed into the VM, the hook will enable NVLink between them.


However, the scheduling part is currently all but impossible with Nova as-is. The best one can do is assign a separate trait to each appropriate pair and quartet of GPUs, and then have 4 flavors for the 2-GPU configuration and 2 flavors for the 4-GPU one. This of course is very inefficient and brittle, and cannot work in a self-service cloud.


What we are trying to do is the following:


- add a new flavor extra spec `placement:same_subtree`. Its value is passed directly to placement's allocation candidates API.
- in order for this to be usable in a flavor, we need predictable resource group names generated for the PCI devices. Right now they are auto-generated from the request ID, which won't work.
- hence we extended the `pci_passthrough:alias` extra spec with an optional `:group_prefix` part (see the example right after this list).
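
Schematically, the extended alias spec and the resulting request group names look like this (using the 2-GPU example shown further below):

pci_passthrough:alias=<alias_name>:<count>:<group_prefix>
e.g. pci_passthrough:alias=h200:2:_nvlink2 -> placement request groups _nvlink2-0 and _nvlink2-1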


Then, what we do is reshape the PCI resource provider tree a bit. We introduce two extra layers of providers, let's call them `group4` and `group2`, where each layer carries a specific trait, for example `CUSTOM_PCI_GROUP_OF_4` and `CUSTOM_PCI_GROUP_OF_2` respectively. The two `group4` providers have the compute node as their parent, and each is in turn the parent of two `group2` providers. We then re-parent the PCI resource providers to the `group2` providers, so that each `group2` and `group4` subtree corresponds to a set of GPUs that can be connected via NVLink. Luckily, the resource tracker of nova-compute does not care about the parent provider as long as the names and UUIDs of the PCI resource providers are unchanged, so the reshape does not disturb it.
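
For the box above, the reshaped provider tree then looks roughly like this (the group provider names are purely illustrative):

compute node
  group4-0   trait: CUSTOM_PCI_GROUP_OF_4
    group2-0   trait: CUSTOM_PCI_GROUP_OF_2
      PCI RP for gpu1
      PCI RP for gpu3
    group2-1   trait: CUSTOM_PCI_GROUP_OF_2
      PCI RP for gpu2
      PCI RP for gpu4
  group4-1   trait: CUSTOM_PCI_GROUP_OF_4
    group2-2   trait: CUSTOM_PCI_GROUP_OF_2
      PCI RP for gpu5
      PCI RP for gpu7
    group2-3   trait: CUSTOM_PCI_GROUP_OF_2
      PCI RP for gpu6
      PCI RP for gpu8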


We also need all of this to work with the AggregateInstanceExtraSpecsFilter enabled; since that filter treats un-namespaced extra specs such as `group_policy` as aggregate metadata requirements, we've duplicated the `group_policy` flavor extra spec as a namespaced `placement:group_policy` one.


And of course, for all of this to work, one has to enable the "PCI in placement" feature as well: https://docs.openstack.org/nova/2025.2/admin/pci-passthrough.html#pci-tracking-in-placement
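
For reference, a minimal sketch of the nova.conf bits needed for that (the H200 product ID below is a placeholder, and the exact options may vary with your deployment):

[pci]
report_in_placement = True
device_spec = { "vendor_id": "10de", "product_id": "<h200-product-id>", "resource_class": "PGPU" }
alias = { "name": "h200", "vendor_id": "10de", "product_id": "<h200-product-id>", "device_type": "type-PCI" }

and on the scheduler side:

[filter_scheduler]
pci_in_placement = True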


As a result, assuming the PCI alias for the GPUs is `h200` and its resource class is `PGPU`, we can create the following flavor extra specs:


pci_passthrough:alias=h200:2:_nvlink2
trait_groupof2:CUSTOM_PCI_GROUP_OF_2=required
placement:group_policy=none
placement:same_subtree=_nvlink2-0,_nvlink2-1,_groupof2


which will result in the following placement query:


GET /allocation_candidates
?resources_nvlink2-0=PGPU:1
&resources_nvlink2-1=PGPU:1
&required_groupof2=CUSTOM_PCI_GROUP_OF_2
&group_policy=none
&same_subtree=_nvlink2-0,_nvlink2-1,_groupof2


This will select only a pair of GPUs that are children of a single provider with the `CUSTOM_PCI_GROUP_OF_2` trait - which is exactly what we need!
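
For completeness, setting this up on a flavor is the usual property dance, something like the following (the flavor name is arbitrary):

openstack flavor set h200x2-nvlink \
  --property "pci_passthrough:alias"="h200:2:_nvlink2" \
  --property "trait_groupof2:CUSTOM_PCI_GROUP_OF_2"="required" \
  --property "placement:group_policy"="none" \
  --property "placement:same_subtree"="_nvlink2-0,_nvlink2-1,_groupof2"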


You can find this work proposed as a patch at https://review.opendev.org/966205. This is definitely just the beginning of a discussion, and it will most certainly require a spec, so I am eager to hear your thoughts and comments :-)


Cheers,

pas-ha
--
Dr. Pavlo Shchelokovskyy
Principal Software Engineer
Mirantis Inc