Hi all,
This is a continuation of a thread from 3 years ago: https://lists.openstack.org/pipermail/openstack-discuss/2022-June/029273.html
We've recently faced the same issue with Nvidia NVLink / NVSwitch,
and I wanted to share how we are solving it.
Given:
- a server with 8x Nvidia H200 pGPUs intended to be used via PCI passthrough
- individual pGPUs can be connected with NVLink, but not in arbitrary
combinations
- there are only 4 specific combinations when pairing 2 GPUs -
gpu1+gpu3, gpu2+gpu4, gpu5+gpu7, gpu6+gpu8 (see also the note right after
this list)
- and there are only 2 possible combinations when grouping 4 GPUs -
gpu{1..4} and gpu{5..8}
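As a side note, which pairings are possible on a given host can be checked with

nvidia-smi topo -m

where NVLink-capable GPU pairs show an NV<n> entry in the matrix, while pairs
that can only talk over PCIe/NUMA show SYS, NODE or PIX (the exact matrix of
course depends on the platform).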
The requirement is to have a single flavor that asks for 2 GPUs, where
scheduling picks only an appropriate pair between which NVLink is possible,
and the compute node enables/disables NVLink for those GPUs as needed
(and the same for a 4-GPU flavor).
The compute part is solved by a libvirt hook provided by Nvidia at
https://github.com/NVIDIA/libvirt-hooks/. Provided the correct devices are
chosen by Nova to be passed into the VM, the hook will enable NVLink between them.
However, the scheduling part is currently all but impossible with Nova as-is.
The best one can do is assign a separate trait to each appropriate pair and
quartet of GPUs, and then have 4 flavors for the 2-GPU configuration and
2 flavors for the 4-GPU one, roughly as sketched below.
This of course is very inefficient and brittle, and cannot work in a
self-service cloud.
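That sketch, with made-up trait names and placeholder UUIDs (and without
claiming it is a complete recipe), would be something like:

# one custom trait per valid pair (and per quartet)
openstack trait create CUSTOM_NVLINK_PAIR_1_3
# tag the GPU resource providers of that pair with it
# (note that "trait set" replaces the provider's whole trait list)
openstack resource provider trait set \
    --trait CUSTOM_NVLINK_PAIR_1_3 <gpu1 PCI RP uuid>
openstack resource provider trait set \
    --trait CUSTOM_NVLINK_PAIR_1_3 <gpu3 PCI RP uuid>
# ...repeat for the other 3 pairs and the 2 quartets, then define
# 4 two-GPU flavors and 2 four-GPU flavors, each pinned to its own trait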
What we are trying to do is the following:
- add a new flavor extra spec `placement:same_subtree`. The value will be passed
directly to placement's allocation candidates API.
- in order for this to be usable in a flavor, we need predictable resource
group names to be generated for PCI devices. Right now they are auto-generated
from the request ID, which won't work here.
- hence we extended the `pci_passthrough:alias` extra spec to accept an
optional `:group_prefix` part.
Then, what we do is reshape the PCI resource provider tree a bit.
We introduce two extra layers of providers, let's call them `group4` and
`group2`, where each layer gets its own trait, for example
`CUSTOM_PCI_GROUP_OF_4` and `CUSTOM_PCI_GROUP_OF_2` respectively.
The two `group4` providers have the compute node as their parent, and each is
in turn the parent of two `group2` providers.
Then we re-parent the PCI resource providers to the `group2` providers, so that
each `group2` and `group4` subtree corresponds to a set of GPUs that can be
connected via NVLink.
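So for the topology above the reshaped tree looks like this (provider names
are illustrative):

compute node RP
+-- group4_0 (CUSTOM_PCI_GROUP_OF_4)
|   +-- group2_0 (CUSTOM_PCI_GROUP_OF_2)
|   |   +-- gpu1 PCI RP (PGPU:1)
|   |   +-- gpu3 PCI RP (PGPU:1)
|   +-- group2_1 (CUSTOM_PCI_GROUP_OF_2)
|       +-- gpu2 PCI RP (PGPU:1)
|       +-- gpu4 PCI RP (PGPU:1)
+-- group4_1 (CUSTOM_PCI_GROUP_OF_4)
    +-- group2_2 (CUSTOM_PCI_GROUP_OF_2)
    |   +-- gpu5 PCI RP (PGPU:1)
    |   +-- gpu7 PCI RP (PGPU:1)
    +-- group2_3 (CUSTOM_PCI_GROUP_OF_2)
        +-- gpu6 PCI RP (PGPU:1)
        +-- gpu8 PCI RP (PGPU:1)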
Luckily enough, the resource tracker of nova-compute does not care about the
parent provider as long as the name and ID of the PCI resource providers are
not changed, so the reshape has no effect on it.
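If one were to do such a reshape by hand with osc-placement, it would look
roughly like this (a sketch only; UUIDs and names are placeholders, and
re-parenting an existing provider needs a recent placement microversion):

# create the extra layers under the compute node RP
openstack resource provider create --parent-provider <compute RP uuid> <host>_group4_0
openstack resource provider create --parent-provider <group4_0 uuid> <host>_group2_0
# create and assign the group traits
openstack trait create CUSTOM_PCI_GROUP_OF_2
openstack resource provider trait set --trait CUSTOM_PCI_GROUP_OF_2 <group2_0 uuid>
# (same for CUSTOM_PCI_GROUP_OF_4 on the group4 providers)
# re-parent the GPU PCI RPs under their group2 provider, keeping their names
openstack --os-placement-api-version 1.37 resource provider set \
    --name <existing gpu1 PCI RP name> \
    --parent-provider <group2_0 uuid> <gpu1 PCI RP uuid>
# and verify the resulting tree
openstack --os-placement-api-version 1.31 resource provider list --in-tree <compute RP uuid>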
We also need all of this to work with the AggregateInstanceExtraSpecsFilter
enabled, which treats unscoped extra specs such as `group_policy` as aggregate
metadata keys, so we've duplicated that extra spec as a scoped
`placement:group_policy` one.
And of course, for all of this to work, one has to enable the "PCI in
Placement" feature too:
https://docs.openstack.org/nova/2025.2/admin/pci-passthrough.html#pci-tracking-in-placement
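For reference, the relevant nova-compute configuration looks roughly like this
(vendor/product IDs are placeholders, see the doc above for the details):

[pci]
report_in_placement = True
device_spec = { "vendor_id": "10de", "product_id": "<h200 pci id>", "resource_class": "PGPU" }
alias = { "name": "h200", "vendor_id": "10de", "product_id": "<h200 pci id>", "device_type": "type-PCI" }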
As a result, assuming the PCI alias for the GPUs is `h200`
and its resource class is `PGPU`,
we can set the following flavor extra specs:
pci_passthrough:alias=h200:2:_nvlink2
trait_groupof2:CUSTOM_PCI_GROUP_OF_2=required
placement:group_policy=none
placement:same_subtree=_nvlink2-0,_nvlink2-1,_groupof2
that will result in the following placement query:
GET /allocation_candidates
?resources_nvlink2-0=PGPU:1
&resources_nvlink2-1=PGPU:1
&required_groupof2=CUSTOM_PCI_GROUP_OF_2
&group_policy=none
&same_subtree=_nvlink2-0,_nvlink2-1,_groupof2
That will select only a pair of GPUs that are children of a single provider
with the `CUSTOM_PCI_GROUP_OF_2` trait - which is exactly what we need!
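For completeness, setting all of this on a flavor with the plain openstack
client looks something like (flavor name and sizing are arbitrary):

openstack flavor create --vcpus 16 --ram 131072 --disk 100 h200.nvlink.x2
openstack flavor set h200.nvlink.x2 \
    --property pci_passthrough:alias=h200:2:_nvlink2 \
    --property trait_groupof2:CUSTOM_PCI_GROUP_OF_2=required \
    --property placement:group_policy=none \
    --property placement:same_subtree=_nvlink2-0,_nvlink2-1,_groupof2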
You can find this work proposed as a patch at https://review.opendev.org/966205.
This is definitely just the beginning of the discussion, and it will most
certainly require a spec, so I am eager to hear your thoughts and comments :-)
Cheers,
pas-ha