On Tue, 17 Jan 2023 at 10:00, Tobias Urdin <tobias.urdin@binero.com> wrote:
Hello,

We are using vGPUs with Nova on the OpenStack Xena release and we’ve had a fairly good experience integrating
NVIDIA A10 GPUs into our cloud.


Great to hear, thanks for your feedback, much appreciated Tobias.
 
As we see it, there are some pain points that just come with maintaining the GPU feature.

- There is a very tight coupling of the NVIDIA driver in the guest (instance) and on the compute node that needs to
  be managed.


As NVIDIA provides proprietary drivers, there isn't much we can do upstream, even for CI testing.
Many participants in this thread raised this as a common concern and I understand their pain, but yes, you need third-party tooling to manage both the driver installation and the licensing servers.
 
- Doing maintenance needs more planning, i.e. powering off instances, and the NVIDIA driver on the compute node needs to be
  rebuilt on the hypervisor if the kernel is upgraded, unless you’ve implemented DKMS for that.


Ditto. I wish the driver were less kernel-dependent, but unfortunately I don't see that happening in the foreseeable future.

 
- Because we have different flavors of GPU (we split the A10 cards into different flavors for maximum utilization of
  other compute resources), we added custom traits in the Placement service to handle that. We manage them with
  a script, since you quickly get confused when doing anything GPU-related manually. [1]

True, that's why you can also use generic mdevs, which will create different resource classes (but shhh), or use the placement.yaml file to manage your inventories.
https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/generic-mdevs.html
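
To illustrate the custom-trait approach mentioned above, here is a minimal sketch of how a script might derive valid trait names for each GPU flavor. It is not the script from [1]. The profile labels and the helper name custom_trait_for are hypothetical. The only facts relied on are that Placement custom traits must start with CUSTOM_ and contain only uppercase letters, digits and underscores (255 characters at most), and that a flavor requests a trait with the extra spec trait:<NAME>=required.

```python
import re

# Placement only accepts custom traits that match this pattern (max 255 chars).
_CUSTOM_TRAIT_RE = re.compile(r"^CUSTOM_[A-Z0-9_]+$")


def custom_trait_for(label):
    """Turn a free-form label into a valid Placement custom trait name.

    Hypothetical helper: runs of characters other than A-Z and 0-9
    are collapsed into a single underscore.
    """
    body = re.sub(r"[^A-Z0-9]+", "_", label.upper()).strip("_")
    trait = "CUSTOM_" + body
    if not body or len(trait) > 255 or not _CUSTOM_TRAIT_RE.match(trait):
        raise ValueError("cannot build a custom trait from %r" % (label,))
    return trait


# Hypothetical labels, one per way of splitting the card.
profiles = ["a10-4q", "a10-8q", "a10-24q"]

for profile in profiles:
    trait = custom_trait_for("vgpu " + profile)
    # The matching flavor would carry this extra spec next to resources:VGPU=1.
    print("%s -> trait:%s=required" % (profile, trait))
```

Generating the names in one place keeps the trait set on the resource providers and the extra specs on the flavors from drifting apart, which is most of what makes the manual route confusing.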


- Since Nova does not handle recreation of mdevs (or use the new libvirt autostart feature for mdevs), we have
  a systemd unit that runs before the nova-compute service, walks all the libvirt domains and does lookups
  in Placement to recreate the mdevs before nova-compute starts. [2] [3] [4]


This is a known issue and we agreed on a direction at the last PTG. Patches are under review.
https://review.opendev.org/c/openstack/nova/+/864418
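
Until that lands, the workaround described above boils down to two steps: find the mdev UUIDs each libvirt domain expects, then write each UUID to the kernel's sysfs "create" file for the right parent device and mdev type. The sketch below covers only those two steps and is not the code from [2] [3] [4]. The parent PCI address and the mdev type are hard-coded assumptions here, whereas the original setup looks them up in Placement, and the sample domain XML is made up.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def mdev_uuids(domain_xml):
    """Return the UUIDs of all mdev hostdevs found in a libvirt domain XML."""
    root = ET.fromstring(domain_xml)
    uuids = []
    for hostdev in root.findall("./devices/hostdev[@type='mdev']"):
        address = hostdev.find("./source/address")
        if address is not None and address.get("uuid"):
            uuids.append(address.get("uuid"))
    return uuids


def create_path(parent_pci, mdev_type):
    """sysfs file that creates an mdev when a UUID is written to it."""
    return PurePosixPath(
        "/sys/class/mdev_bus", parent_pci,
        "mdev_supported_types", mdev_type, "create",
    )


# Made-up domain XML; a real script would get it from libvirt for each domain.
SAMPLE = """<domain type='kvm'>
  <name>instance-00000001</name>
  <devices>
    <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
      <source>
        <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
      </source>
    </hostdev>
  </devices>
</domain>"""

for uuid in mdev_uuids(SAMPLE):
    # Assumed parent device and type; only prints what would be written.
    target = create_path("0000:41:00.4", "nvidia-592")
    print("would write %s to %s" % (uuid, target))
```

The sketch only prints the intended writes. A real unit would have to open the sysfs file as root, skip UUIDs that already exist under /sys/bus/mdev/devices, and finish before nova-compute starts.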

Thanks,
-Sylvain
 
Best regards
Tobias

DISCLAIMER: The scripts below are provided without any warranty that they will work for you or your setup. They do
very specific things that we need and are only provided to give you some insight and help. Use at your own risk.

[1] https://paste.opendev.org/show/b6FdfwDHnyJXR0G3XarE/
[2] https://paste.opendev.org/show/bGtO6aIE519uysvytWv0/
[3] https://paste.opendev.org/show/bftOEIPxlpLptkosxlL6/
[4] https://paste.opendev.org/show/bOYBV6lhRON4ntQKYPkb/