Experience with VGPUs

Sylvain Bauza sbauza at redhat.com
Tue Jan 17 10:04:59 UTC 2023


On Tue, Jan 17, 2023 at 10:00, Tobias Urdin <tobias.urdin at binero.com>
wrote:

> Hello,
>
> We are using vGPUs with Nova on the OpenStack Xena release and we’ve had a
> fairly good experience integrating
> NVIDIA A10 GPUs into our cloud.
>
>
Great to hear. Thanks for your feedback, Tobias, it is much appreciated.


> As we see it, there are some pain points that just go with maintaining the
> GPU feature.
>
> - There is a very tight coupling between the NVIDIA driver in the guest
> (instance) and the one on the compute node, and that needs to
>   be managed.
>
>
As NVIDIA provides proprietary drivers, there isn't much we can do
upstream, even for CI testing.
Many participants in this thread raised this as a common concern and I
understand their pain, but yes, you need third-party tooling to manage
both the driver installation and the licensing servers.


> - Doing maintenance needs more planning, i.e. powering off instances; the
> NVIDIA driver on the compute node needs to be
>   rebuilt on the hypervisor if the kernel is upgraded, unless you’ve
> implemented DKMS for that.
>
>
Ditto. I wish the driver could be less kernel-dependent, but unfortunately
I don't see that changing in the foreseeable future.



> - Because we have different flavors of GPU (we split the A10 cards into
> different flavors for maximum utilization of
>   other compute resources) we added custom traits in the Placement service
> to handle that, and we manage them with
>   a script, since you will get confused quickly when doing anything
> GPU-related manually. [1]
>

True, that's why you can also use generic mdevs, which will create different
resource classes (but ssssht), or use the placement.yaml file to manage your
inventories.
https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/generic-mdevs.html
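
For anyone scripting this: Placement only accepts custom resource classes and
custom traits whose names start with CUSTOM_ and otherwise contain only
uppercase letters, digits and underscores. Below is a minimal sketch of a
helper that derives such a name from an mdev type label. The function name
to_custom_name and the sample labels are hypothetical, chosen only for
illustration; this is not taken from Nova or from the script in [1].

```python
import re


def to_custom_name(raw: str) -> str:
    """Turn an arbitrary label into a CUSTOM_-prefixed Placement-style name.

    Hypothetical helper for illustration; not part of Nova or Placement.
    """
    # Uppercase, then collapse every run of characters outside A-Z / 0-9
    # into a single underscore, trimming underscores at both ends.
    body = re.sub(r"[^A-Z0-9]+", "_", raw.upper()).strip("_")
    if not body:
        raise ValueError(f"cannot derive a name from {raw!r}")
    # Avoid a double prefix when the label is already a custom name.
    name = body if body.startswith("CUSTOM_") else "CUSTOM_" + body
    # Placement limits these names to 255 characters.
    if len(name) > 255:
        raise ValueError("name is longer than 255 characters")
    return name


if __name__ == "__main__":
    # Sample labels are made up; use the mdev types your driver exposes.
    for label in ("nvidia-592", "a10 4q", "custom_vgpu_small"):
        print(label, "->", to_custom_name(label))
```

Generating the names in one place keeps the flavor extra specs and the
inventories in agreement, which is where manual handling tends to go wrong.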


> - Since Nova does not handle recreation of mdevs (or use the new libvirt
> autostart feature for mdevs) we have
>   a systemd unit that executes before the nova-compute service; it walks
> all the libvirt domains and does lookups
>   in Placement to recreate the mdevs before nova-compute starts. [2] [3] [4]
>
>
This is a known issue and we agreed on a direction at the last PTG.
Patches are under review.
https://review.opendev.org/c/openstack/nova/+/864418
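
For readers unfamiliar with the workaround described in the quoted text, here
is a minimal sketch of its two core steps, under stated assumptions. It reads
the mdev UUIDs referenced by a libvirt domain XML (hostdev elements of type
mdev) and recreates a mediated device by writing its UUID to the kernel's
sysfs "create" file. The function names are hypothetical, and the mapping
from a UUID to its parent PCI address and mdev type (which the quoted setup
looks up in Placement) is assumed to be known already. This is not the code
from [2] [3] [4] and not what the patches under review implement.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def mdev_uuids(domain_xml: str) -> List[str]:
    """Return the mdev UUIDs attached to a libvirt domain, in document order."""
    root = ET.fromstring(domain_xml)
    uuids = []
    # Mediated devices appear as <hostdev type='mdev'> with the UUID on
    # the <address> element inside <source>.
    for hostdev in root.findall("./devices/hostdev[@type='mdev']"):
        address = hostdev.find("./source/address")
        if address is not None and address.get("uuid"):
            uuids.append(address.get("uuid"))
    return uuids


def create_mdev(sysfs_root: Path, pci_address: str, mdev_type: str,
                uuid: str) -> Path:
    """Write the UUID to the parent device's 'create' file and return its path.

    sysfs_root is a parameter (normally Path('/sys')) so the function can be
    exercised against a scratch directory without root privileges.
    """
    create_file = (sysfs_root / "class" / "mdev_bus" / pci_address
                   / "mdev_supported_types" / mdev_type / "create")
    create_file.write_text(uuid)
    return create_file
```

On a real hypervisor this needs root, and it has to run before nova-compute
starts, which is what the systemd unit ordering in the quoted setup provides.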

Thanks,
-Sylvain


> Best regards
> Tobias
>
> DISCLAIMER: The material below is provided without any warranty that it
> will actually work for you or your setup. It does
> very specific things that we need and is only provided to give you some
> insight and help. Use at your own risk.
>
> [1] https://paste.opendev.org/show/b6FdfwDHnyJXR0G3XarE/
> [2] https://paste.opendev.org/show/bGtO6aIE519uysvytWv0/
> [3] https://paste.opendev.org/show/bftOEIPxlpLptkosxlL6/
> [4] https://paste.opendev.org/show/bOYBV6lhRON4ntQKYPkb/
>