Re: Experience with VGPUs

22 Jun 2023

On Wed, Jun 21, 2023 at 18:23, Dmitriy Rabotyagov <noonedeadpunk@gmail.com>
wrote:
...
I recall that fairly recent Nvidia driver release notes say they now
allow attaching multiple vGPUs to a single VM, but I also recall Sylvain
saying that this is not quite what it sounds like and that there are severe
limitations to this advertised feature.

That's the problem with enabling this feature in Nova: we mostly depend
on a very specific external Linux driver. So, to be clear, if you want to
use vGPUs, please look at the Nvidia documentation *first* :)
As for multiple vGPUs, Nvidia says support depends on the GPU architecture
(and this has changed over the last few years):

(quoting Nvidia here)
*The supported vGPUs depend on the architecture of the GPU on which the
vGPUs reside: *

   - *For GPUs based on the NVIDIA Volta architecture and later GPU
   architectures, all Q-series and C-series vGPUs are supported. On GPUs that
   support the Multi-Instance GPU (MIG) feature, both time-sliced and
   MIG-backed vGPUs are supported. *
   - *For GPUs based on the NVIDIA Pascal™ architecture, only Q-series and
   C-series vGPUs that are allocated all of the physical GPU's frame buffer
   are supported. *
   - *For GPUs based on the NVIDIA Maxwell™ graphic architecture,
   only Q-series vGPUs that are allocated all of the physical GPU's frame
   buffer are supported. *

*You can assign multiple vGPUs with differing amounts of frame buffer to a
single VM, provided the board type and the series of all the vGPUs is the
same. For example, you can assign an A40-48C vGPU and an A40-16C vGPU to
the same VM. However, you cannot assign an A30-8C vGPU and an A16-8C vGPU
to the same VM. *
https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-red-hat-el-kvm/i...
As a reminder, you can find the vGPU types here:
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#virtual-...
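To make the board/series rule quoted above concrete, here is a minimal
Python sketch. It assumes vGPU type names follow the simple
`<board>-<frame buffer in GB><series letter>` form used in the Nvidia
examples (A40-48C, A16-8C); other naming variants are not handled. The
helper names (`parse_vgpu_type`, `can_share_vm`) are made up for
illustration and are not part of Nova or of any Nvidia tool. It only checks
the "same board type and same series" condition, not the per-architecture
restrictions.

```python
import re
from typing import Iterable, NamedTuple


class VGPUType(NamedTuple):
    board: str            # e.g. "A40"
    frame_buffer_gb: int  # e.g. 48
    series: str           # e.g. "C" or "Q"


# Assumption: names look like "<board>-<GB><series letter>", e.g. "A40-48C".
_NAME_RE = re.compile(r"^(?P<board>[A-Za-z0-9]+)-(?P<fb>\d+)(?P<series>[A-Z])$")


def parse_vgpu_type(name: str) -> VGPUType:
    """Split a vGPU type name into board, frame buffer size and series."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised vGPU type name: {name!r}")
    return VGPUType(match["board"], int(match["fb"]), match["series"])


def can_share_vm(names: Iterable[str]) -> bool:
    """Return True if all vGPU types have the same board type and series.

    Frame buffer sizes are allowed to differ, as in the quoted rule.
    """
    parsed = [parse_vgpu_type(name) for name in names]
    return len({(p.board, p.series) for p in parsed}) <= 1


print(can_share_vm(["A40-48C", "A40-16C"]))  # same board, same series
print(can_share_vm(["A30-8C", "A16-8C"]))    # different boards
```

The first call prints True and the second prints False, matching the two
examples in the Nvidia text.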
Basically, what changed is that with the recent Volta and Ampere
architectures, Nvidia is now able to provide different vGPUs that each take
a slice of the frame buffer. Previously, Nvidia was only able to pin a vGPU
taking the whole pGPU frame buffer to a single VM, which de facto limited
the instance to a single attached vGPU (or to a second vGPU attached from
another pGPU, which is non-trivial to schedule).

For that reason, we initially limited VGPU allocation requests to a
maximum of 1 in Nova, since the behaviour depended so heavily on the
hardware. I eventually proposed removing that limitation with
https://review.opendev.org/c/openstack/nova/+/845757 which would need some
further work and testing (which is nearly impossible with upstream CI since
the Nvidia drivers are proprietary and licensed).
Any operator wanting to lift that limitation would get my full attention
if they volunteered to *test* that patch. Ping me on IRC in
#openstack-nova (bauzas) and we could proceed quickly.
...
Also, I think in MIG mode it's possible to split a GPU into a subset of
supported (but different) flavors, though I have close to no idea how
scheduling would be done in this case.

This is quite simple: you need to create MIG instances using different,
heterogeneous profiles, and you'll then see that *some* mdev types
accordingly have an inventory of 1.
You could then use a feature we introduced in Xena, which allows the
Nova libvirt driver to create different custom resource classes:
https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/gene...
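
As a rough illustration of what such a configuration could look like, the
Python sketch below renders a nova.conf fragment that maps each enabled
mdev type to its own custom resource class. The mdev type names, PCI
addresses and resource class names are hypothetical placeholders. The
option names (`enabled_mdev_types`, `device_addresses`, `mdev_class`) are
the ones I understand the Xena feature to use, so please check them against
the Nova documentation for your release. `render_nova_conf` is just a
helper made up for this example.

```python
import configparser
import io

# Hypothetical values for illustration only: replace with the mdev types
# and PCI addresses that your own MIG instances actually expose.
MDEV_TYPES = {
    "nvidia-700": {
        "addresses": ["0000:41:00.4"],
        "resource_class": "CUSTOM_MIG_SMALL",
    },
    "nvidia-701": {
        "addresses": ["0000:41:00.5"],
        "resource_class": "CUSTOM_MIG_LARGE",
    },
}


def render_nova_conf(mdev_types: dict) -> str:
    """Render a nova.conf fragment with one [mdev_<type>] section per type."""
    conf = configparser.ConfigParser()
    # List every mdev type the compute node should expose.
    conf["devices"] = {"enabled_mdev_types": ", ".join(mdev_types)}
    for name, spec in mdev_types.items():
        conf[f"mdev_{name}"] = {
            # Which devices provide this mdev type.
            "device_addresses": ",".join(spec["addresses"]),
            # Custom resource class reported to Placement for this type.
            "mdev_class": spec["resource_class"],
        }
    buffer = io.StringIO()
    conf.write(buffer)
    return buffer.getvalue()


print(render_nova_conf(MDEV_TYPES))
```

A flavor would then request one of these classes through an extra spec
such as `resources:CUSTOM_MIG_SMALL=1`, so that scheduling is done per
resource class rather than on the generic VGPU class.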

Again, testing this in a real production environment is the crux of the
problem. We provided as many functional tests as we could in order to
verify such things, but getting a real MIG-backed GPU and configuring it
appropriately is something we are missing and which would be useful for
tracking bugs.

Last point, I'm more than open to collaborating with CERN or any other
operator wanting to stabilize the vGPU feature enablement in Nova. I know
that the existing feature has quite a long list of bug reports and some
severe limitations, but I'd be more than happy to get some guidance
from operators on how and what to stabilize.

-Sylvain

On Wed, Jun 21, 2023, 17:36 Ulrich Schwickerath <ulrich.schwickerath@cern.ch>
wrote:
...
Hi, again,
here's a link to my slides:
https://cernbox.cern.ch/s/v3YCyJjrZZv55H2
Let me know if it works.
Cheers, Ulrich

Sylvain Bauza