<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 22, 2023 at 10:43, Mahendra Paipuri <<a href="mailto:mahendra.paipuri@cnrs.fr">mahendra.paipuri@cnrs.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p>Hello all,</p>
    <p>Thanks @Ulrich for sharing the presentation. Very informative!!</p>
<p>One question: if I understood correctly, <b>time-sliced</b> vGPUs
      <b>absolutely need</b> GRID drivers and licensed clients for the
      vGPUs to work in the guests, while for MIG partitioning there is <b>no
        need</b> to install GRID drivers in the guests and also <b>no
        need</b> for licensed clients. Could you confirm whether this is
      actually the case?</p></div></blockquote><div><br></div><div>Again, I'm not part of NVIDIA, nor am I paid by them, but you can look at their GRID licensing here:</div><div><a href="https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html">https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html</a></div><div><br></div><div>If you also look at the NVIDIA docs for RHEL support, you need a vCS (Virtual Compute Server) licence for Ampere MIG profiles like the C-series:</div><div><a href="https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-red-hat-el-kvm/index.html#hardware-configuration">https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-red-hat-el-kvm/index.html#hardware-configuration</a></div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
    <p>Cheers.<br>
    </p>
    <p>Regards</p>
    <p>Mahendra<br>
    </p>
    <div>On 21/06/2023 16:10, Ulrich
      Schwickerath wrote:<br>
    </div>
    <blockquote type="cite">
      
      <p>Hi, all,</p>
<p>Sylvain explained quite well how to do it technically. We have
        a PoC running; however, we still have some stability issues, as
        mentioned at the summit. We're running the NVIDIA virtualisation
        drivers on the hypervisors and the guests, which requires a
        license from NVIDIA. In our configuration we are still quite
        limited, in the sense that all cards in the same hypervisor
        have to be configured in the same way, that is, with the same MIG
        partitioning. Also, it is not possible to attach more than one
        device to a single VM.<br>
      </p>
<p>As mentioned in the presentation we are a bit behind with Nova,
        and are fixing this as we speak. Because of that
        we had to do a couple of backports in Nova to make it work,
        which we hope to get rid of through the ongoing upgrades.<br>
      </p>
      <p>Let me  see if I can make the slides available here. <br>
      </p>
      <p>Cheers, Ulrich<br>
      </p>
      <div>On 20/06/2023 19:07, Oliver Weinmann
        wrote:<br>
      </div>
      <blockquote type="cite"> Hi
        everyone,
        <div><br>
        </div>
<div>Jumping into this topic again. Unfortunately I haven’t had
          time yet to test NVIDIA vGPU in OpenStack, only in VMware
          vSphere. What our users complain most about is the
          inflexibility, since you have to use the same profile on all
          VMs that use the GPU. One user suggested trying SLURM. I know
          there is no official OpenStack project for SLURM, but I wonder
          if anyone else has tried this approach? If I understood correctly,
          this would also not require any NVIDIA subscription, since you
          pass the GPU through to a single instance and use neither
          vGPU nor MIG.</div>
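As a rough sketch of this passthrough approach (whole GPU, no vGPU/MIG, so no GRID licensing on the host), assuming an A100 whose PCI IDs are 10de:20b0; the alias name and flavor are made up here, and on older Nova releases the [pci] option is called passthrough_whitelist rather than device_spec:

```shell
# Sketch: whole-GPU PCI passthrough in Nova. Vendor/product IDs, alias
# and flavor names are illustrative. The [pci] alias must also be set on
# the controller nodes, and PciPassthroughFilter must be enabled in the
# scheduler's enabled_filters.
cat >> /etc/nova/nova.conf <<'EOF'
[pci]
device_spec = { "vendor_id": "10de", "product_id": "20b0" }
alias = { "vendor_id": "10de", "product_id": "20b0", "device_type": "type-PCI", "name": "a100" }
EOF

# A flavor that requests one full GPU through the alias
openstack flavor create --ram 65536 --vcpus 16 --disk 100 gpu.a100
openstack flavor set gpu.a100 --property "pci_passthrough:alias"="a100:1"
```

Instances booted with this flavor get the whole device, and the guest then needs only the regular (non-GRID) datacenter driver.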
        <div><br>
        </div>
        <div>Cheers,</div>
        <div>Oliver<br>
          <br>
<div dir="ltr">Sent from my iPhone</div>
          <div dir="ltr"><br>
<blockquote type="cite">On Jun 20, 2023, at 17:34, Sylvain Bauza <a href="mailto:sbauza@redhat.com" target="_blank"><sbauza@redhat.com></a> wrote:<br>
              <br>
            </blockquote>
          </div>
          <blockquote type="cite">
            <div dir="ltr">
              <div dir="ltr">
                <div dir="ltr"><br>
                </div>
                <br>
                <div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Jun 20, 2023 at 16:31, Mahendra Paipuri <<a href="mailto:mahendra.paipuri@cnrs.fr" target="_blank">mahendra.paipuri@cnrs.fr</a>> wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div>
                      <p>Thanks Sylvain for the pointers.</p>
<p>One of the questions we have is: can we create
                        MIG profiles on the host and then attach one or
                        more profiles to each VM? This bug [1]
                        reports that once we attach one profile to a VM,
                        the rest of the MIG profiles become unavailable. From
                        what you have said about using SR-IOV and VFs, I
                        guess this should be possible.<br>
                      </p>
                    </div>
                  </blockquote>
                  <div><br>
                  </div>
<div>Correct, what you need is to first create the VFs
                    using sriov-manage, and then you can create the MIG
                    instances.</div>
                  <div>Once you create the MIG instances using the
                    profiles you want, you will see that the related
                    available_instances for the NVIDIA mdev type (by
                    looking at sysfs) says that you can have a
                    single vGPU for this profile.</div>
                  <div>Then, you can use that mdev type with Nova via
                    nova.conf.</div>
                  <div><br>
                  </div>
                  <div>That being said, while the above is simple, the
                    talk below said more about how to correctly use
                    the GPU from the host, so please wait :-)</div>
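For reference, the flow described above can be sketched roughly as below; the PCI addresses, MIG profile (1g.5gb) and mdev type name (nvidia-699) are illustrative and vary by driver release and GPU model:

```shell
# Sketch of the MIG + SR-IOV + mdev flow described above, assuming an
# A100 with the NVIDIA vGPU host driver installed; all addresses and
# type names below are illustrative.

# 1. Create the SR-IOV virtual functions for the physical GPU
/usr/lib/nvidia/sriov-manage -e 0000:41:00.0

# 2. Enable MIG mode and create a MIG instance with the desired profile
nvidia-smi -i 0 -mig 1
nvidia-smi mig -i 0 -cgi 1g.5gb -C

# 3. Check the mdev types exposed through one of the VFs; once the MIG
#    instance exists, available_instances for the matching profile is 1
cat /sys/bus/pci/devices/0000:41:00.4/mdev_supported_types/nvidia-699/available_instances

# 4. Point Nova at that mdev type in nova.conf on the compute node
cat >> /etc/nova/nova.conf <<'EOF'
[devices]
enabled_mdev_types = nvidia-699

[mdev_nvidia-699]
device_addresses = 0000:41:00.4
EOF
```

After restarting nova-compute, the vGPU should appear as a VGPU inventory on the compute node's resource provider in Placement.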
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div>
                      <p> </p>
<p>I think you are talking about the "vGPUs with
                        OpenStack Nova" talk on the OpenInfra stage. I will
                        look into it once the videos are online. <br>
                      </p>
                    </div>
                  </blockquote>
                  <div><br>
                  </div>
                  <div>Indeed.</div>
                  <div>-S <br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div>
                      <p> </p>
                      <p>[1] <a href="https://bugs.launchpad.net/nova/+bug/2008883" target="_blank">https://bugs.launchpad.net/nova/+bug/2008883</a></p>
                      <p>Thanks</p>
                      <p>Regards</p>
                      <p>Mahendra<br>
                      </p>
                      <div>On 20/06/2023 15:47, Sylvain Bauza wrote:<br>
                      </div>
                      <blockquote type="cite">
                        <div dir="ltr">
                          <div dir="ltr"><br>
                          </div>
                          <br>
                          <div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Jun 20, 2023 at 15:12, PAIPURI Mahendra <<a href="mailto:mahendra.paipuri@cnrs.fr" target="_blank">mahendra.paipuri@cnrs.fr</a>> wrote:<br>
                            </div>
                            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                              <div>
                                <div id="m_4751079946466621058m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                  <p>Hello Ulrich,</p>
                                  <p><br>
                                  </p>
<p>I am relaunching this discussion as
                                    I noticed that you gave a talk about
                                    this topic at the OpenInfra Summit in
                                    Vancouver. Is it possible to share
                                    the presentation here? I hope the
                                    talks will be uploaded to YouTube
                                    soon. </p>
                                  <p><br>
                                  </p>
<p>We are mainly interested in using
                                    MIG instances in an OpenStack cloud,
                                    and I could not really find much
                                    information by googling. If you
                                    could share your experiences, that
                                    would be great.</p>
                                  <p><br>
                                  </p>
                                </div>
                              </div>
                            </blockquote>
                            <div><br>
                            </div>
<div>Due to scheduling conflicts I wasn't
                              able to attend Ulrich's session, but I'm
                              very interested in hearing his
                              feedback.</div>
                            <div><br>
                            </div>
<div>FWIW, there was also a short session
                              about how to enable MIG and play with Nova
                              at the OpenInfra stage (that one I was
                              able to attend), and it was quite
                              seamless. What exact information are you
                              looking for?</div>
                            <div>The idea with MIG is that you need to
                              create SR-IOV VFs above the MIG instances
                              using the sriov-manage script provided by
                              NVIDIA, so that the mediated devices will
                              use those VFs as the base PCI devices
                              for Nova.</div>
                            <div><br>
                            </div>
                            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                              <div>
                                <div id="m_4751079946466621058m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                  <p> </p>
                                  <p>Cheers.</p>
                                  <p><br>
                                  </p>
                                  <p>Regards</p>
                                  <p>Mahendra</p>
                                </div>
                                <hr style="display:inline-block;width:98%">
<div id="m_4751079946466621058m_7191695452821608857m_2020284182605405898divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Ulrich
                                    Schwickerath <<a href="mailto:Ulrich.Schwickerath@cern.ch" target="_blank">Ulrich.Schwickerath@cern.ch</a>><br>
                                    <b>Sent:</b> Monday, January 16,
                                    2023 11:38:08<br>
                                    <b>To:</b> <a href="mailto:openstack-discuss@lists.openstack.org" target="_blank">openstack-discuss@lists.openstack.org</a><br>
                                    <b>Subject:</b> Re: Re: Experience
                                    with VGPUs</font>
                                  <div> </div>
                                </div>
                                <div>
                                  <p>Hi, all,</p>
<p>just to add to the discussion: at
                                    CERN we have recently deployed a
                                    bunch of A100 GPUs in PCI
                                    passthrough mode, and are now
                                    looking into improving their usage
                                    by using MIG. From the Nova point of
                                    view things seem to work OK: we can
                                    schedule VMs requesting a vGPU, the
                                    client starts up and gets a license
                                    token from our NVIDIA license server
                                    (distributing license keys in our
                                    private cloud is relatively easy in
                                    our case). It's a PoC only for the
                                    time being, and we're not ready to
                                    put that forward, as we're facing
                                    issues with CUDA on the client (it
                                    fails immediately in memory
                                    operations with 'not supported';
                                    we're still investigating why this
                                    happens). <br>
                                  </p>
<p>Once we get that working, it would
                                    be nice to have more fine-grained
                                    scheduling so that people can ask
                                    for MIG devices of different sizes.
                                    The other challenge is how to set
                                    limits on GPU resources. Once the
                                    above issues have been sorted out we
                                    may want to look into Cyborg as
                                    well, so we are quite interested in
                                    first experiences with it.</p>
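On the scheduling side, requesting a vGPU already goes through Placement via a flavor; a minimal sketch (flavor and resource-class names are made up here), with the per-mdev-type mdev_class option in nova.conf being one way to expose MIG profiles of different sizes as distinct resource classes:

```shell
# Sketch: a flavor requesting one vGPU through the Placement VGPU
# resource class; the flavor name is illustrative.
openstack flavor create --ram 16384 --vcpus 8 --disk 50 gpu.small
openstack flavor set gpu.small --property "resources:VGPU=1"

# To let users ask for MIG devices of different sizes, each mdev type
# can be mapped to its own custom resource class in nova.conf, e.g.:
#   [mdev_nvidia-699]
#   mdev_class = CUSTOM_VGPU_1G_5GB
# and then requested with resources:CUSTOM_VGPU_1G_5GB=1 on the flavor.
```

Quota limits on those resource classes would then be a matter of unified limits / Placement rather than per-flavor tricks.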
                                  <p>Kind regards, </p>
                                  <p>Ulrich<br>
                                  </p>
                                  <div>On 13.01.23 21:06, Dmitriy
                                    Rabotyagov wrote:<br>
                                  </div>
                                  <blockquote type="cite">
                                    <div dir="auto">
<div>That said, the deb/rpm
                                        packages they provide
                                        don't help much, as:
                                        <div dir="auto">* There is no
                                          repo for them, so you need to
                                          download them manually from the
                                          enterprise portal</div>
                                        <div dir="auto">* They can't be
                                          upgraded anyway, as the driver
                                          version is part of the package
                                          name, and each package
                                          conflicts with every other
                                          one. So you need to explicitly
                                          remove the old package and only
                                          then install the new one. And yes,
                                          you must stop all VMs before
                                          upgrading the driver, and no, you
                                          can't live migrate GPU mdev
                                          devices, as that is not yet
                                          implemented in QEMU. So
                                          deb/rpm/generic driver doesn't
                                          matter in the end, tbh.</div>
                                        <br>
                                        <br>
                                        <div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jan 13, 2023 at 20:56, Cedric
                                            <<a href="mailto:yipikai7@gmail.com" target="_blank">yipikai7@gmail.com</a>> wrote:<br>
                                          </div>
                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="auto"><br>
                                              Ended up with the very
                                              same conclusions as
                                              Dmitriy regarding the use
                                              of NVIDIA GRID for the
                                              vGPU use case with Nova:
                                              it works pretty well, but:<br>
                                              <br>
                                              - respecting the licensing
                                              model is an operational
                                              constraint; note that
                                              guests need to reach a
                                              license server in order to
                                              get a token (either via
                                              the NVIDIA SaaS service or
                                              on-prem)<br>
                                              - drivers for both guest
                                              and hypervisor are not
                                              easy to deploy and
                                              maintain at large scale. A
                                              year ago, hypervisor
                                              drivers were not packaged
                                              for Debian/Ubuntu, but
                                              built through a bash
                                              script, thus requiring
                                              additional automation
                                              work and careful attention
                                              regarding kernel
                                              updates/reboots of Nova
                                              hypervisors.<br>
                                              <br>
                                              Cheers</div>
                                            <br>
                                            <br>
                                            On Fri, Jan 13, 2023 at 4:21
                                            PM Dmitriy Rabotyagov <<a href="mailto:noonedeadpunk@gmail.com" rel="noreferrer noreferrer
                                              noreferrer noreferrer" target="_blank">noonedeadpunk@gmail.com</a>>
                                            wrote:<br>
                                            ><br>
> You are saying that as if
                                            Nvidia GRID drivers were
                                            open-sourced, while<br>
                                            > in fact they're super
                                            far from being that. In
                                            order to download<br>
                                            > drivers not only for
                                            hypervisors, but also for
                                            guest VMs, you need to<br>
                                            > have an account in
                                            their Enterprise Portal. It
                                            took me roughly 6 weeks<br>
                                            > of discussions with
                                            hardware vendors and Nvidia
                                            support to get a<br>
                                            > proper account there.
                                            And that happened only after
                                            applying for their<br>
                                            > Partner Network (NPN).<br>
                                            > That still doesn't
                                            solve the issue of how to
                                            provide drivers to<br>
                                            > guests, except
                                            pre-building a series of images
                                            with these drivers<br>
                                            > pre-installed (we ended
                                            up making a DIB element
                                            for that [1]).<br>
                                            > Not to mention the
                                            need to distribute license
                                            tokens for guests and<br>
                                            > the whole mess with
                                            compatibility between
                                            hypervisor and guest drivers<br>
                                            > (as the guest driver can't
                                            be newer than the host one, and
                                            HVs can't be too<br>
                                            > new either).<br>
                                            ><br>
                                            > It's not that I'm
                                            protecting AMD, but just
                                            saying that Nvidia is not<br>
                                            > that straightforward
                                            either, and at least on
                                            paper AMD vGPUs look<br>
                                            > easier both for
                                            operators and end-users.<br>
                                            ><br>
                                            > [1] <a href="https://github.com/citynetwork/dib-elements/tree/main/nvgrid" rel="noreferrer noreferrer
                                              noreferrer noreferrer
                                              noreferrer" target="_blank">
https://github.com/citynetwork/dib-elements/tree/main/nvgrid</a><br>
                                            ><br>
                                            > ><br>
> > As for AMD cards,
                                            AMD stated that some of
                                            their MI-series cards
                                            support SR-IOV for vGPUs.
                                            However, those drivers are
                                            never open source, nor
                                            provided in closed-source
                                            form to the public; only
                                            large cloud providers are
                                            able to get them. So I don't
                                            really recommend getting AMD
                                            cards for vGPU unless you are
                                            able to get support from them.<br>
                                            > ><br>
                                            ><br>
                                          </blockquote>
                                        </div>
                                      </div>
                                    </div>
                                  </blockquote>
                                </div>
                              </div>
                            </blockquote>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                  </blockquote>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
      </blockquote>
    </blockquote>
  </div>

</blockquote></div></div>