<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Hi, all,</p>
<p>Sylvain explained quite well how to do it technically. We have a
      PoC running, but we still have some stability issues, as
      mentioned at the summit. We're running the NVIDIA virtualisation
      drivers on the hypervisors and the guests, which requires a
      license from NVIDIA. In our configuration we are still quite
      limited, in the sense that we have to configure all cards in the
      same hypervisor in the same way, that is, with the same MIG
      partitioning. Also, it is not possible to attach more than one
      device to a single VM.<br>
    </p>
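    <p>To illustrate the shape of such a setup (the mdev type, flavor
      name and addresses below are illustrative placeholders, not our
      actual configuration):</p>
    <pre>
# nova.conf on the hypervisor: a single mdev type for all cards,
# matching the one MIG partitioning scheme used on that node
[devices]
enabled_mdev_types = nvidia-700

# flavor requesting exactly one vGPU (more than one device per VM
# is not possible in this setup)
$ openstack flavor set vgpu.small --property "resources:VGPU=1"
    </pre>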
<p>As mentioned in the presentation, we are a bit behind with Nova
      and in the process of catching up as we speak. Because of that we
      had to do a couple of backports in Nova to make it work, which we
      hope to get rid of through the ongoing upgrades.<br>
    </p>
<p>Let me see if I can make the slides available here.<br>
    </p>
    <p>Cheers, Ulrich<br>
    </p>
    <div class="moz-cite-prefix">On 20/06/2023 19:07, Oliver Weinmann
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:2DD18791-4BFD-4FF0-AAAC-77D8C18FB138@me.com">
      
      Hi everyone,
      <div><br>
      </div>
      <div>Jumping into this topic again. Unfortunately I haven't had
        time yet to test NVIDIA vGPU in OpenStack, only in VMware vSphere.
        What our users complain most about is the inflexibility, since
        you have to use the same profile on all VMs that use the GPU.
        One user suggested trying SLURM. I know there is no official
        OpenStack project for SLURM, but I wonder if anyone else has tried
        this approach? If I understood correctly, this would also not
        require any NVIDIA subscription, since you pass the GPU through to
        a single instance and use neither vGPU nor MIG; see the sketch below.</div>
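      <div><br>
      </div>
      <div>(For reference, a rough, untested sketch of what plain PCI
        passthrough looks like on the Nova side; the vendor/product IDs,
        alias and flavor names are examples for an A100, not a verified
        configuration:)</div>
      <pre>
# nova.conf: whitelist the physical GPU and give it an alias
# (on older releases the [pci] option is passthrough_whitelist)
[pci]
device_spec = { "vendor_id": "10de", "product_id": "20b0" }
alias = { "vendor_id": "10de", "product_id": "20b0", "device_type": "type-PCI", "name": "a100" }

# flavor that hands the whole GPU to a single instance
$ openstack flavor set gpu.pt --property "pci_passthrough:alias"="a100:1"
      </pre>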
      <div><br>
      </div>
      <div>Cheers,</div>
      <div>Oliver<br>
        <br>
        <div dir="ltr">Von meinem iPhone gesendet</div>
        <div dir="ltr"><br>
          <blockquote type="cite">Am 20.06.2023 um 17:34 schrieb Sylvain
            Bauza <a class="moz-txt-link-rfc2396E" href="mailto:sbauza@redhat.com"><sbauza@redhat.com></a>:<br>
            <br>
          </blockquote>
        </div>
        <blockquote type="cite">
          <div dir="ltr">
            <div dir="ltr">
              <div dir="ltr"><br>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">Le mar. 20 juin 2023
                  à 16:31, Mahendra Paipuri <<a href="mailto:mahendra.paipuri@cnrs.fr" moz-do-not-send="true" class="moz-txt-link-freetext">mahendra.paipuri@cnrs.fr</a>>
                  a écrit :<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p>Thanks Sylvain for the pointers.</p>
                    <p>One of the questions we have is: can we create
                      MIG profiles on the host and then attach one or
                      more of them to each VM? This bug [1] reports
                      that once we attach one profile to a VM, the rest
                      of the MIG profiles become unavailable. From what
                      you have said about using SR-IOV and VFs, I guess
                      this should be possible.<br>
                    </p>
                  </div>
                </blockquote>
                <div><br>
                </div>
                <div>Correct, what you need is to first create the VFs
                  using sriov-manage, and then you can create the MIG
                  instances.</div>
                <div>Once you create the MIG instances using the
                  profiles you want, you will see that the
                  available_instances value for the related nvidia mdev
                  type (visible in sysfs) reports that you can have a
                  single vGPU for this profile.</div>
                <div>Then you can expose that mdev type to Nova via
                  nova.conf.</div>
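                <div><br>
                </div>
                <div>As an untested sketch of that sequence (the PCI
                  addresses, GPU index, MIG profile and mdev type name
                  are examples and will differ per system):</div>
                <pre>
# 1) create the SR-IOV VFs for the physical GPU
$ /usr/lib/nvidia/sriov-manage -e 0000:41:00.0

# 2) create a MIG GPU instance (plus compute instance) with the
#    profile you want, e.g. 1g.5gb on GPU 0
$ nvidia-smi mig -i 0 -cgi 1g.5gb -C

# 3) the matching mdev type on a VF now reports a single available vGPU
$ cat /sys/bus/pci/devices/0000:41:00.4/mdev_supported_types/nvidia-699/available_instances
1

# 4) nova.conf: enable that mdev type and map it to the VF addresses
[devices]
enabled_mdev_types = nvidia-699

[mdev_nvidia-699]
device_addresses = 0000:41:00.4,0000:41:00.5
                </pre>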
                <div><br>
                </div>
                <div>That being said, while the above is simple, the
                  talk below said more about how to use the GPU
                  correctly from the host side, so please wait :-)</div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p> </p>
                    <p>I think you are talking about the "vGPUs with
                      OpenStack Nova" talk on the OpenInfra stage. I will
                      look into it once the videos are online. <br>
                    </p>
                  </div>
                </blockquote>
                <div><br>
                </div>
                <div>Indeed.</div>
                <div>-S <br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p> </p>
                    <p>[1] <a href="https://bugs.launchpad.net/nova/+bug/2008883" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.launchpad.net/nova/+bug/2008883</a></p>
                    <p>Thanks</p>
                    <p>Regards</p>
                    <p>Mahendra<br>
                    </p>
                    <div>On 20/06/2023 15:47, Sylvain Bauza wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div dir="ltr"><br>
                        </div>
                        <br>
                        <div class="gmail_quote">
                          <div dir="ltr" class="gmail_attr">Le mar. 20
                            juin 2023 à 15:12, PAIPURI Mahendra <<a href="mailto:mahendra.paipuri@cnrs.fr" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">mahendra.paipuri@cnrs.fr</a>>
                            a écrit :<br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px
                            0.8ex;border-left:1px solid
                            rgb(204,204,204);padding-left:1ex">
                            <div>
                              <div id="m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                <p>Hello Ulrich,</p>
                                <p><br>
                                </p>
                                <p>I am relaunching this discussion as I
                                  noticed that you gave a talk about
                                  this topic at the OpenInfra Summit in
                                  Vancouver. Is it possible to share the
                                  presentation here? I hope the talks
                                  will be uploaded to YouTube soon. </p>
                                <p><br>
                                </p>
                                <p>We are mainly interested in using MIG
                                  instances in an OpenStack cloud, and I
                                  could not really find a lot of
                                  information by googling. If you could
                                  share your experiences, that would be
                                  great.</p>
                                <p><br>
                                </p>
                              </div>
                            </div>
                          </blockquote>
                          <div><br>
                          </div>
                          <div>Due to scheduling conflicts I wasn't
                            able to attend Ulrich's session, but I will
                            be listening closely to his feedback.</div>
                          <div><br>
                          </div>
                          <div>FWIW, there was also a short session
                            about how to enable MIG and play with Nova
                            at the OpenInfra stage (that one I was
                            able to attend), and it was quite
                            seamless. What exact information are you
                            looking for?</div>
                          <div>The idea with MIG is that you need to
                            create SR-IOV VFs using the sriov-manage
                            script provided by NVIDIA, so that the
                            mediated devices will use those VFs as the
                            base PCI devices for Nova; see the sketch
                            below.</div>
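                          <div><br>
                          </div>
                          <div>(As an illustrative check, with an example
                            PCI address, the VFs should show up as
                            additional PCI functions once sriov-manage
                            has run:)</div>
                          <pre>
# list the VFs created under the physical GPU
$ ls /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
# or list all NVIDIA PCI functions
$ lspci -nn -d 10de:
                          </pre>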
                          <div><br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px
                            0.8ex;border-left:1px solid
                            rgb(204,204,204);padding-left:1ex">
                            <div>
                              <div id="m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                <p> </p>
                                <p>Cheers.</p>
                                <p><br>
                                </p>
                                <p>Regards</p>
                                <p>Mahendra</p>
                              </div>
                              <hr style="display:inline-block;width:98%">
                              <div id="m_7191695452821608857m_2020284182605405898divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>De :</b> Ulrich
                                  Schwickerath <<a href="mailto:Ulrich.Schwickerath@cern.ch" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">Ulrich.Schwickerath@cern.ch</a>><br>
                                  <b>Envoyé :</b> lundi 16 janvier 2023
                                  11:38:08<br>
                                  <b>À :</b> <a href="mailto:openstack-discuss@lists.openstack.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">openstack-discuss@lists.openstack.org</a><br>
                                  <b>Objet :</b> Re: 答复: Experience with
                                  VGPUs</font>
                                <div> </div>
                              </div>
                              <div>
                                <p>Hi, all,</p>
                                <p>just to add to the discussion: at
                                  CERN we have recently deployed a bunch
                                  of A100 GPUs in PCI passthrough mode,
                                  and we are now looking into improving
                                  their usage with MIG. From the Nova
                                  point of view things seem to work OK:
                                  we can schedule VMs requesting a vGPU,
                                  and the client starts up and gets a
                                  license token from our NVIDIA license
                                  server (distributing license keys in
                                  our private cloud is relatively easy
                                  in our case). It's a PoC only for the
                                  time being, and we're not ready to put
                                  it forward yet, as we're facing issues
                                  with CUDA on the client (it fails
                                  immediately on memory operations with
                                  'not supported'; we're still
                                  investigating why this happens). <br>
                                </p>
                                <p>Once we get that working, it would be
                                  nice to have more fine-grained
                                  scheduling, so that people can ask for
                                  MIG devices of different sizes. The
                                  other challenge is how to set limits
                                  on GPU resources. Once the above
                                  issues have been sorted out, we may
                                  want to look into Cyborg as well, so
                                  we are quite interested in first
                                  experiences with it.</p>
                                <p>Kind regards, </p>
                                <p>Ulrich<br>
                                </p>
                                <div>On 13.01.23 21:06, Dmitriy
                                  Rabotyagov wrote:<br>
                                </div>
                                <blockquote type="cite">
                                  <div dir="auto">
                                    <div>That said, the deb/rpm packages
                                      they provide don't help much, as:
                                      <div dir="auto">* There is no repo
                                        for them, so you need to
                                        download them manually from the
                                        enterprise portal</div>
                                      <div dir="auto">* They can't be
                                        upgraded anyway, as the driver
                                        version is part of the package
                                        name, and each package conflicts
                                        with any other one. So you need
                                        to explicitly remove the old
                                        package and only then install
                                        the new one (sketched below).
                                        And yes, you must stop all VMs
                                        before upgrading the driver, and
                                        no, you can't live-migrate GPU
                                        mdev devices, as that is not
                                        implemented in QEMU. So
                                        deb/rpm/generic driver doesn't
                                        matter in the end, tbh.</div>
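                                      <div dir="auto">Illustratively,
                                        with made-up package names (the
                                        real ones differ, but the
                                        version is baked into the name
                                        just like this):</div>
                                      <pre>
# stop all VMs using the GPU first, then swap the host driver
$ apt-get remove -y nvidia-vgpu-ubuntu-525
$ apt-get install -y ./nvidia-vgpu-ubuntu-535_535.104.06_amd64.deb
                                      </pre>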
                                      <br>
                                      <br>
                                      <div class="gmail_quote">
                                        <div dir="ltr" class="gmail_attr">пт, 13 янв.
                                          2023 г., 20:56 Cedric <<a href="mailto:yipikai7@gmail.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">yipikai7@gmail.com</a>>:<br>
                                        </div>
                                        <blockquote class="gmail_quote" style="margin:0px 0px 0px
                                          0.8ex;border-left:1px solid
                                          rgb(204,204,204);padding-left:1ex">
                                          <div dir="auto"><br>
                                            Ended up with the very same
                                            conclusions than Dimitry
                                            regarding the use of Nvidia
                                            Vgrid for the VGPU use case
                                            with Nova, it works pretty
                                            well but:<br>
                                            <br>
                                            - respecting the licensing
                                            model as operationnal
                                            constraints, note that
                                            guests need to reach a
                                            license server in order to
                                            get a token (could be via
                                            the Nvidia SaaS service or
                                            on-prem)<br>
                                            - drivers for both guest and
                                            hypervisor are not easy to
                                            implement and maintain on
                                            large scale. A year ago,
                                            hypervisors drivers were not
                                            packaged to Debian/Ubuntu,
                                            but builded though a bash
                                            script, thus requiering
                                            additional automatisation
                                            work and careful attention
                                            regarding kernel
                                            update/reboot of Nova
                                            hypervisors.<br>
                                            <br>
                                            Cheers</div>
                                          <br>
                                          <br>
                                          On Fri, Jan 13, 2023 at 4:21
                                          PM Dmitriy Rabotyagov <<a href="mailto:noonedeadpunk@gmail.com" rel="noreferrer noreferrer
                                            noreferrer noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">noonedeadpunk@gmail.com</a>>
                                          wrote:<br>
                                          ><br>
                                          > You are saying that as if the Nvidia GRID drivers were open source, while<br>
                                          > in fact they're super far from being that. In order to download<br>
                                          > drivers not only for hypervisors, but also for guest VMs, you need to<br>
                                          > have an account in their Enterprise Portal. It took me roughly 6 weeks<br>
                                          > of discussions with hardware vendors and Nvidia support to get a<br>
                                          > proper account there. And that happened only after applying for their<br>
                                          > Partner Network (NPN).<br>
                                          > That still doesn't solve the issue of how to provide drivers to<br>
                                          > guests, except pre-building a series of images with these drivers<br>
                                          > pre-installed (we ended up making a DIB element for that [1]).<br>
                                          > Not to mention the need to distribute license tokens for guests and<br>
                                          > the whole mess with compatibility between hypervisor and guest drivers<br>
                                          > (as the guest driver can't be newer than the host one, and HVs can't be<br>
                                          > too new either).<br>
                                          ><br>
                                          > It's not that I'm defending AMD, but just saying that Nvidia is not<br>
                                          > that straightforward either, and at least on paper AMD vGPUs look<br>
                                          > easier both for operators and end-users.<br>
                                          ><br>
                                          > [1] <a href="https://github.com/citynetwork/dib-elements/tree/main/nvgrid" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/citynetwork/dib-elements/tree/main/nvgrid</a><br>
                                          ><br>
                                          > ><br>
                                          > > As for AMD cards, AMD stated that some of their MI-series cards support SR-IOV for vGPUs. However, those drivers are neither open source nor provided as closed source to the public; only large cloud providers are able to get them. So I don't really recommend getting AMD cards for vGPU unless you are able to get support from them.<br>
                                          > ><br>
                                          ><br>
                                        </blockquote>
                                      </div>
                                    </div>
                                  </div>
                                </blockquote>
                              </div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>