<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Hi, all,</p>
<p>Sylvain explained quite well how to do it technically. We have a
      PoC running, but we still have some stability issues, as
      mentioned at the summit. We're running the NVIDIA virtualisation
      drivers on the hypervisors and the guests, which requires a
      license from NVIDIA. In our configuration we are still quite
      limited, in the sense that we have to configure all cards in the
      same hypervisor in the same way, that is, with the same MIG
      partitioning. Also, it is not possible to attach more than one
      device to a single VM.<br>
    </p>
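    <p>To illustrate the shape of such a setup (the mdev type, flavor
      name and addresses below are illustrative placeholders, not our
      actual configuration):</p>
    <pre>
# nova.conf on the hypervisor: a single mdev type for all cards,
# matching the one MIG partitioning scheme used on that node
[devices]
enabled_mdev_types = nvidia-700

# flavor requesting exactly one vGPU (more than one device per VM
# is not possible in this setup)
$ openstack flavor set vgpu.small --property "resources:VGPU=1"
    </pre>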
<p>As mentioned in the presentation, we are a bit behind with Nova
      and in the process of catching up as we speak. Because of that we
      had to do a couple of backports in Nova to make it work, which we
      hope to get rid of through the ongoing upgrades.<br>
    </p>
<p>Let me see if I can make the slides available here.<br>
    </p>
    <p>Cheers, Ulrich<br>
    </p>
    <div class="moz-cite-prefix">On 20/06/2023 19:07, Oliver Weinmann
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:2DD18791-4BFD-4FF0-AAAC-77D8C18FB138@me.com">
      
      Hi everyone,
      <div><br>
      </div>
      <div>Jumping into this topic again. Unfortunately I haven't had
        time yet to test NVIDIA vGPU in OpenStack, only in VMware vSphere.
        What our users complain most about is the inflexibility, since
        you have to use the same profile on all VMs that use the GPU.
        One user suggested trying SLURM. I know there is no official
        OpenStack project for SLURM, but I wonder if anyone else has tried
        this approach? If I understood correctly, this would also not
        require any NVIDIA subscription, since you pass the GPU through to
        a single instance and use neither vGPU nor MIG; see the sketch below.</div>
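      <div><br>
      </div>
      <div>(For reference, a rough, untested sketch of what plain PCI
        passthrough looks like on the Nova side; the vendor/product IDs,
        alias and flavor names are examples for an A100, not a verified
        configuration:)</div>
      <pre>
# nova.conf: whitelist the physical GPU and give it an alias
# (on older releases the [pci] option is passthrough_whitelist)
[pci]
device_spec = { "vendor_id": "10de", "product_id": "20b0" }
alias = { "vendor_id": "10de", "product_id": "20b0", "device_type": "type-PCI", "name": "a100" }

# flavor that hands the whole GPU to a single instance
$ openstack flavor set gpu.pt --property "pci_passthrough:alias"="a100:1"
      </pre>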
      <div><br>
      </div>
      <div>Cheers,</div>
      <div>Oliver<br>
        <br>
        <div dir="ltr">Von meinem iPhone gesendet</div>
        <div dir="ltr"><br>
          <blockquote type="cite">Am 20.06.2023 um 17:34 schrieb Sylvain
            Bauza <a class="moz-txt-link-rfc2396E" href="mailto:sbauza@redhat.com"><sbauza@redhat.com></a>:<br>
            <br>
          </blockquote>
        </div>
        <blockquote type="cite">
          <div dir="ltr">
            <div dir="ltr">
              <div dir="ltr"><br>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">Le mar. 20 juin 2023
                  à 16:31, Mahendra Paipuri <<a href="mailto:mahendra.paipuri@cnrs.fr" moz-do-not-send="true" class="moz-txt-link-freetext">mahendra.paipuri@cnrs.fr</a>>
                  a écrit :<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p>Thanks Sylvain for the pointers.</p>
                    <p>One of the questions we have is: can we create
                      MIG profiles on the host and then attach one or
                      more of them to each VM? This bug [1] reports
                      that once we attach one profile to a VM, the rest
                      of the MIG profiles become unavailable. From what
                      you have said about using SR-IOV and VFs, I guess
                      this should be possible.<br>
                    </p>
                  </div>
                </blockquote>
                <div><br>
                </div>
                <div>Correct, what you need is to first create the VFs
                  using sriov-manage, and then you can create the MIG
                  instances.</div>
                <div>Once you create the MIG instances using the
                  profiles you want, you will see that the
                  available_instances value for the related nvidia mdev
                  type (visible in sysfs) reports that you can have a
                  single vGPU for this profile.</div>
                <div>Then you can expose that mdev type to Nova via
                  nova.conf.</div>
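                <div><br>
                </div>
                <div>As an untested sketch of that sequence (the PCI
                  addresses, GPU index, MIG profile and mdev type name
                  are examples and will differ per system):</div>
                <pre>
# 1) create the SR-IOV VFs for the physical GPU
$ /usr/lib/nvidia/sriov-manage -e 0000:41:00.0

# 2) create a MIG GPU instance (plus compute instance) with the
#    profile you want, e.g. 1g.5gb on GPU 0
$ nvidia-smi mig -i 0 -cgi 1g.5gb -C

# 3) the matching mdev type on a VF now reports a single available vGPU
$ cat /sys/bus/pci/devices/0000:41:00.4/mdev_supported_types/nvidia-699/available_instances
1

# 4) nova.conf: enable that mdev type and map it to the VF addresses
[devices]
enabled_mdev_types = nvidia-699

[mdev_nvidia-699]
device_addresses = 0000:41:00.4,0000:41:00.5
                </pre>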
                <div><br>
                </div>
                <div>That being said, while the above is simple, the
                  talk below said more about how to use the GPU
                  correctly from the host side, so please wait :-)</div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p> </p>
                    <p>I think you are talking about the "vGPUs with
                      OpenStack Nova" talk on the OpenInfra stage. I will
                      look into it once the videos are online. <br>
                    </p>
                  </div>
                </blockquote>
                <div><br>
                </div>
                <div>Indeed.</div>
                <div>-S <br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p> </p>
                    <p>[1] <a href="https://bugs.launchpad.net/nova/+bug/2008883" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.launchpad.net/nova/+bug/2008883</a></p>
                    <p>Thanks</p>
                    <p>Regards</p>
                    <p>Mahendra<br>
                    </p>
                    <div>On 20/06/2023 15:47, Sylvain Bauza wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div dir="ltr"><br>
                        </div>
                        <br>
                        <div class="gmail_quote">
                          <div dir="ltr" class="gmail_attr">Le mar. 20
                            juin 2023 à 15:12, PAIPURI Mahendra <<a href="mailto:mahendra.paipuri@cnrs.fr" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">mahendra.paipuri@cnrs.fr</a>>
                            a écrit :<br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px
                            0.8ex;border-left:1px solid
                            rgb(204,204,204);padding-left:1ex">
                            <div>
                              <div id="m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                <p>Hello Ulrich,</p>
                                <p><br>
                                </p>
                                <p>I am relaunching this discussion as I
                                  noticed that you gave a talk about
                                  this topic at the OpenInfra Summit in
                                  Vancouver. Is it possible to share the
                                  presentation here? I hope the talks
                                  will be uploaded to YouTube soon. </p>
                                <p><br>
                                </p>
                                <p>We are mainly interested in using MIG
                                  instances in an OpenStack cloud, and I
                                  could not really find a lot of
                                  information by googling. If you could
                                  share your experiences, that would be
                                  great.</p>
                                <p><br>
                                </p>
                              </div>
                            </div>
                          </blockquote>
                          <div><br>
                          </div>
                          <div>Due to scheduling conflicts I wasn't
                            able to attend Ulrich's session, but I will
                            be listening closely to his feedback.</div>
                          <div><br>
                          </div>
                          <div>FWIW, there was also a short session
                            about how to enable MIG and play with Nova
                            at the OpenInfra stage (that one I was
                            able to attend), and it was quite
                            seamless. What exact information are you
                            looking for?</div>
                          <div>The idea with MIG is that you need to
                            create SR-IOV VFs using the sriov-manage
                            script provided by NVIDIA, so that the
                            mediated devices will use those VFs as the
                            base PCI devices for Nova; see the sketch
                            below.</div>
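                          <div><br>
                          </div>
                          <div>(As an illustrative check, with an example
                            PCI address, the VFs should show up as
                            additional PCI functions once sriov-manage
                            has run:)</div>
                          <pre>
# list the VFs created under the physical GPU
$ ls /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
# or list all NVIDIA PCI functions
$ lspci -nn -d 10de:
                          </pre>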
                          <div><br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px
                            0.8ex;border-left:1px solid
                            rgb(204,204,204);padding-left:1ex">
                            <div>
                              <div id="m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                <p> </p>
                                <p>Cheers.</p>
                                <p><br>
                                </p>
                                <p>Regards</p>
                                <p>Mahendra</p>
                              </div>
                              <hr style="display:inline-block;width:98%">
                              <div id="m_7191695452821608857m_2020284182605405898divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>De :</b> Ulrich
                                  Schwickerath <<a href="mailto:Ulrich.Schwickerath@cern.ch" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">Ulrich.Schwickerath@cern.ch</a>><br>
                                  <b>Envoyé :</b> lundi 16 janvier 2023
                                  11:38:08<br>
                                  <b>À :</b> <a href="mailto:openstack-discuss@lists.openstack.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">openstack-discuss@lists.openstack.org</a><br>
                                  <b>Objet :</b> Re: 答复: Experience with
                                  VGPUs</font>
                                <div> </div>
                              </div>
                              <div>
                                <p>Hi, all,</p>
                                <p>just to add to the discussion: at
                                  CERN we have recently deployed a bunch
                                  of A100 GPUs in PCI passthrough mode,
                                  and we are now looking into improving
                                  their usage with MIG. From the Nova
                                  point of view things seem to work OK:
                                  we can schedule VMs requesting a vGPU,
                                  and the client starts up and gets a
                                  license token from our NVIDIA license
                                  server (distributing license keys in
                                  our private cloud is relatively easy
                                  in our case). It's a PoC only for the
                                  time being, and we're not ready to put
                                  it forward yet, as we're facing issues
                                  with CUDA on the client (it fails
                                  immediately on memory operations with
                                  'not supported'; we're still
                                  investigating why this happens). <br>
                                </p>
                                <p>Once we get that working, it would be
                                  nice to have more fine-grained
                                  scheduling, so that people can ask for
                                  MIG devices of different sizes. The
                                  other challenge is how to set limits
                                  on GPU resources. Once the above
                                  issues have been sorted out, we may
                                  want to look into Cyborg as well, so
                                  we are quite interested in first
                                  experiences with it.</p>
                                <p>Kind regards, </p>
                                <p>Ulrich<br>
                                </p>
                                <div>On 13.01.23 21:06, Dmitriy
                                  Rabotyagov wrote:<br>
                                </div>
                                <blockquote type="cite">
                                  <div dir="auto">
                                    <div>That said, the deb/rpm packages
                                      they provide don't help much, as:
                                      <div dir="auto">* There is no repo
                                        for them, so you need to
                                        download them manually from the
                                        enterprise portal</div>
                                      <div dir="auto">* They can't be
                                        upgraded anyway, as the driver
                                        version is part of the package
                                        name, and each package conflicts
                                        with any other one. So you need
                                        to explicitly remove the old
                                        package and only then install
                                        the new one (sketched below).
                                        And yes, you must stop all VMs
                                        before upgrading the driver, and
                                        no, you can't live-migrate GPU
                                        mdev devices, as that is not
                                        implemented in QEMU. So
                                        deb/rpm/generic driver doesn't
                                        matter in the end, tbh.</div>
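                                      <div dir="auto">Illustratively,
                                        with made-up package names (the
                                        real ones differ, but the
                                        version is baked into the name
                                        just like this):</div>
                                      <pre>
# stop all VMs using the GPU first, then swap the host driver
$ apt-get remove -y nvidia-vgpu-ubuntu-525
$ apt-get install -y ./nvidia-vgpu-ubuntu-535_535.104.06_amd64.deb
                                      </pre>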
                                      <br>
                                      <br>
                                      <div class="gmail_quote">
                                        <div dir="ltr" class="gmail_attr">пт, 13 янв.
                                          2023 г., 20:56 Cedric <<a href="mailto:yipikai7@gmail.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">yipikai7@gmail.com</a>>:<br>
                                        </div>
                                        <blockquote class="gmail_quote" style="margin:0px 0px 0px
                                          0.8ex;border-left:1px solid
                                          rgb(204,204,204);padding-left:1ex">
                                          <div dir="auto"><br>
                                            Ended up with the very same
                                            conclusions than Dimitry
                                            regarding the use of Nvidia
                                            Vgrid for the VGPU use case
                                            with Nova, it works pretty
                                            well but:<br>
                                            <br>
                                            - respecting the licensing
                                            model as operationnal
                                            constraints, note that
                                            guests need to reach a
                                            license server in order to
                                            get a token (could be via
                                            the Nvidia SaaS service or
                                            on-prem)<br>
                                            - drivers for both guest and
                                            hypervisor are not easy to
                                            implement and maintain on
                                            large scale. A year ago,
                                            hypervisors drivers were not
                                            packaged to Debian/Ubuntu,
                                            but builded though a bash
                                            script, thus requiering
                                            additional automatisation
                                            work and careful attention
                                            regarding kernel
                                            update/reboot of Nova
                                            hypervisors.<br>
                                            <br>
                                            Cheers</div>
                                          <br>
                                          <br>
                                          On Fri, Jan 13, 2023 at 4:21
                                          PM Dmitriy Rabotyagov <<a href="mailto:noonedeadpunk@gmail.com" rel="noreferrer noreferrer
                                            noreferrer noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">noonedeadpunk@gmail.com</a>>
                                          wrote:<br>
                                          ><br>
                                          > You are saying that as if the Nvidia GRID drivers were open source, while<br>
                                          > in fact they're super far from being that. In order to download<br>
                                          > drivers not only for hypervisors, but also for guest VMs, you need to<br>
                                          > have an account in their Enterprise Portal. It took me roughly 6 weeks<br>
                                          > of discussions with hardware vendors and Nvidia support to get a<br>
                                          > proper account there. And that happened only after applying for their<br>
                                          > Partner Network (NPN).<br>
                                          > That still doesn't solve the issue of how to provide drivers to<br>
                                          > guests, except pre-building a series of images with these drivers<br>
                                          > pre-installed (we ended up making a DIB element for that [1]).<br>
                                          > Not to mention the need to distribute license tokens for guests and<br>
                                          > the whole mess with compatibility between hypervisor and guest drivers<br>
                                          > (as the guest driver can't be newer than the host one, and HVs can't be<br>
                                          > too new either).<br>
                                          ><br>
                                          > It's not that I'm defending AMD, but just saying that Nvidia is not<br>
                                          > that straightforward either, and at least on paper AMD vGPUs look<br>
                                          > easier both for operators and end-users.<br>
                                          ><br>
                                          > [1] <a href="https://github.com/citynetwork/dib-elements/tree/main/nvgrid" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/citynetwork/dib-elements/tree/main/nvgrid</a><br>
                                          ><br>
                                          > ><br>
                                          > > As for AMD cards, AMD stated that some of their MI-series cards support SR-IOV for vGPUs. However, those drivers are neither open source nor provided as closed source to the public; only large cloud providers are able to get them. So I don't really recommend getting AMD cards for vGPU unless you are able to get support from them.<br>
                                          > ><br>
                                          ><br>
                                        </blockquote>
                                      </div>
                                    </div>
                                  </div>
                                </blockquote>
                              </div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>