<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 22, 2023 at 10:43, Mahendra Paipuri <<a href="mailto:mahendra.paipuri@cnrs.fr">mahendra.paipuri@cnrs.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p>Hello all,</p>
    <p>Thanks @Ulrich for sharing the presentation. Very informative!!</p>
<p>One question: if I understood correctly, <b>time-sliced</b> vGPUs
      <b>absolutely need</b> GRID drivers and licensed clients for the
      vGPUs to work in the guests, while for MIG partitioning there is <b>no
        need</b> to install GRID drivers in the guests and also <b>no
        need</b> for licensed clients. Could you confirm whether this is
      actually the case?</p></div></blockquote><div><br></div><div>Again, I'm not part of NVIDIA, nor am I paid by them, but you can look at their GRID licensing here:</div><div><a href="https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html">https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html</a></div><div><br></div><div>If you also look at the NVIDIA docs for RHEL support, you need a vCS (Virtual Compute Server) licence for Ampere MIG profiles like the C-series:</div><div><a href="https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-red-hat-el-kvm/index.html#hardware-configuration">https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-red-hat-el-kvm/index.html#hardware-configuration</a></div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
    <p>Cheers.<br>
    </p>
    <p>Regards</p>
    <p>Mahendra<br>
    </p>
    <div>On 21/06/2023 16:10, Ulrich
      Schwickerath wrote:<br>
    </div>
    <blockquote type="cite">
      
      <p>Hi, all,</p>
<p>Sylvain explained quite well how to do it technically. We have
        a PoC running; however, we still have some stability issues, as
        mentioned at the summit. We're running the NVIDIA virtualisation
        drivers on the hypervisors and the guests, which requires a
        license from NVIDIA. In our configuration we are still quite
        limited, in the sense that all cards in the same hypervisor
        have to be configured in the same way, that is, with the same MIG
        partitioning. Also, it is not possible to attach more than one
        device to a single VM.<br>
      </p>
<p>As mentioned in the presentation we are a bit behind with Nova,
        and are fixing this as we speak. Because of that
        we had to do a couple of backports in Nova to make it work,
        which we hope to get rid of through the ongoing upgrades.<br>
      </p>
      <p>Let me  see if I can make the slides available here. <br>
      </p>
      <p>Cheers, Ulrich<br>
      </p>
      <div>On 20/06/2023 19:07, Oliver Weinmann
        wrote:<br>
      </div>
      <blockquote type="cite"> Hi
        everyone,
        <div><br>
        </div>
<div>Jumping into this topic again. Unfortunately I haven’t had
          time yet to test NVIDIA vGPU in OpenStack, only in VMware
          vSphere. What our users complain most about is the
          inflexibility, since you have to use the same profile on all
          VMs that use the GPU. One user suggested trying SLURM. I know
          there is no official OpenStack project for SLURM, but I wonder
          if anyone else has tried this approach? If I understood correctly,
          this would also not require any NVIDIA subscription, since you
          pass the GPU through to a single instance and use neither
          vGPU nor MIG.</div>
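As a rough sketch of this passthrough approach (whole GPU, no vGPU/MIG, so no GRID licensing on the host), assuming an A100 whose PCI IDs are 10de:20b0; the alias name and flavor are made up here, and on older Nova releases the [pci] option is called passthrough_whitelist rather than device_spec:

```shell
# Sketch: whole-GPU PCI passthrough in Nova. Vendor/product IDs, alias
# and flavor names are illustrative. The [pci] alias must also be set on
# the controller nodes, and PciPassthroughFilter must be enabled in the
# scheduler's enabled_filters.
cat >> /etc/nova/nova.conf <<'EOF'
[pci]
device_spec = { "vendor_id": "10de", "product_id": "20b0" }
alias = { "vendor_id": "10de", "product_id": "20b0", "device_type": "type-PCI", "name": "a100" }
EOF

# A flavor that requests one full GPU through the alias
openstack flavor create --ram 65536 --vcpus 16 --disk 100 gpu.a100
openstack flavor set gpu.a100 --property "pci_passthrough:alias"="a100:1"
```

Instances booted with this flavor get the whole device, and the guest then needs only the regular (non-GRID) datacenter driver.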
        <div><br>
        </div>
        <div>Cheers,</div>
        <div>Oliver<br>
          <br>
<div dir="ltr">Sent from my iPhone</div>
          <div dir="ltr"><br>
<blockquote type="cite">On Jun 20, 2023, at 17:34, Sylvain Bauza <a href="mailto:sbauza@redhat.com" target="_blank"><sbauza@redhat.com></a> wrote:<br>
              <br>
            </blockquote>
          </div>
          <blockquote type="cite">
            <div dir="ltr">
              <div dir="ltr">
                <div dir="ltr"><br>
                </div>
                <br>
                <div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Jun 20, 2023 at 16:31, Mahendra Paipuri <<a href="mailto:mahendra.paipuri@cnrs.fr" target="_blank">mahendra.paipuri@cnrs.fr</a>> wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div>
                      <p>Thanks Sylvain for the pointers.</p>
<p>One of the questions we have is: can we create
                        MIG profiles on the host and then attach one or
                        more profiles to each VM? This bug [1]
                        reports that once we attach one profile to a VM,
                        the rest of the MIG profiles become unavailable. From
                        what you have said about using SR-IOV and VFs, I
                        guess this should be possible.<br>
                      </p>
                    </div>
                  </blockquote>
                  <div><br>
                  </div>
<div>Correct, what you need is to first create the VFs
                    using sriov-manage, and then you can create the MIG
                    instances.</div>
                  <div>Once you create the MIG instances using the
                    profiles you want, you will see that the related
                    available_instances for the NVIDIA mdev type (by
                    looking at sysfs) says that you can have a
                    single vGPU for this profile.</div>
                  <div>Then, you can use that mdev type with Nova via
                    nova.conf.</div>
                  <div><br>
                  </div>
                  <div>That being said, while the above is simple, the
                    talk below said more about how to correctly use
                    the GPU from the host, so please wait :-)</div>
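For reference, the flow described above can be sketched roughly as below; the PCI addresses, MIG profile (1g.5gb) and mdev type name (nvidia-699) are illustrative and vary by driver release and GPU model:

```shell
# Sketch of the MIG + SR-IOV + mdev flow described above, assuming an
# A100 with the NVIDIA vGPU host driver installed; all addresses and
# type names below are illustrative.

# 1. Create the SR-IOV virtual functions for the physical GPU
/usr/lib/nvidia/sriov-manage -e 0000:41:00.0

# 2. Enable MIG mode and create a MIG instance with the desired profile
nvidia-smi -i 0 -mig 1
nvidia-smi mig -i 0 -cgi 1g.5gb -C

# 3. Check the mdev types exposed through one of the VFs; once the MIG
#    instance exists, available_instances for the matching profile is 1
cat /sys/bus/pci/devices/0000:41:00.4/mdev_supported_types/nvidia-699/available_instances

# 4. Point Nova at that mdev type in nova.conf on the compute node
cat >> /etc/nova/nova.conf <<'EOF'
[devices]
enabled_mdev_types = nvidia-699

[mdev_nvidia-699]
device_addresses = 0000:41:00.4
EOF
```

After restarting nova-compute, the vGPU should appear as a VGPU inventory on the compute node's resource provider in Placement.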
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div>
                      <p> </p>
<p>I think you are talking about the "vGPUs with
                        OpenStack Nova" talk on the OpenInfra stage. I will
                        look into it once the videos are online. <br>
                      </p>
                    </div>
                  </blockquote>
                  <div><br>
                  </div>
                  <div>Indeed.</div>
                  <div>-S <br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div>
                      <p> </p>
                      <p>[1] <a href="https://bugs.launchpad.net/nova/+bug/2008883" target="_blank">https://bugs.launchpad.net/nova/+bug/2008883</a></p>
                      <p>Thanks</p>
                      <p>Regards</p>
                      <p>Mahendra<br>
                      </p>
                      <div>On 20/06/2023 15:47, Sylvain Bauza wrote:<br>
                      </div>
                      <blockquote type="cite">
                        <div dir="ltr">
                          <div dir="ltr"><br>
                          </div>
                          <br>
                          <div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Jun 20, 2023 at 15:12, PAIPURI Mahendra <<a href="mailto:mahendra.paipuri@cnrs.fr" target="_blank">mahendra.paipuri@cnrs.fr</a>> wrote:<br>
                            </div>
                            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                              <div>
                                <div id="m_4751079946466621058m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                  <p>Hello Ulrich,</p>
                                  <p><br>
                                  </p>
<p>I am relaunching this discussion as
                                    I noticed that you gave a talk about
                                    this topic at the OpenInfra Summit in
                                    Vancouver. Is it possible to share
                                    the presentation here? I hope the
                                    talks will be uploaded to YouTube
                                    soon. </p>
                                  <p><br>
                                  </p>
<p>We are mainly interested in using
                                    MIG instances in an OpenStack cloud,
                                    and I could not really find much
                                    information by googling. If you
                                    could share your experiences, that
                                    would be great.</p>
                                  <p><br>
                                  </p>
                                </div>
                              </div>
                            </blockquote>
                            <div><br>
                            </div>
<div>Due to scheduling conflicts I wasn't
                              able to attend Ulrich's session, but I'm
                              very interested in hearing his
                              feedback.</div>
                            <div><br>
                            </div>
<div>FWIW, there was also a short session
                              about how to enable MIG and play with Nova
                              at the OpenInfra stage (that one I was
                              able to attend), and it was quite
                              seamless. What exact information are you
                              looking for?</div>
                            <div>The idea with MIG is that you need to
                              create SR-IOV VFs above the MIG instances
                              using the sriov-manage script provided by
                              NVIDIA, so that the mediated devices will
                              use those VFs as the base PCI devices
                              for Nova.</div>
                            <div><br>
                            </div>
                            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                              <div>
                                <div id="m_4751079946466621058m_7191695452821608857m_2020284182605405898divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif" dir="ltr">
                                  <p> </p>
                                  <p>Cheers.</p>
                                  <p><br>
                                  </p>
                                  <p>Regards</p>
                                  <p>Mahendra</p>
                                </div>
                                <hr style="display:inline-block;width:98%">
<div id="m_4751079946466621058m_7191695452821608857m_2020284182605405898divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Ulrich
                                    Schwickerath <<a href="mailto:Ulrich.Schwickerath@cern.ch" target="_blank">Ulrich.Schwickerath@cern.ch</a>><br>
                                    <b>Sent:</b> Monday, January 16,
                                    2023 11:38:08<br>
                                    <b>To:</b> <a href="mailto:openstack-discuss@lists.openstack.org" target="_blank">openstack-discuss@lists.openstack.org</a><br>
                                    <b>Subject:</b> Re: Re: Experience
                                    with VGPUs</font>
                                  <div> </div>
                                </div>
                                <div>
                                  <p>Hi, all,</p>
<p>just to add to the discussion: at
                                    CERN we have recently deployed a
                                    bunch of A100 GPUs in PCI
                                    passthrough mode, and are now
                                    looking into improving their usage
                                    by using MIG. From the Nova point of
                                    view things seem to work OK: we can
                                    schedule VMs requesting a vGPU, the
                                    client starts up and gets a license
                                    token from our NVIDIA license server
                                    (distributing license keys in our
                                    private cloud is relatively easy in
                                    our case). It's a PoC only for the
                                    time being, and we're not ready to
                                    put that forward, as we're facing
                                    issues with CUDA on the client (it
                                    fails immediately in memory
                                    operations with 'not supported';
                                    we're still investigating why this
                                    happens). <br>
                                  </p>
<p>Once we get that working, it would
                                    be nice to have more fine-grained
                                    scheduling so that people can ask
                                    for MIG devices of different sizes.
                                    The other challenge is how to set
                                    limits on GPU resources. Once the
                                    above issues have been sorted out we
                                    may want to look into Cyborg as
                                    well, so we are quite interested in
                                    first experiences with it.</p>
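On the scheduling side, requesting a vGPU already goes through Placement via a flavor; a minimal sketch (flavor and resource-class names are made up here), with the per-mdev-type mdev_class option in nova.conf being one way to expose MIG profiles of different sizes as distinct resource classes:

```shell
# Sketch: a flavor requesting one vGPU through the Placement VGPU
# resource class; the flavor name is illustrative.
openstack flavor create --ram 16384 --vcpus 8 --disk 50 gpu.small
openstack flavor set gpu.small --property "resources:VGPU=1"

# To let users ask for MIG devices of different sizes, each mdev type
# can be mapped to its own custom resource class in nova.conf, e.g.:
#   [mdev_nvidia-699]
#   mdev_class = CUSTOM_VGPU_1G_5GB
# and then requested with resources:CUSTOM_VGPU_1G_5GB=1 on the flavor.
```

Quota limits on those resource classes would then be a matter of unified limits / Placement rather than per-flavor tricks.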
                                  <p>Kind regards, </p>
                                  <p>Ulrich<br>
                                  </p>
                                  <div>On 13.01.23 21:06, Dmitriy
                                    Rabotyagov wrote:<br>
                                  </div>
                                  <blockquote type="cite">
                                    <div dir="auto">
<div>That said, the deb/rpm
                                        packages they provide
                                        don't help much, as:
                                        <div dir="auto">* There is no
                                          repo for them, so you need to
                                          download them manually from the
                                          enterprise portal</div>
                                        <div dir="auto">* They can't be
                                          upgraded anyway, as the driver
                                          version is part of the package
                                          name, and each package
                                          conflicts with every other
                                          one. So you need to explicitly
                                          remove the old package and only
                                          then install the new one. And yes,
                                          you must stop all VMs before
                                          upgrading the driver, and no, you
                                          can't live migrate GPU mdev
                                          devices, as that is not yet
                                          implemented in QEMU. So
                                          deb/rpm/generic driver doesn't
                                          matter in the end, tbh.</div>
                                        <br>
                                        <br>
                                        <div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jan 13, 2023 at 20:56, Cedric
                                            <<a href="mailto:yipikai7@gmail.com" target="_blank">yipikai7@gmail.com</a>> wrote:<br>
                                          </div>
                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="auto"><br>
                                              Ended up with the very
                                              same conclusions as
                                              Dmitriy regarding the use
                                              of NVIDIA GRID for the
                                              vGPU use case with Nova:
                                              it works pretty well, but:<br>
                                              <br>
                                              - respecting the licensing
                                              model is an operational
                                              constraint; note that
                                              guests need to reach a
                                              license server in order to
                                              get a token (either via
                                              the NVIDIA SaaS service or
                                              on-prem)<br>
                                              - drivers for both guest
                                              and hypervisor are not
                                              easy to deploy and
                                              maintain at large scale. A
                                              year ago, hypervisor
                                              drivers were not packaged
                                              for Debian/Ubuntu, but
                                              built through a bash
                                              script, thus requiring
                                              additional automation
                                              work and careful attention
                                              regarding kernel
                                              updates/reboots of Nova
                                              hypervisors.<br>
                                              <br>
                                              Cheers</div>
                                            <br>
                                            <br>
                                            On Fri, Jan 13, 2023 at 4:21
                                            PM Dmitriy Rabotyagov <<a href="mailto:noonedeadpunk@gmail.com" rel="noreferrer noreferrer
                                              noreferrer noreferrer" target="_blank">noonedeadpunk@gmail.com</a>>
                                            wrote:<br>
                                            ><br>
> You are saying that as if
                                            Nvidia GRID drivers were
                                            open-sourced, while<br>
                                            > in fact they're super
                                            far from being that. In
                                            order to download<br>
                                            > drivers not only for
                                            hypervisors, but also for
                                            guest VMs, you need to<br>
                                            > have an account in
                                            their Enterprise Portal. It
                                            took me roughly 6 weeks<br>
                                            > of discussions with
                                            hardware vendors and Nvidia
                                            support to get a<br>
                                            > proper account there.
                                            And that happened only after
                                            applying for their<br>
                                            > Partner Network (NPN).<br>
                                            > That still doesn't
                                            solve the issue of how to
                                            provide drivers to<br>
                                            > guests, except
                                            pre-building a series of images
                                            with these drivers<br>
                                            > pre-installed (we ended
                                            up making a DIB element
                                            for that [1]).<br>
                                            > Not to mention the
                                            need to distribute license
                                            tokens for guests and<br>
                                            > the whole mess with
                                            compatibility between
                                            hypervisor and guest drivers<br>
                                            > (as the guest driver can't
                                            be newer than the host one, and
                                            HVs can't be too<br>
                                            > new either).<br>
                                            ><br>
                                            > It's not that I'm
                                            protecting AMD, but just
                                            saying that Nvidia is not<br>
                                            > that straightforward
                                            either, and at least on
                                            paper AMD vGPUs look<br>
                                            > easier both for
                                            operators and end-users.<br>
                                            ><br>
                                            > [1] <a href="https://github.com/citynetwork/dib-elements/tree/main/nvgrid" rel="noreferrer noreferrer
                                              noreferrer noreferrer
                                              noreferrer" target="_blank">
https://github.com/citynetwork/dib-elements/tree/main/nvgrid</a><br>
                                            ><br>
                                            > ><br>
> > As for AMD cards,
                                            AMD stated that some of
                                            their MI-series cards
                                            support SR-IOV for vGPUs.
                                            However, those drivers are
                                            never open source, nor
                                            provided in closed-source
                                            form to the public; only
                                            large cloud providers are
                                            able to get them. So I don't
                                            really recommend getting AMD
                                            cards for vGPU unless you are
                                            able to get support from them.<br>
                                            > ><br>
                                            ><br>
                                          </blockquote>
                                        </div>
                                      </div>
                                    </div>
                                  </blockquote>
                                </div>
                              </div>
                            </blockquote>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                  </blockquote>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
      </blockquote>
    </blockquote>
  </div>

</blockquote></div></div>