Experience with VGPUs (Tobias Urdin)

Karl Kloppenborg kkloppenborg at rwts.com.au
Tue Jan 17 10:18:16 UTC 2023


Hi Tobias,

I saw your message; that's an interesting method to work around the transient mdev issue.
Have you looked into implementing Cyborg as a way to alleviate this? We are currently assessing it for a different project using NVIDIA A40s.

Would be keen to swap war stories and see if we can come up with a better solution than the current vGPU mdev support.


Kind Regards,
Karl.
--
Karl Kloppenborg, Systems Engineering (BCompSc, CNCF-[KCNA, CKA, CKAD], LFCE, CompTIA Linux+ XK0-004)
Real World Technology Solutions - IT People you can trust
Voice | Data | IT Procurement | Managed IT
rwts.com.au<http://rwts.com.au> | 1300 798 718

Real World is a DellEMC Gold Partner



From: openstack-discuss-request at lists.openstack.org <openstack-discuss-request at lists.openstack.org>
Date: Tuesday, 17 January 2023 at 9:06 pm
To: openstack-discuss at lists.openstack.org <openstack-discuss at lists.openstack.org>
Subject: openstack-discuss Digest, Vol 51, Issue 51
Send openstack-discuss mailing list submissions to
        openstack-discuss at lists.openstack.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-discuss

or, via email, send a message with subject or body 'help' to
        openstack-discuss-request at lists.openstack.org

You can reach the person managing the list at
        openstack-discuss-owner at lists.openstack.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of openstack-discuss digest..."


Today's Topics:

   1. Re: Enable fstrim automatically on cinder thin lvm
      provisioning (Rajat Dhasmana)
   2. Re: Experience with VGPUs (Tobias Urdin)
   3. Re: [designate] Proposal to deprecate the agent framework and
      agent based backends (Thomas Goirand)
   4. Re: Experience with VGPUs (Sylvain Bauza)


----------------------------------------------------------------------

Message: 1
Date: Tue, 17 Jan 2023 10:11:27 +0530
From: Rajat Dhasmana <rdhasman at redhat.com>
To: A Monster <amonster369 at gmail.com>
Cc: openstack-discuss <openstack-discuss at lists.openstack.org>
Subject: Re: Enable fstrim automatically on cinder thin lvm
        provisioning
Message-ID:
        <CAARK8KQ3rM9KSQ9vFP+CAVPL_xAksxfXvFuvZj8gd7FeFT6LTw at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

We have a config option, 'report_discard_supported' [1], which can be added to
cinder.conf to enable trim/unmap support.
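For illustration, a minimal sketch of where that option lives (the backend section name `lvm-1` is an example; use whatever backend section your cinder.conf already defines):

```ini
[lvm-1]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = cinder-volumes
lvm_type = thin
# Advertise discard/trim support so guests can pass trims
# down to the thin pool.
report_discard_supported = true
```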

Also, I would like to suggest not creating new openstack-discuss threads for
the same issue, and instead reusing the first one created.
As far as I can see, these are the 3 threads for the same issue [2][3][4].

[1]
https://docs.openstack.org/cinder/latest/configuration/block-storage/config-options.html
[2]
https://lists.openstack.org/pipermail/openstack-discuss/2023-January/031789.html
[3]
https://lists.openstack.org/pipermail/openstack-discuss/2023-January/031797.html
[4]
https://lists.openstack.org/pipermail/openstack-discuss/2023-January/031805.html

Thanks
Rajat Dhasmana

On Tue, Jan 17, 2023 at 8:57 AM A Monster <amonster369 at gmail.com> wrote:

> I deployed OpenStack using Kolla Ansible and used LVM as the storage backend
> for my Cinder service. However, I noticed that the LVM thin pool usage keeps
> increasing even though the space used by instance volumes stays the same.
> After a bit of investigating I found out that I had to enable fstrim,
> because data deleted inside the logical volumes was still allocated
> from the thin pool's perspective, and I had to run fstrim on those volumes.
>
> How can I enable this automatically in OpenStack?
>
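For completeness, once the Cinder side advertises discard support, the guest-side piece is typically just the periodic trim timer shipped with util-linux (a sketch; unit names and defaults vary by distro):

```shell
# Inside the guest: enable the weekly fstrim timer from util-linux
systemctl enable --now fstrim.timer

# Or trim a mounted filesystem once, verbosely
fstrim -v /
```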

------------------------------

Message: 2
Date: Tue, 17 Jan 2023 08:54:03 +0000
From: Tobias Urdin <tobias.urdin at binero.com>
To: openstack-discuss <openstack-discuss at lists.openstack.org>
Subject: Re: Experience with VGPUs
Message-ID: <220CE3FB-C139-492E-ADD1-BC1ECBEAE65E at binero.com>
Content-Type: text/plain; charset="utf-8"

Hello,

We are using vGPUs with Nova on the OpenStack Xena release and we've had a fairly good experience integrating
NVIDIA A10 GPUs into our cloud.

As we see it, there are some pain points that just come with maintaining the GPU feature.

- There is a very tight coupling of the NVIDIA driver in the guest (instance) and on the compute node that needs to
  be managed.

- Doing maintenance needs more planning, e.g. powering off instances; the NVIDIA driver on the compute node needs to be
  rebuilt on the hypervisor if the kernel is upgraded, unless you've implemented DKMS for that.

- Because we have different flavors of GPU (we split the A10 cards into different flavors for maximum utilization of
  other compute resources) we added custom traits in the Placement service to handle that, managing it with
  a script, since doing anything manually related to GPUs gets confusing quickly. [1]
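As a sketch of that approach (the trait name here is hypothetical), custom traits can be managed with the osc-placement CLI:

```shell
# Create a custom trait for a given vGPU profile (name is an example)
openstack --os-placement-api-version 1.6 trait create CUSTOM_NVIDIA_A10_12Q

# Attach it to the resource provider that exposes the vGPU inventory
openstack resource provider trait set \
  --trait CUSTOM_NVIDIA_A10_12Q <resource-provider-uuid>
```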

- Since Nova does not handle recreation of mdevs (nor uses the new libvirt autostart feature for mdevs) we have
  a systemd unit that executes before the nova-compute service, walks all the libvirt domains, and does lookups
  in Placement to recreate the mdevs before nova-compute starts. [2] [3] [4]
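The ordering part of such a unit can be sketched like this (the unit and script names are hypothetical; the actual recreation logic is in the pastes referenced above):

```ini
# /etc/systemd/system/recreate-mdevs.service (sketch)
[Unit]
Description=Recreate mdevs for existing instances before nova-compute
After=libvirtd.service
Before=nova-compute.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical script: walks libvirt domains and recreates mdevs
# from Placement lookups.
ExecStart=/usr/local/bin/recreate-mdevs.sh

[Install]
WantedBy=multi-user.target
```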

Best regards
Tobias

DISCLAIMER: The pastes below are provided without any warranty of actually working for you or your setup; they do
very specific things that we need and are shared only to give you some insight and help. Use at your own risk.

[1] https://paste.opendev.org/show/b6FdfwDHnyJXR0G3XarE/
[2] https://paste.opendev.org/show/bGtO6aIE519uysvytWv0/
[3] https://paste.opendev.org/show/bftOEIPxlpLptkosxlL6/
[4] https://paste.opendev.org/show/bOYBV6lhRON4ntQKYPkb/

------------------------------

Message: 3
Date: Tue, 17 Jan 2023 10:11:44 +0100
From: Thomas Goirand <zigo at debian.org>
To: openstack-discuss <OpenStack-discuss at lists.openstack.org>
Subject: Re: [designate] Proposal to deprecate the agent framework and
        agent based backends
Message-ID: <46a43b97-063d-ed46-6dc1-94f7e0d12e5e at debian.org>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 1/17/23 01:52, Michael Johnson wrote:
> TLDR: The Designate team would like to deprecate the backend agent
> framework and the agent based backends due to lack of development and
> design issues with the current implementation. The following backends
> would be deprecated: Bind9 (Agent), Denominator, Microsoft DNS
> (Agent), Djbdns (Agent), Gdnsd (Agent), and Knot2 (Agent).

Hi Michael,

Thanks for this.

Now, if we're going to get rid of the code soonish, can we just get rid
of the unit tests, rather than attempting to monkey-patch dnspython?
That feels safer, no? My experience with Eventlet is that monkey
patching is dangerous and often leads to disaster.

Cheers,

Thomas Goirand (zigo)




------------------------------

Message: 4
Date: Tue, 17 Jan 2023 11:04:59 +0100
From: Sylvain Bauza <sbauza at redhat.com>
To: Tobias Urdin <tobias.urdin at binero.com>
Cc: openstack-discuss <openstack-discuss at lists.openstack.org>
Subject: Re: Experience with VGPUs
Message-ID:
        <CALOCmukWP2qwfh7D8sUotSbhrpqok739s8KcvHmiXEZWT3JSfQ at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Tue, 17 Jan 2023 at 10:00, Tobias Urdin <tobias.urdin at binero.com>
wrote:

> Hello,
>
> We are using vGPUs with Nova on the OpenStack Xena release and we've had a
> fairly good experience integrating
> NVIDIA A10 GPUs into our cloud.
>
>
Great to hear, thanks for your feedback, much appreciated Tobias.


> As we see it, there are some pain points that just come with maintaining the
> GPU feature.
>
> - There is a very tight coupling of the NVIDIA driver in the guest
> (instance) and on the compute node that needs to
>   be managed.
>
>
As NVIDIA provides proprietary drivers, there isn't much we can do
upstream, even for CI testing.
Many participants in this thread raised this as a common concern and I
understand their pain, but yeah, you need third-party tooling for managing
both the driver installation and the licensing servers.


> - Doing maintenance needs more planning, e.g. powering off instances; the NVIDIA
> driver on the compute node needs to be
>   rebuilt on the hypervisor if the kernel is upgraded, unless you've implemented
> DKMS for that.
>
>
Ditto; unfortunately, I wish the driver could be less kernel-dependent, but I
don't see that happening in the foreseeable future.



> - Because we have different flavors of GPU (we split the A10 cards into
> different flavors for maximum utilization of
>   other compute resources) we added custom traits in the Placement service
> to handle that, managing it with
>   a script, since doing anything manually related to GPUs gets
> confusing quickly. [1]
>

True, that's why you can also use generic mdevs, which will create different
resource classes (but ssssht), or use the placement.yaml file to manage your
inventories.
https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/generic-mdevs.html
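For the record, the generic mdev support in that spec is driven from nova.conf on the compute node, roughly like this (the mdev type names, PCI addresses, and resource class names are examples; check `mdevctl types` on your hardware):

```ini
[devices]
enabled_mdev_types = nvidia-660,nvidia-661

[mdev_nvidia-660]
device_addresses = 0000:41:00.0
# Report this mdev type as its own resource class in Placement
mdev_class = CUSTOM_NVIDIA_A10_12Q

[mdev_nvidia-661]
device_addresses = 0000:42:00.0
mdev_class = CUSTOM_NVIDIA_A10_24Q
```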


> - Since Nova does not handle recreation of mdevs (nor uses the new libvirt
> autostart feature for mdevs) we have
>   a systemd unit that executes before the nova-compute service, walks
> all the libvirt domains, and does lookups
>   in Placement to recreate the mdevs before nova-compute starts. [2] [3] [4]
>
>
This is a known issue, and we agreed on a direction at the last PTG.
Patches are on review:
https://review.opendev.org/c/openstack/nova/+/864418

Thanks,
-Sylvain


> Best regards
> Tobias
>
> DISCLAIMER: Below is provided without any warranty of actually working for
> you or your setup and does
> very specific things that we need and is only provided to give you some
> insight and help. Use at your own risk.
>
> [1] https://paste.opendev.org/show/b6FdfwDHnyJXR0G3XarE/
> [2] https://paste.opendev.org/show/bGtO6aIE519uysvytWv0/
> [3] https://paste.opendev.org/show/bftOEIPxlpLptkosxlL6/
> [4] https://paste.opendev.org/show/bOYBV6lhRON4ntQKYPkb/
>

------------------------------

Subject: Digest Footer

_______________________________________________
openstack-discuss mailing list
openstack-discuss at lists.openstack.org


------------------------------

End of openstack-discuss Digest, Vol 51, Issue 51
*************************************************


More information about the openstack-discuss mailing list