[MAGNUM] Fedora Core OS include NVIDIA GPU drivers
Hi all,

I'm running OpenStack 2023.2 (Bobcat) deployed with Kolla-Ansible. Currently I'm struggling to get the NVIDIA drivers to work in an FCOS 35 image. I know FCOS 35 is EOL and outdated, but for a K8s 1.21.11 cluster it is the only version that still works with Magnum.

So far I have managed to get GPU passthrough working for two NVIDIA Tesla T4 GPUs. I attached the GPUs to an Ubuntu instance and ran Geekbench without any issues. Then I tried to install the latest NVIDIA drivers in FCOS 35 by running:

sudo rpm-ostree install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
systemctl reboot
rpm-ostree install --apply-live akmod-nvidia akmods pciutils rpmdevtools xorg-x11-drv-nvidia-cuda

It fails on:

rpm-ostree install --apply-live akmod-nvidia akmods pciutils rpmdevtools xorg-x11-drv-nvidia-cuda
error: Cannot download rpm-build-libs-4.17.0-4.fc35.x86_64.rpm: All mirrors were tried; Last error: Status code: 403 for https://fedoraproject-updates-archive.fedoraproject.org/fedora/35/x86_64/rpm...

I guess the binaries are no longer hosted? Running the same on FCOS 36 and the latest FCOS 40 works fine.

Next I took new images from the FCOS instances that have the NVIDIA driver installed:

openstack server image create --name fedora_coreos_36_nvidia_cuda fedora_coreos_35_latest_no_update --wait

This takes ages, but anyway, I can successfully deploy new instances from the images with my GPU flavor and the GPU works fine, again tested with Geekbench. But when I add new K8s worker nodes in a fresh nodegroup on an existing K8s 1.21.11 cluster, the nodes do not join the cluster:

openstack coe nodegroup create \
  --node-count 1 \
  --role worker-gpu \
  --flavor gpuflavor \
  --image fedora_coreos_36_nvidia_cuda \
  $CLUSTER_ID worker-gpu

I have tested the latest available images of FCOS 36 and they work fine, so something must be wrong with how I created the new image. Is there some magic that needs to be done? I'm also not sure if this is the best way of doing this. Maybe there is already a ready-made FCOS image with NVIDIA drivers? I can find lots of outdated information on the internet.

Generally I intend to use the Vexxhost CAPI driver from now on, which works great. This is really just a one-off attempt to have a working NVIDIA T4 GPU in a K8s 1.21.11 cluster.

Best Regards,
Oliver
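For the 403 above: Fedora 35 is EOL, so its packages now live only on the archive mirrors. A minimal, untested sketch of pointing the instance at the archive before retrying the layering (the repo file name and section IDs here are made up for illustration; the dead stock fedora/fedora-updates repos may also need to be disabled):

sudo tee /etc/yum.repos.d/fedora-35-archive.repo <<'EOF'
[fedora-35-archive]
name=Fedora 35 - archive
baseurl=https://archives.fedoraproject.org/pub/archive/fedora/linux/releases/35/Everything/x86_64/os/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-35-primary

[fedora-35-updates-archive]
name=Fedora 35 updates - archive
baseurl=https://archives.fedoraproject.org/pub/archive/fedora/linux/updates/35/Everything/x86_64/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-35-primary
EOF
# Retry the layering against the archive repos.
sudo rpm-ostree install --apply-live akmod-nvidia akmods pciutils rpmdevtools xorg-x11-drv-nvidia-cuda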
Hi Oliver, On 24/7/2024 1:05 am, Oliver Weinmann wrote:
Hi all,
I'm running OpenStack 2023.2 (Bobcat) deployed with Kolla-Ansible.
Currently I'm struggling to get the NVIDIA drivers to work in a FCOS 35 image. I know FCOS 35 is EOL and outdated, but for a K8s 1.21.11 cluster it is the only version that still works with Magnum.
For Bobcat you can run v1.26.8 and FCOS38. Have you tried them?

[1] https://docs.openstack.org/magnum/latest/user/index.html#supported-versions

- Jake
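For illustration, a hedged sketch of a cluster template pinned to those versions; the flavor, network and image names are placeholders and any other labels you normally set are omitted:

# Sketch only: pin the Kubernetes version via the kube_tag label and use an FCOS 38 image.
openstack coe cluster template create k8s-v1.26.8-fcos38 \
  --coe kubernetes \
  --image fedora-coreos-38 \
  --external-network public \
  --master-flavor m1.medium \
  --flavor m1.medium \
  --labels kube_tag=v1.26.8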
Hi Jake,

Thanks for the hint. I'm using the Vexxhost CAPI driver for newer K8s releases, and that works brilliantly. But in this case we need to stick with the old K8s.

I managed to get the NVIDIA driver working in FCOS 36, but as soon as I create a new image from the instance and try to use it for a new node group in the K8s cluster, the new worker node never joins the cluster. I couldn't find anything in the docs, but it seems that after adding the driver I need to reseal the instance to make it usable?

Cheers,
Oliver

Sent from my iPhone
On 24/7/2024 9:53 pm, Oliver Weinmann wrote:
I managed to get the NVIDIA driver working in fcos36 but as soon as I create a new image from the instance and try to use it for a new node group in the k8s cluster, the new worker node is never joined to the cluster. I’m not familiar with docs but it seems that after adding the driver I need to reseal the instance to make it usable?
I think this may be due to cloud-init and heat having run once in your instance/image, so they do not run again when you create new instances from that image. You may need to clear out the necessary files.

Regards,
Jake
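A minimal sketch of what that clearing-out could look like before taking the image, assuming cloud-init and the heat agents are what is holding the state (the exact directories are guesses, not verified against the Magnum driver):

# Reset cloud-init so it runs again on the next boot; --logs also removes its logs.
sudo cloud-init clean --logs
# Guesses: remove any state left behind by the heat/os-collect-config agents from the first deployment.
sudo rm -rf /var/lib/cloud /var/lib/os-collect-config /var/lib/heat-config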
Hi Jake,

I have the same feeling. I will keep digging. Cloud-init for sure.

Sent from my iPhone
I managed to run an NVIDIA GPU on FCOS 35 using this container image: https://hub.docker.com/r/fifofonix/driver

Kind regards
Paweł
Hi Pawel,

Thanks for the hint. Funny, because I had just started looking into this issue again and then saw your reply. Do you also use Magnum?

Cheers,
Oliver

Sent from my iPhone
Hi Olivier,

Yes, I'm currently running the Magnum service (Heat driver) on OpenStack Ussuri, Wallaby and 2023.1. In the near future I'm planning to migrate to the Magnum CAPI driver.

Kind regards
Hi Pawel,

How do you automate the deployment of the driver? I managed to get the driver working by manually installing it in the Fedora CoreOS instance that I add as a node group to an existing K8s cluster. But I can't seem to manage to install the driver, create a new image from the instance and use this as a template for my node groups. I also tried to get the NVIDIA GPU Operator working, but that didn't work since it doesn't support Fedora CoreOS.

Cheers,
Oliver

Sent from my iPhone
Hi Olivier,

To automate the deployment of the driver I'm using custom FCOS images with an additional package (nvidia-container-toolkit) and extended Magnum Heat templates (additional scripts) that:
- label GPU nodes with nvidia.com/gpu=present
- install a container image with additional kernel modules for the NVIDIA GPU (https://hub.docker.com/r/fifofonix/driver)
- reconfigure the container runtime
- install nvidia-device-plugin (https://nvidia.github.io/k8s-device-plugin)

I haven't tried the NVIDIA GPU Operator yet (but I heard that FCOS support is not working properly).

Kind regards
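For reference, a rough sketch of the first and last of those steps; the label is the one named above, the Helm repo is the URL above, and the release/namespace names are placeholders:

# Label the GPU worker so the GPU daemonsets/workloads can target it.
kubectl label node <gpu-node-name> nvidia.com/gpu=present
# Install the NVIDIA device plugin from its Helm repo (chart name as published upstream).
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace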
Hi Paweł,

Thanks for the info. How do you customise the FCOS image? I mean, I managed to install the drivers from rpmfusion.org in FCOS 37 and they work just fine. I even managed to get the GPU Operator working. But I just can't install the drivers in FCOS 37, take an image (e.g. openstack server image create) and use it. Whenever I try to use the modified image with Magnum, the deployment is just stuck. I believe there is some sort of mechanism in FCOS similar to cloud-init that can only run once and I would need to reset it? Or do you really build a custom image from scratch?

Best Regards,
Oliver
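For what it's worth, the run-once mechanism on FCOS is Ignition rather than cloud-init: it only runs when the first-boot marker under /boot is present, and that marker is removed after the first boot, so an image taken from a provisioned instance never re-runs it. An untested sketch of forcing it to run again before taking the image (the marker normally also carries first-boot kernel arguments, so an empty file may not be enough on its own):

# Assumption: re-creating the marker makes Ignition run again on the next boot,
# so a fresh instance built from this image would pick up the new Magnum/Heat user data.
sudo touch /boot/ignition.firstboot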
Hi Olivier,

I'm using the project below to build customized FCOS images for OpenStack: https://github.com/coreos/coreos-assembler/blob/main/docs/building-fcos.md

Does your modified image boot correctly when used with Magnum - can you log in to the VM via SSH, or are there problems with the Magnum deployment via the Heat Container Agent? What packages/drivers are you installing on FCOS from rpmfusion.org, and what does your values.yaml for the NVIDIA GPU Operator look like?

Kind regards
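For anyone following along, the basic coreos-assembler flow from that document looks roughly like this ('cosa' is the podman wrapper alias from the upstream docs; the buildextend step for an OpenStack artifact is an assumption):

cosa init https://github.com/coreos/fedora-coreos-config   # fetch the config; src/config is where manifest.yaml lives
cosa fetch                                                 # download the pinned packages
cosa build                                                 # compose the ostree and build the qemu image
cosa buildextend-openstack                                 # assumed: build the OpenStack qcow2 artifact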
On 8/8/24 03:06, pawel.kubica@comarch.com wrote:
Yes, I'm currently running Magnum service (Heat driver) on Openstack Ussuri, Wallaby and 2023.1. In near future I'm planning to migrate to Magnum CAPI driver.
A side question: Would you recommend ensuring one is running a certain release of OpenStack before embarking on using Cluster API? We have made it as far as Wallaby and plan to keep upgrading, but also want to try CAPI at some point. Thanks, Greg.
Hi Greg,

Based on my personal tests, the Magnum CAPI Helm driver requires 2023.1 (I haven't fully tested the Magnum CAPI driver yet). I managed to run the Magnum CAPI Helm driver on Wallaby, but this requires one little fix in the Magnum code (regarding cluster certificate creation).

Kind regards
Hi Pawel,

I was afraid to hear that. I just don't want to spend too much time on learning FCOS, because I clearly see that CAPI is the best way forward.

My approach to getting the NVIDIA driver working in FCOS is quite simple. I deploy the NVIDIA GPU Operator using Helm without the driver:

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --version=v22.9.0 --set toolkit.version=v1.11.0-ubi8

I add a new node group to my existing K8s cluster with one worker node, e.g. the latest FCOS image and a flavor that has a GPU. Next I assign a floating IP to the new worker node and SSH to it. Then I just follow the steps to install the rpmfusion.org repo, reboot and then install the driver. After that I take an image of the worker node:

openstack server image create

And then I try to deploy a node group using this new image, but it just doesn't work. I already checked the worker Heat logs but couldn't find any clue why the node is not joining the cluster. To be honest, I can't even remember if Heat actually deploys the new worker node using the modified image. It seems there is some mechanism in FCOS that can only run once, and I would need to test this. Or maybe I'm completely off track and need to use the assembler.

I'm currently also looking into getting the GPU Operator working with a working NVIDIA driver container. I made some good progress yesterday. To me it seems that the existing code just needs some fixing of the download links for the RPMs.

Cheers,
Oliver

Sent from my iPhone
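A couple of hedged checks that might narrow down where it stops (the heat-container-agent unit name is an assumption about what the Magnum FCOS driver runs on the node):

# Did Nova actually boot the worker from the modified image?
openstack server show <worker-node> -c image -c status
# On the worker itself: did the Magnum/Heat agent start, and what did it log?
sudo systemctl status heat-container-agent
sudo journalctl -u heat-container-agent --no-pager | tail -n 100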
Hi Paweł,

So I read a lot about COSA (Fedora CoreOS Assembler) and pretty quickly managed to build modified images with new packages. But I still fail to include the rpmfusion NVIDIA driver. Currently I'm stuck here (FCOS 38):

akmods.post: Created symlink /etc/systemd/system/multi-user.target.wants/akmods.service → /usr/lib/systemd/system/akmods.service.
Running post scripts... akmod-nvidia
akmod-nvidia.post: Building /usr/src/akmods/nvidia-kmod-550.78-1.fc38.src.rpm for kernel 6.8.9-100.fc38.x86_64
akmod-nvidia.post: warning: user mockbuild does not exist - using root
akmod-nvidia.post: warning: group mock does not exist - using root
akmod-nvidia.post: warning: user mockbuild does not exist - using root
akmod-nvidia.post: warning: group mock does not exist - using root
akmod-nvidia.post: warning: user mockbuild does not exist - using root
akmod-nvidia.post: warning: group mock does not exist - using root
akmod-nvidia.post: Installing /usr/src/akmods/nvidia-kmod-550.78-1.fc38.src.rpm
akmod-nvidia.post: Building target platforms: x86_64
akmod-nvidia.post: Building for target x86_64
akmod-nvidia.post: setting SOURCE_DATE_EPOCH=1714089600
akmod-nvidia.post: warning: Could not canonicalize hostname: e7722bb66786
akmod-nvidia.post: error: Failed build dependencies:
akmod-nvidia.post:     /usr/bin/kmodtool is needed by nvidia-kmod-3:550.78-1.fc38.x86_64
akmod-nvidia.post:     gcc is needed by nvidia-kmod-3:550.78-1.fc38.x86_64
akmod-nvidia.post:     kernel-devel-uname-r = 6.8.9-100.fc38.x86_64 is needed by nvidia-kmod-3:550.78-1.fc38.x86_64
akmod-nvidia.post:     xorg-x11-drv-nvidia-kmodsrc = 3:550.78 is needed by nvidia-kmod-3:550.78-1.fc38.x86_64
akmod-nvidia.post:
akmod-nvidia.post: RPM build warnings:
akmod-nvidia.post:     user mockbuild does not exist - using root
akmod-nvidia.post:     group mock does not exist - using root
akmod-nvidia.post:     user mockbuild does not exist - using root
akmod-nvidia.post:     group mock does not exist - using root
akmod-nvidia.post:     user mockbuild does not exist - using root
akmod-nvidia.post:     group mock does not exist - using root
akmod-nvidia.post:     Could not canonicalize hostname: e7722bb66786
Running post scripts... done
error: Running %post for akmod-nvidia: bwrap(/bin/sh): Child process killed by signal 1
failed to execute cmd-build: exit status 1

Would you mind sharing your manifest.yaml? Mine looks like this:

[coreos-assembler]$ cat src/config/manifest.yaml
variables:
  stream: stable
  prod: true
releasever: 38
packages:
  - gcc
  - kernel-devel
  - kernel-headers
  - make
  - dkms
  - acpid
  - libglvnd-glx
  - libglvnd-opengl
  - libglvnd-devel
  - pkgconfig
  #- kmodtool
  - akmod-nvidia
  - akmods
  - pciutils
  #- rpmdevtools
  - xorg-x11-drv-nvidia-cuda
repos:
  # These repos are there to make it easier to add new packages to the OS and to
  # use `cosa fetch --update-lockfile`; but note that all package versions are
  # still pinned. These repos are also used by the remove-graduated-overrides
  # GitHub Action.
  - fedora-archive
  #- fedora-updates
  - fedora-archive-updates
  - rpmfusion-free-updates-testing
  - rpmfusion-free-updates
  - rpmfusion-free
  - rpmfusion-nonfree-updates-testing
  - rpmfusion-nonfree-updates
  - rpmfusion-nonfree
include: manifests/fedora-coreos.yaml

It is strange that it complains about missing gcc, because that is installed for sure. I checked it using cosa run.

I didn't manage to get the GPU container working yet. It fails to build the driver. I need some more time to debug this.

Cheers,
Oliver
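Going purely by the "Failed build dependencies" lines in the log above, one untested guess would be to also layer the packages that provide those BuildRequires, i.e. uncomment kmodtool and add the kmodsrc package; this may still not be enough, since the akmod %post is building inside the compose sandbox:

packages:
  - kmodtool                      # provides /usr/bin/kmodtool
  - xorg-x11-drv-nvidia-kmodsrc   # provides xorg-x11-drv-nvidia-kmodsrc = 3:550.78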
participants (5):
- Gregory Orange
- Jake Yip
- Oliver Weinmann
- Oliver Weinmann
- pawel.kubica@comarch.com