Hi all,

I'm running OpenStack 2023.2 (Bobcat) deployed with Kolla-Ansible.

Currently I'm struggling to get the NVIDIA drivers to work in an FCOS 35 image. I know FCOS 35 is EOL and outdated, but for a K8s 1.21.11 cluster it is the only version that still works with Magnum.

So far I have managed to get GPU passthrough working for two NVIDIA Tesla T4 GPUs. I attached the GPUs to an Ubuntu instance and ran Geekbench without any issues.

Then I tried to install the latest NVIDIA drivers in FCOS 35 by running:

sudo rpm-ostree install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
systemctl reboot
rpm-ostree install --apply-live akmod-nvidia akmods pciutils rpmdevtools xorg-x11-drv-nvidia-cuda

It fails on the last command:

rpm-ostree install --apply-live akmod-nvidia akmods pciutils rpmdevtools xorg-x11-drv-nvidia-cuda

My guess is that the packages for Fedora 35 are no longer hosted?
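
To make the failure easier to diagnose, the full output of the failing command could be captured to a file and shared. The following is only a minimal sketch: `run_and_log` and the log path `/tmp/nvidia-install.log` are hypothetical names chosen for illustration, not part of rpm-ostree, and the install command is the same one as above.

```shell
#!/bin/sh
# Hypothetical helper: run a command, save stdout and stderr to a log file,
# print the log, and return the command's own exit status.
run_and_log() {
    rl_logfile="$1"
    shift
    rl_status=0
    "$@" >"$rl_logfile" 2>&1 || rl_status=$?
    cat "$rl_logfile"
    return "$rl_status"
}

# Only attempt the install where rpm-ostree actually exists (i.e. on the FCOS host).
if command -v rpm-ostree >/dev/null 2>&1; then
    run_and_log /tmp/nvidia-install.log \
        rpm-ostree install --apply-live akmod-nvidia akmods pciutils rpmdevtools xorg-x11-drv-nvidia-cuda \
        || echo "install failed, see /tmp/nvidia-install.log" >&2
fi
```

The log should then show whether the repository metadata could not be fetched or whether a specific package could not be resolved, which are different problems.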

Running the same commands on FCOS 36 and on the latest FCOS 40 works fine.

Next I created new images from the FCOS instances that have the NVIDIA driver installed:

openstack server image create --name fedora_coreos_36_nvidia_cuda fedora_coreos_35_latest_no_update --wait

This takes ages, but I can successfully deploy new instances from the images with my GPU flavor, and the GPU works fine. Again, I tested this with Geekbench.
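
Apart from Geekbench, a quick check inside an instance could look like the sketch below. It assumes `lspci` (from pciutils, installed above) and `nvidia-smi` (normally shipped with the NVIDIA CUDA driver packages) are on the PATH; `count_nvidia_devices` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count lines of lspci output (read from stdin)
# that mention NVIDIA. Prints 0 when there is no match.
count_nvidia_devices() {
    grep -ci 'nvidia' || true
}

# Is the passed-through GPU visible on the PCI bus?
if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA PCI devices: $(lspci | count_nvidia_devices)"
fi

# Can the driver see the GPU? "nvidia-smi -L" lists the detected GPUs.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
else
    echo "nvidia-smi not found; the driver userspace tools are probably not installed" >&2
fi
```

A device that shows up in `lspci` but not in `nvidia-smi -L` would point at the driver rather than at the passthrough configuration.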

But now, when I add new K8s worker nodes in a fresh nodegroup on an existing 1.21.11 cluster, the nodes do not join the cluster:

openstack coe nodegroup create \
   --node-count 1 \
   --role worker-gpu \
   --flavor gpuflavor \
   --image fedora_coreos_36_nvidia_cuda \
   $CLUSTER_ID worker-gpu

I know from the past that FCOS 35 was the only version that worked fine with K8s 1.21.11. So I tried with FCOS 35, and the nodes join fine, but unfortunately without the NVIDIA drivers.

So maybe the problem is related to this and I need to get FCOS 35 working? Or a newer FCOS version?

I'm also not sure if this is the best way of doing this. Maybe there is already a ready-made FCOS image with NVIDIA drivers?

I can find lots of outdated information on the internet.

Going forward I generally intend to use Vexxhost CAPI, which works great.

This is really just a one-off attempt to have a working NVIDIA T4 GPU in a K8s 1.21.11 cluster.

Best Regards,
Oliver