Hi all,
I'm running OpenStack 2023.2 (Bobcat) deployed with Kolla-Ansible.
Currently I'm struggling to get the NVIDIA drivers working in an FCOS 35 image. I know FCOS 35 is EOL and outdated, but for a K8s 1.21.11 cluster it is the only version that still works with Magnum.
So far I have managed to get GPU passthrough working for two NVIDIA Tesla T4 GPUs. I attached the GPUs to an Ubuntu instance and ran Geekbench without any issues.
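For reference, the passthrough part is just the standard Nova PCI setup, roughly like this (10de:1eb8 should be the Tesla T4, the alias name "t4" is arbitrary, and with Kolla-Ansible this goes into the nova.conf overrides under /etc/kolla/config):

# nova.conf on the compute node; the [pci] alias also needs to be set on the API/scheduler side
[pci]
device_spec = { "vendor_id": "10de", "product_id": "1eb8" }
alias = { "vendor_id": "10de", "product_id": "1eb8", "device_type": "type-PCI", "name": "t4" }

# flavor that requests one T4 via the alias
openstack flavor set gpuflavor --property "pci_passthrough:alias"="t4:1"

(On older releases the device_spec option is called passthrough_whitelist.)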
Then I tried to install the latest NVIDIA drivers in FCOS 35 by running:

systemctl reboot
rpm-ostree install --apply-live akmod-nvidia akmods pciutils rpmdevtools xorg-x11-drv-nvidia-cuda

It fails on the rpm-ostree install command. I guess the binaries are no longer hosted?
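In case it is relevant: akmod-nvidia and xorg-x11-drv-nvidia-cuda come from RPM Fusion (nonfree), not the Fedora repos, so my suspicion is that the Fedora 35 repo metadata is simply gone or has moved now that F35 is EOL. Roughly what I would check first (the release RPM URLs are the standard RPM Fusion ones, from memory):

rpm-ostree install \
  https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm \
  https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# then refresh the repo metadata to see whether the F35 repos still resolve before retrying the driver install
rpm-ostree refresh-md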
Running the same on FCOS 36 and latest FCOS 40 works fine.
Next I took new images (snapshots) of the FCOS instances with the NVIDIA driver installed:
openstack server image create --name fedora_coreos_36_nvidia_cuda fedora_coreos_35_latest_no_update --wait
This takes ages, but I can successfully deploy new instances from the images with my GPU flavor and the GPU works fine, again tested with Geekbench.
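One thing I'm not sure gets inherited by the snapshot is the os_distro property that Magnum uses to pick its driver, so it is probably worth checking and setting it on the new image (value as used for the stock FCOS images):

openstack image show fedora_coreos_36_nvidia_cuda -c properties
openstack image set --property os_distro='fedora-coreos' fedora_coreos_36_nvidia_cuda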
But now, when adding new K8s worker nodes in a fresh nodegroup on an existing K8s 1.21.11 cluster, the nodes do not join the cluster:
openstack coe nodegroup create \
--node-count 1 \
--role worker-gpu \
--flavor gpuflavor \
--image fedora_coreos_36_nvidia_cuda \
$CLUSTER_ID worker-gpu
I have tested the latest available images of FCOS 36 and they work fine, so something must be wrong with how I created the new image. Is there some magic that needs to be done?
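Roughly how I would narrow it down ($STACK_ID here is the nodegroup's Heat stack from the nodegroup show output, and the heat-container-agent unit name is from memory, so it may differ):

openstack coe nodegroup show $CLUSTER_ID worker-gpu
openstack stack resource list --nested-depth 2 $STACK_ID | grep -vi complete

# on the stuck worker node itself, to see whether the Magnum bootstrap ran at all
journalctl -u heat-container-agent --no-pager | tail -n 50
podman ps -a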
I'm also not sure if this is the best way of doing this. Maybe there is already a ready-made FCOS image with the NVIDIA drivers? I can only find lots of outdated information on the internet.
Generally I intend to use the Vexxhost Cluster API (CAPI) driver from now on, which works great.
This is really just a one-off attempt to get a working NVIDIA T4 GPU in a K8s 1.21.11 cluster.
Best Regards,
Oliver