Hi Pawel,

I was afraid I’d hear that. I just don’t want to spend too much time learning FCOS, because I clearly see that CAPI is the best way forward.

My approach to getting the NVIDIA driver working in FCOS is quite simple.

I deploy the NVIDIA GPU Operator using Helm, without the driver:

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --version=v22.9.0 \
  --set toolkit.version=v1.11.0-ubi8
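
If the nvidia chart repo isn’t added yet, that has to happen first (this is the standard NVIDIA Helm repo URL):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update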


I add a new node group with one worker node to my existing K8s cluster, using e.g. the latest FCOS image and a flavor that has a GPU.
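
With the Magnum CLI that looks roughly like this (cluster, node group, image and flavor names are just placeholders):

openstack coe nodegroup create \
  --node-count 1 \
  --image fedora-coreos-latest \
  --flavor <gpu-flavor> \
  <cluster-name> gpu-workers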

Next I assign a Floating IP to the new worker node and ssh to it.
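
Something like this (network and server names are placeholders; core is the default FCOS user):

openstack floating ip create <external-network>
openstack server add floating ip <worker-server> <floating-ip>
ssh core@<floating-ip>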

Then I just follow the steps to install the rpmfusion.org repo, reboot and then install the driver.
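
Roughly like this (the RPM Fusion release URLs and package names are the standard documented ones, so treat the exact names as an assumption):

# layer the RPM Fusion free + nonfree release packages via rpm-ostree
sudo rpm-ostree install \
  https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm \
  https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
sudo systemctl reboot
# after the reboot, layer the NVIDIA kernel module and CUDA userspace packages
sudo rpm-ostree install akmod-nvidia xorg-x11-drv-nvidia-cuda
sudo systemctl reboot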

After that I take an image of the worker node:

openstack server image create
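
With a name it looks something like this (server and image names are placeholders):

openstack server image create --wait --name fcos-gpu-worker <worker-server>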

And then I try to deploy a node group using this new image, but it just doesn’t work. I already checked the worker Heat logs but couldn’t find any clue why the node is not joining the cluster. To be honest, I can’t even remember whether Heat actually deploys the new worker node using the modified image. It seems there is some mechanism in FCOS that can only run once (presumably Ignition, which only runs on first boot), and I would need to test this. Or maybe I’m completely off track and need to use the assembler (coreos-assembler).
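
For reference, checking what Heat actually did with the node group can be done with something like this (names are placeholders; the stack ID comes from the cluster itself):

openstack coe cluster show <cluster-name> -c stack_id
openstack stack resource list --nested-depth 2 <stack-id>
openstack stack event list <stack-id>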

I’m currently also looking into getting the GPU Operator running with a working NVIDIA driver container. I made some good progress yesterday. To me it seems that the existing code just needs some fixes to the download links for the RPMs.
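
Once that driver container builds, I’d expect to point the operator at it via Helm values roughly like this (registry, image name and version are placeholders, and the value keys are my reading of the gpu-operator chart):

helm upgrade -n gpu-operator <release-name> nvidia/gpu-operator \
  --set driver.enabled=true \
  --set driver.repository=<my-registry> \
  --set driver.image=<driver-image-name> \
  --set driver.version=<driver-version>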

Cheers,
Oliver

Sent from my iPhone

On 13.08.2024 at 09:51, pawel.kubica@comarch.com wrote:

Hi Greg,

Based on my personal tests, the Magnum CAPI Helm driver requires 2023.1 (I didn't fully test the Magnum CAPI driver yet).
I managed to run the Magnum CAPI Helm driver on Wallaby, but this requires one little fix in the Magnum code (regarding cluster certificate creation).

Kind regards