
Can't get NVIDIA support working #621

Open
PidgeyBE opened this issue Nov 21, 2024 · 2 comments

@PidgeyBE

Hi,

So I've fully installed a hardware device using an ISO built via EIB.
I've followed all the steps on https://documentation.suse.com/suse-edge/3.1/html/edge/id-nvidia-gpus-on-sle-micro.html#id-bringing-it-together-via-edge-image-builder, although I had to add compatWithCPUManager: true to kubernetes/helm/values/nvidia-device-plugin.yaml to get the device plugin working:

nvidia-device-plugin   nvidia-device-plugin-hfgl9                    1/1     Running     0          129m
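
For reference, the values override ends up looking roughly like this minimal sketch (only the compatWithCPUManager line is the addition described above; anything else already in that file stays as-is):

# kubernetes/helm/values/nvidia-device-plugin.yaml
# Sketch: compatWithCPUManager is the override that got the
# nvidia-device-plugin DaemonSet pod into the Running state here.
compatWithCPUManager: true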

Inside the device plugin pod I can run nvidia-smi.

Now the weird stuff: I cannot run nvidia-smi in any other pod deployed on k3s.
I can run podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable -it registry.suse.com/bci/bci-base:latest bash and nvidia-smi works there.
But if I deploy the same image in a pod on k3s, nvidia-smi doesn't work.

To make sure I wasn't missing any settings, I tried to grant as many privileges as possible:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod-privileged
  labels:
    app: gpu-test
spec:
  containers:
  - name: gpu-test-container
    image: registry.suse.com/bci/bci-base:latest
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 30; done;"]
    securityContext:
      privileged: true # Full access to the host
      capabilities:
        add:
          - ALL # Grant all Linux capabilities
      allowPrivilegeEscalation: true # Allow privilege escalation inside the container
      runAsUser: 0 # Run the container as root
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "all"
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never

But even in this pod, nvidia-smi does not work. It always yields: nvidia-smi: command not found

Can anybody point me in the direction I should continue digging?

Thanks, Pj

@PidgeyBE
Author

After hours of digging I found that

default-runtime: nvidia

has to be added to kubernetes/config/server.yaml as well.
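
Concretely, the k3s server config file then looks roughly like this (a sketch; keep whatever other server options you already had in that file):

# kubernetes/config/server.yaml
# Sketch: default-runtime is the only line relevant to this issue;
# it tells k3s/containerd to use the nvidia runtime for all pods by default.
default-runtime: nvidia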

Still, nvidia-smi won't work and throws

Failed to initialize NVML: Insufficient Permissions

-> To make it fully work, I have to extend the Pod spec with either

    securityContext:
      seLinuxOptions:
        type: spc_t

or

    securityContext:
      privileged: true
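
For example, a trimmed version of the test pod from my first comment works without privileged: true if it just carries the spc_t label (sketch; the pod name is only illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod-selinux
spec:
  containers:
  - name: gpu-test-container
    image: registry.suse.com/bci/bci-base:latest
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 30; done;"]
    securityContext:
      seLinuxOptions:
        type: spc_t # SELinux "super privileged container" type instead of privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never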

Is there some other undocumented flag I have to set to avoid having to extend the securityContext of every k8s pod?

@PidgeyBE
Author

PidgeyBE commented Dec 2, 2024

I tried with an older version (nvidia-container-toolkit=1.14.6-1) as well, but the issue remains.
