
Can't get NVIDIA support working #621

Open
PidgeyBE opened this issue Nov 21, 2024 · 2 comments

@PidgeyBE

Hi,

So I've fully installed a hardware device using an ISO built via EIB.
I've followed all the steps on https://documentation.suse.com/suse-edge/3.1/html/edge/id-nvidia-gpus-on-sle-micro.html#id-bringing-it-together-via-edge-image-builder, although I had to add compatWithCPUManager: true to kubernetes/helm/values/nvidia-device-plugin.yaml to get the device plugin working:

nvidia-device-plugin   nvidia-device-plugin-hfgl9                    1/1     Running     0          129m
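
For reference, the values override ends up looking roughly like this minimal sketch (only the compatWithCPUManager line is the addition described above; anything else already in that file stays as-is):

# kubernetes/helm/values/nvidia-device-plugin.yaml
# Sketch: compatWithCPUManager is the override that got the
# nvidia-device-plugin DaemonSet pod into the Running state here.
compatWithCPUManager: true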

Inside the device plugin pod I can run nvidia-smi.

Now the weird stuff: I cannot run nvidia-smi in any other pod deployed on k3s.
I can run podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable -it registry.suse.com/bci/bci-base:latest bash and nvidia-smi works there.
But if I deploy the same image in a pod on k3s, nvidia-smi doesn't work.

To make sure I wasn't missing any settings, I tried to grant as many privileges as possible:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod-privileged
  labels:
    app: gpu-test
spec:
  containers:
  - name: gpu-test-container
    image: registry.suse.com/bci/bci-base:latest
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 30; done;"]
    securityContext:
      privileged: true # Full access to the host
      capabilities:
        add:
          - ALL # Grant all Linux capabilities
      allowPrivilegeEscalation: true # Allow privilege escalation inside the container
      runAsUser: 0 # Run the container as root
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "all"
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never

But even in this pod, nvidia-smi does not work. It always yields: nvidia-smi: command not found

Can anybody point me in the direction I should continue digging?

Thanks, Pj

@PidgeyBE
Author

After hours of digging I found that

default-runtime: nvidia

has to be added to kubernetes/config/server.yaml as well.
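
Concretely, the k3s server config file then looks roughly like this (a sketch; keep whatever other server options you already had in that file):

# kubernetes/config/server.yaml
# Sketch: default-runtime is the only line relevant to this issue;
# it tells k3s/containerd to use the nvidia runtime for all pods by default.
default-runtime: nvidia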

Still, nvidia-smi won't work and throws

Failed to initialize NVML: Insufficient Permissions

-> To make it fully work, I have to extend the Pod spec with either

    securityContext:
      seLinuxOptions:
        type: spc_t

or

    securityContext:
      privileged: true
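
For example, a trimmed version of the test pod from my first comment works without privileged: true if it just carries the spc_t label (sketch; the pod name is only illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod-selinux
spec:
  containers:
  - name: gpu-test-container
    image: registry.suse.com/bci/bci-base:latest
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 30; done;"]
    securityContext:
      seLinuxOptions:
        type: spc_t # SELinux "super privileged container" type instead of privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never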

Is there some other undocumented flag I have to set to avoid having to extend the securityContext of every k8s pod?

@PidgeyBE
Author

PidgeyBE commented Dec 2, 2024

I tried with an older version (nvidia-container-toolkit=1.14.6-1) as well, but the issue remains.
