Hi,

So I've fully installed a hardware device using an ISO built via EIB. I've followed all the steps on https://documentation.suse.com/suse-edge/3.1/html/edge/id-nvidia-gpus-on-sle-micro.html#id-bringing-it-together-via-edge-image-builder, although I had to add compatWithCPUManager: true to kubernetes/helm/values/nvidia-device-plugin.yaml to get the device plugin working.
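For completeness, that change is just one line appended to the Helm values file referenced by my EIB definition (the path is the one from my definition directory; other layouts may differ):

# Helm values override for the NVIDIA device plugin chart, as referenced by the EIB definition
echo 'compatWithCPUManager: true' >> kubernetes/helm/values/nvidia-device-plugin.yaml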
Inside the device plugin pod I can run nvidia-smi.
Now the weird stuff: I cannot run nvidia-smi in any other pod deployed on k3s.
I can run podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable -it registry.suse.com/bci/bci-base:latest bash and there nvidia-smi works.
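As far as I understand, podman resolves --device nvidia.com/gpu=all against the CDI specs generated on the host, so the working podman run should at least prove the host/CDI side is fine. These are the generic host-side checks I can post output from if useful (nvidia-container-toolkit CLI, nothing SUSE-specific):

# List the CDI devices that podman's --device nvidia.com/gpu=all resolves against
nvidia-ctk cdi list
# CDI spec files are normally generated under one of these directories
ls /etc/cdi /var/run/cdi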
But if I deploy the same image in a pod on k3s, nvidia-smi doesn't work.
To make sure I don't miss any settings, I tried to give the pod as many privileges as possible:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod-privileged
  labels:
    app: gpu-test
spec:
  containers:
  - name: gpu-test-container
    image: registry.suse.com/bci/bci-base:latest
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 30; done;"]
    securityContext:
      privileged: true                 # Full access to the host
      capabilities:
        add:
        - ALL                          # Grant all Linux capabilities
      allowPrivilegeEscalation: true   # Allow privilege escalation inside the container
      runAsUser: 0                     # Run the container as root
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "all"
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never
But even in this pod, nvidia-smi does not work... It always yields nvidia-smi: command not found
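Concretely, this is how I reproduce it and what I would check next inside the pod (pod name as in the manifest above; the checks themselves are generic):

kubectl exec -it gpu-test-pod-privileged -- bash
# then, inside the pod:
nvidia-smi                    # -> nvidia-smi: command not found (the error above)
which nvidia-smi              # is the driver userspace mounted into the container at all?
ls -l /dev/nvidia*            # are the device nodes injected?
ldconfig -p | grep -i nvidia  # are the driver libraries on the loader path?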
Can anybody guide me in which direction I should continue to dig?
Thanks, Pj