
[Release-1.31] - Nvidia operator not working correctly #11088

Closed
manuelbuil opened this issue Oct 11, 2024 · 1 comment

@manuelbuil (Contributor)

Backport fix for Nvidia operator not working correctly

@VestigeJ

## Environment Details
Reproduced using VERSION=v1.31.1+k3s1
Validated using COMMIT=221ab22ca911b548d7278afb0df7fca17d2fe596

Infrastructure

  • Cloud: p3.2xlarge instance type
00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

sudo nvidia-smi
Mon Oct 21 23:02:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           Off |   00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0             25W /  300W |       1MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found  // note: this changes very quickly if you test with the vector-add image
+-----------------------------------------------------------------------------------------+
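Because the vector-add process only appears in the `nvidia-smi` output for a moment, a small polling loop makes it easier to catch. `poll_for_output` is a hypothetical helper written for this sketch, not part of any tool used above:

```shell
# poll_for_output CMD PATTERN TRIES: re-run CMD until PATTERN shows up in its
# output. Useful for short-lived GPU processes that flash past in nvidia-smi.
poll_for_output() {
  local cmd="$1" pattern="$2" tries="$3" i
  for i in $(seq 1 "$tries"); do
    if eval "$cmd" | grep -q "$pattern"; then
      echo "matched"
      return 0
    fi
    sleep 0.2
  done
  echo "no match"
  return 1
}

# e.g.: poll_for_output "nvidia-smi" "vectorAdd" 50
```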

Node(s) CPU architecture, OS, and version:

Linux 6.4.0-150600.23.17-default x86_64 GNU/Linux
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP6"

Cluster Configuration:

NAME              STATUS   ROLES                       AGE   VERSION
ip-1-1-1-23       Ready    control-plane,etcd,master   53m   v1.31.1+k3s-221ab22c

Config.yaml:

node-external-ip: 1.1.1.23
token: YOUR_TOKEN_HERE
write-kubeconfig-mode: 644
debug: true
cluster-init: true
embedded-registry: true
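k3s reads this file from /etc/rancher/k3s/config.yaml. As a sketch, `install_config` (a hypothetical helper) copies the file above into place; the optional root argument only exists so the copy can be exercised against a scratch directory instead of the real filesystem:

```shell
# install_config SRC [ROOT]: place a k3s config file at
# ROOT/etc/rancher/k3s/config.yaml (ROOT defaults to the real filesystem).
install_config() {
  local src="$1" root="${2:-}"
  mkdir -p "$root/etc/rancher/k3s"
  cp "$src" "$root/etc/rancher/k3s/config.yaml"
}

# e.g.: install_config config.yaml   # run as root for the real path
```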

Reproduction && Validation

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0\nvm.overcommit_memory=1\nkernel.panic=10\nkernel.panic_on_oops=1\n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ sudo zypper ar https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
$ sudo zypper modifyrepo --enable nvidia-container-toolkit-experimental
$ sudo zypper --gpg-auto-import-keys install -y nvidia-container-toolkit
$ VERSION=v1.31.1+k3s1
$ sudo INSTALL_K3S_VERSION=$VERSION INSTALL_K3S_EXEC=server ./install-k3s.sh
$ kg runtimeclass  // kg: shell alias for 'kubectl get'
$ vim nvidia-pod.yaml
$ k apply -f nvidia-pod.yaml
$ nvidia-ctk cdi list //interestingly this still shows 0 devices
    sudo nvidia-ctk cdi list
    INFO[0000] Found 0 CDI devices
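The 0-device result is consistent with no CDI spec having been generated yet. As a sketch, `ensure_cdi_spec` (a hypothetical helper) checks for a spec and prints the `nvidia-ctk cdi generate` command from the NVIDIA Container Toolkit documentation when one is missing:

```shell
# ensure_cdi_spec [SPEC]: report whether a CDI spec exists at SPEC (default
# /etc/cdi/nvidia.yaml) and, if not, show the command that generates one.
ensure_cdi_spec() {
  local spec="${1:-/etc/cdi/nvidia.yaml}"
  if [ -s "$spec" ]; then
    echo "CDI spec present: $spec"
  else
    echo "no CDI spec at $spec; generate one with:"
    echo "  sudo nvidia-ctk cdi generate --output=$spec"
  fi
}
```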

$ sudo zypper addrepo --refresh 'https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/' NVIDIA
$ sudo zypper --gpg-auto-import-keys refresh
$ sudo zypper install -y nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06
$ sudo reboot
$ vim cuda-add.yaml
$ k apply -f cuda-add.yaml  // note: this vector-add image still works on newer drivers and different OSes; it was more brittle in the past. Its output flashes across the nvidia-smi process list quickly, so watch for the process entry to appear.
$ k delete -f cuda-add.yaml
$ k apply -f pytorch-gpu.yaml
$ sudo cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
$ COMMIT=221ab22ca911b548d7278afb0df7fca17d2fe596
$ sudo INSTALL_K3S_COMMIT=$COMMIT INSTALL_K3S_EXEC=server ./install-k3s.sh
$ sudo cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
$ kgp -A  // kgp: alias for 'kubectl get pods'; ensure all pods and the node remain healthy

Results:

Before, from the existing release v1.31.1+k3s1 (truncated to only the nvidia-related entries):

$ sudo cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = true

The newest COMMIT installation now shows additional nvidia-cdi entries in config.toml:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
  SystemdCgroup = true

This seems to be required now but isn't yet well documented on the k3s side.
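To verify mechanically that an install picked up both runtime sections shown above, a grep-based check works; `check_nvidia_runtimes` is a hypothetical helper written for this sketch:

```shell
# check_nvidia_runtimes CONFIG: report whether the nvidia and nvidia-cdi
# runtime sections are present in a containerd config.toml.
check_nvidia_runtimes() {
  local config="$1" rt
  for rt in nvidia nvidia-cdi; do
    if grep -q "runtimes.\"$rt\"\]" "$config" 2>/dev/null; then
      echo "$rt: present"
    else
      echo "$rt: missing"
    fi
  done
}

# e.g.: check_nvidia_runtimes /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```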

$ cat operator.yaml

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock

$ cat cuda-add.yaml

apiVersion: v1
kind: Pod
metadata:
  name: test-cuda-vector-add
spec:
  restartPolicy: "OnFailure"
  runtimeClassName: "nvidia"
  terminationGracePeriodSeconds: 15
  containers:
  - name: vectoradd-cuda
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
         nvidia.com/gpu: 1
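The CUDA vectorAdd sample prints "Test PASSED" on success, so the pod's logs can be checked instead of racing the nvidia-smi output. `check_vector_add` is a hypothetical filter for that log line:

```shell
# check_vector_add: read pod logs on stdin and report whether the CUDA
# vectorAdd sample's success marker appeared.
check_vector_add() {
  if grep -q "Test PASSED"; then
    echo "vector-add OK"
  else
    echo "vector-add FAILED"
  fi
}

# e.g.: k logs test-cuda-vector-add | check_vector_add
```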

$ cat nvidia-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

$ cat pytorch-gpu.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-test
spec:
  runtimeClassName: nvidia
  containers:
  - name: pytorch-container
    image: pytorch/pytorch:latest   # Use the latest PyTorch image
    command: ["/bin/bash", "-c", "sleep infinity"]  # Keeps the container running
    resources:
      limits:
        nvidia.com/gpu: 1           # If using GPUs, request a GPU
    env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

$ k exec --stdin --tty pytorch-test -- /bin/bash

root@pytorch-test:/workspace# mount | grep -i nvidia
/dev/xvda3 on /usr/lib64/libnvidia-egl-gbm.so.1.1.1 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib64/libnvidia-egl-wayland.so.1.1.13 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /etc/vulkan/icd.d/nvidia_icd.json type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /etc/vulkan/implicit_layer.d/nvidia_layers.json type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/share/nvidia/nvoptix.bin type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/share/glvnd/egl_vendor.d/10_nvidia.json type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib64/xorg/modules/drivers/nvidia_drv.so type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
tmpfs on /etc/nvidia/nvidia-application-profiles-rc.d type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
/dev/xvda3 on /usr/bin/nvidia-smi type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/bin/nvidia-debugdump type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/bin/nvidia-persistenced type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/bin/nvidia-cuda-mps-control type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/bin/nvidia-cuda-mps-server type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libvdpau_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-ml.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-opencl.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-gpucomp.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-nvvm.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-glcore.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-tls.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-glsi.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-fbc.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libGLX_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libEGL_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/xvda3 on /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.560.35.03 type xfs (ro,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
devtmpfs on /dev/nvidiactl type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=7833885,mode=755,inode64)
devtmpfs on /dev/nvidia-uvm type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=7833885,mode=755,inode64)
devtmpfs on /dev/nvidia-uvm-tools type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=7833885,mode=755,inode64)
devtmpfs on /dev/nvidia-modeset type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=7833885,mode=755,inode64)
devtmpfs on /dev/nvidia0 type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=7833885,mode=755,inode64)
proc on /proc/driver/nvidia/gpus/0000:00:1e.0 type proc (ro,nosuid,nodev,noexec,relatime)
