Add the nvidia runtime cdi #11065
Conversation
Signed-off-by: manuelbuil <[email protected]>
You should add the corresponding RuntimeClassName definition to the list at lines 1 to 11 in 7552203:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia-experimental
handler: nvidia-experimental
```
Your current example is using the legacy nvidia runtime, not the new nvidia-cdi runtime that you're adding; did you want to use runtimeClassName: nvidia-cdi? Or is the nvidia operator modifying our nvidia RuntimeClass definition so that it uses the nvidia-cdi handler?
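For reference, a minimal sketch of the extra RuntimeClass being suggested here, assuming the new handler is registered in containerd under the name nvidia-cdi (illustrative only; as the discussion below concludes, it was not added in this PR):

```yaml
# Hypothetical RuntimeClass for the new CDI handler; not part of this PR.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia-cdi   # name pods would reference via runtimeClassName
handler: nvidia-cdi  # must match the runtime name in containerd's config
```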
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master   #11065      +/-   ##
==========================================
- Coverage   49.93%   44.23%    -5.71%
==========================================
  Files         178      178
  Lines       14816    14820        +4
==========================================
- Hits         7399     6555      -844
- Misses       6069     7056      +987
+ Partials     1348     1209      -139
```

Flags with carried forward coverage won't be shown.
It's confusing and I don't have a good answer, but this is what I see:

is present in

so I don't think the operator is touching it. If I define a new runtimeclass with nvidia-cdi, things work only if I install the gpu operator with:

and I'm not sure how that works, because containerd has cdi disabled by default, so I wonder if things work at all inside the container. My guess is that the default nvidia runtime must be applying some workarounds for use cases where cdi is not enabled, and to apply those workarounds it requires the nvidia-cdi runtime to be present in the containerd config. Since this PR by itself fixes the problem that the nvidia user reported, I'd merge the PR as it is. In the next few days, with more time, I can try to dig deeper and get a better understanding of what's going on (or ideally get help from nvidia).
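For context, a rough sketch of what the relevant containerd config entries look like once the toolkit registers both runtimes; the binary paths and config layout are assumptions based on the standard nvidia-container-toolkit install, not taken from this PR:

```toml
# Illustrative sketch of the generated runtime entries (paths assumed).
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"

# The "cdi disabled by default" mentioned above refers to the CRI plugin's
# enable_cdi setting, which is off by default in containerd 1.7:
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
```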
Ok... If it is handled internally by the nvidia driver and we don't need to be able to reference it by runtimeClassName in pods, then I guess we can skip that for now. That is very unusual though. I'll have to do some more research on how this is working under the hood.
Proposed Changes
Add the nvidia-cdi runtime, so that pods can use the GPU with full functionality.
Types of Changes
Bugfix
Verification
In an environment with a GPU and the OS drivers correctly installed, deploy the gpu operator:
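The exact install command was elided from the description; a typical deployment, assuming the standard NVIDIA helm chart and the cdi flag discussed in the review above, looks roughly like:

```sh
# Illustrative only; chart values beyond the defaults are assumptions.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set cdi.enabled=true
```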
And after some minutes, check that /var/lib/rancher/rke2/agent/etc/containerd/config.toml includes the nvidia runtimes at the bottom (both nvidia and nvidia-cdi). Then create a pod that references the runtime class (the original snippet was elided; see the sketch below) and execute the command mount inside it: you should see /dev/nvidia... devices and /usr/lib/libnvidia libraries being mounted.
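A minimal sketch of such a verification pod, assuming the legacy nvidia runtime class discussed in the review above (the pod name and image are illustrative, not from the PR):

```yaml
# Hypothetical verification pod; the actual manifest was elided from the PR.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia   # handler added to containerd's config
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Then, assuming the pod name above, the mount check would be something like: kubectl exec gpu-test -- mount | grep nvidia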
Testing
Linked Issues
#11087
User-Facing Change
Further Comments