Add the nvidia runtime cdi #11065
Conversation
Signed-off-by: manuelbuil <[email protected]>
You should add the corresponding RuntimeClassName definition to the list at lines 1 to 11 in 7552203:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia-experimental
handler: nvidia-experimental
```
Your current example is using the legacy nvidia runtime, not the new nvidia-cdi runtime that you're adding; did you want to use runtimeClassName: nvidia-cdi? Or is the nvidia operator modifying our nvidia RuntimeClass definition so that it uses the nvidia-cdi handler?
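For reference, a minimal sketch of the extra RuntimeClass being suggested here, assuming the new handler is registered in containerd under the name nvidia-cdi (illustrative only; as the discussion below concludes, it was not added in this PR):

```yaml
# Hypothetical RuntimeClass for the new CDI handler; not part of this PR.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia-cdi   # name pods would reference via runtimeClassName
handler: nvidia-cdi  # must match the runtime name in containerd's config
```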
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master   #11065      +/-   ##
==========================================
- Coverage   49.93%   44.23%    -5.71%
==========================================
  Files         178      178
  Lines       14816    14820        +4
==========================================
- Hits         7399     6555      -844
- Misses       6069     7056      +987
+ Partials     1348     1209      -139
```

Flags with carried forward coverage won't be shown.
It's confusing and I don't have a good answer, but this is what I see:

is present in

so I don't think the operator is touching it. If I define a new runtimeclass with nvidia-cdi, things work only if I install the gpu operator with:

and I'm not sure how that works, because containerd has cdi disabled by default, so I wonder if things work at all inside the container. My guess is that the default nvidia runtime must be applying some workarounds for use cases where cdi is not enabled, and to apply those workarounds it requires the nvidia-cdi runtime to be present in the containerd config. Since this PR by itself fixes the problem that the nvidia user reported, I'd merge the PR as it is. In the next few days, with more time, I can try to dig deeper and get a better understanding of what's going on (or ideally get help from nvidia).
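For context, a rough sketch of what the relevant containerd config entries look like once the toolkit registers both runtimes; the binary paths and config layout are assumptions based on the standard nvidia-container-toolkit install, not taken from this PR:

```toml
# Illustrative sketch of the generated runtime entries (paths assumed).
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"

# The "cdi disabled by default" mentioned above refers to the CRI plugin's
# enable_cdi setting, which is off by default in containerd 1.7:
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
```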
Ok... If it is handled internally by the nvidia driver and we don't need to be able to reference it by runtimeClassName in pods, then I guess we can skip that for now. That is very unusual though. I'll have to do some more research on how this is working under the hood.
Proposed Changes
Add the nvidia-cdi runtime, so that pods can use the GPU with full functionality.
Types of Changes
Bugfix
Verification
In an environment with a GPU and the OS drivers correctly installed, deploy the gpu operator:
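The exact install command was elided from the description; a typical deployment, assuming the standard NVIDIA helm chart and the cdi flag discussed in the review above, looks roughly like:

```sh
# Illustrative only; chart values beyond the defaults are assumptions.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set cdi.enabled=true
```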
And after some minutes, check that /var/lib/rancher/rke2/agent/etc/containerd/config.toml includes the nvidia runtimes at the bottom (both nvidia and nvidia-cdi). Then create a pod that references the runtime class (the original snippet was elided; see the sketch below) and execute the command mount inside it: you should see /dev/nvidia... devices and /usr/lib/libnvidia libraries being mounted.
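A minimal sketch of such a verification pod, assuming the legacy nvidia runtime class discussed in the review above (the pod name and image are illustrative, not from the PR):

```yaml
# Hypothetical verification pod; the actual manifest was elided from the PR.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia   # handler added to containerd's config
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Then, assuming the pod name above, the mount check would be something like: kubectl exec gpu-test -- mount | grep nvidia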
Testing
Linked Issues
#11087
User-Facing Change
Further Comments