Set NVIDIA_DRIVER_CAPABILITIES to all when GPU is enabled
#19345
Conversation
The committers listed above are authorized under a signed CLA.
Welcome @chubei-urus!
Hi @chubei-urus. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Can one of the admins verify this patch?
I'm new to the repo and don't know how this feature should be tested. Many thanks to anyone who can give some pointers!
Thank you @chubei-urus for creating this PR. Do you mind sharing a before/after-this-PR example of running a workload?
Thank you for your quick reply. I'll create a minimal example. |
/ok-to-test
kvm2 driver with docker runtime
Times for minikube start: 52.0s 46.2s 49.2s 50.8s 50.7s
Times for minikube ingress: 29.0s 27.0s 24.9s 27.0s 24.4s

docker driver with docker runtime
Times for minikube start: 23.9s 23.6s 23.2s 20.9s 23.8s
Times for minikube (PR 19345) ingress: 22.7s 21.8s 21.7s 21.7s 22.7s

docker driver with containerd runtime
Times for minikube start: 22.8s 20.8s 19.9s 23.6s 19.6s
Times for minikube ingress: 48.3s 48.2s 48.2s 48.2s 48.2s
Here are the top 10 failed tests in each environment with the lowest flake rates.
Besides that, the following environments also have failed tests:
To see the flake rates of all tests by environment, click here.
Steps
apiVersion: v1
kind: Pod
metadata:
  name: vulkan
spec:
  containers:
  - name: vulkan
    env:
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "graphics"
    image: dualvtable/vulkan-sample
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never
Before
The logs look like:

After
The logs look like:
Tested on
Note that this is not the workload I was running, but I believe it shows the same issue.
@@ -191,7 +191,7 @@ func CreateContainerNode(p CreateParams) error { //nolint to suppress cyclomatic
 		runArgs = append(runArgs, "--ip", p.IP)
 	}
 	if p.GPUs != "" {
-		runArgs = append(runArgs, "--gpus", "all")
+		runArgs = append(runArgs, "--gpus", "all", "--env", "NVIDIA_DRIVER_CAPABILITIES=all")
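For readers skimming the diff, here is the patched block restated with comments. This is only an annotation of the one-line change above; the capability details come from the NVIDIA documentation quoted later in the thread, not from additional PR code.

```go
if p.GPUs != "" {
	// "--gpus all" exposes the host's GPUs to the kic node container.
	// NVIDIA_DRIVER_CAPABILITIES=all asks the NVIDIA container runtime to enable
	// every driver capability (graphics, video, display, ...) instead of the
	// default "utility, compute" set, which is why the Vulkan sample above failed.
	runArgs = append(runArgs, "--gpus", "all", "--env", "NVIDIA_DRIVER_CAPABILITIES=all")
}
```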
thank you for adding the example, and I found the documentation on this https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.10.0/user-guide.html
you are spot on! It says "empty or unset | use default driver capability: utility, compute". I would love to see the example you provided be added as an integration test, with the condition that it skips when there is no GPU on the machine, to avoid spamming failures on our CI machines.
Sure thing. I'll study how integration tests are implemented a bit and try to do that.
@chubei-urus here is an example of an integration test
you can simply create a new file called
test/integration/gpu_ml_test.go
and create a new test there
and then you can have an if statement to skip the test if a GPU is not available on the test machine, for example
if !hasGPU {
	t.Skip("skipping test since the test machine does not have a GPU")
}
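Expanding that snippet, a minimal sketch of what test/integration/gpu_ml_test.go could look like; the hasGPU probe via nvidia-smi and the test name are illustrative assumptions, not part of this PR.

```go
package integration

import (
	"os/exec"
	"testing"
)

// hasGPU reports whether an NVIDIA GPU looks usable on the host, treating a
// successful nvidia-smi run as evidence (assumption: GPU machines ship nvidia-smi).
func hasGPU() bool {
	return exec.Command("nvidia-smi").Run() == nil
}

// TestGPUWorkload is a placeholder name; the real test would start a cluster
// with --gpus=all, apply the vulkan pod from the reproduction steps above,
// and assert on its logs.
func TestGPUWorkload(t *testing.T) {
	if !hasGPU() {
		t.Skip("skipping test since the test machine does not have a GPU")
	}
}
```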
btw this would also be a good idea for a follow up PR: if the user's machine does not have a GPU and they try to enable the GPU, we could warn them that they are trying to enable --gpus without one (follow up PR; a rough sketch follows after this comment)
let me know if you have any questions
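A rough sketch of that follow-up warning idea; the nvidia-smi probe, the function names, and the message wording are all assumptions, and real minikube code would route this through its own flag parsing and warning helpers.

```go
package main

import (
	"fmt"
	"os/exec"
)

// hasNvidiaGPU treats a successful nvidia-smi run as evidence of a usable GPU (assumption).
func hasNvidiaGPU() bool {
	return exec.Command("nvidia-smi").Run() == nil
}

// warnIfNoGPU prints a warning when --gpus was requested but no GPU is detected.
func warnIfNoGPU(gpusFlag string) {
	if gpusFlag != "" && !hasNvidiaGPU() {
		fmt.Println("Warning: --gpus was requested, but no NVIDIA GPU was detected on this machine")
	}
}

func main() {
	warnIfNoGPU("all")
}
```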
@chubei-urus I could merge this PR, and if you like I would love to see a follow up adding an integration test.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: chubei-urus, medyagh
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Thank you. I'd like an integration test but have been busy with other things.
fixes #19318