Automate installing NVIDIA Container Toolkit --container-runtime #17287

Closed
wants to merge 4 commits into from


spowelljr
Member

@spowelljr spowelljr commented Sep 20, 2023

$ minikube start --container-runtime nvidia-docker
😄  minikube v1.31.2 on Debian rodete
✨  Automatically selected the docker driver
📌  Using Docker driver with root privileges
👍  Starting control plane node minikube in cluster minikube
🚜  Pulling base image ...
🔥  Creating docker container (CPUs=2, Memory=32100MB) ...
🛠️   Installing the NVIDIA Container Toolkit...
🐳  Preparing Kubernetes v1.28.2 on Docker 24.0.6 ...
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗  Configuring CNI (Container Networking Interface) ...
🔎  Verifying Kubernetes components...
    ▪ Using image nvcr.io/nvidia/k8s-device-plugin:v0.14.1
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: nvidia-device-plugin, storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["nvidia-smi"]
    resources:
      limits:
         nvidia.com/gpu: "1"
EOF
pod/nvidia-version-check created

$ kubectl logs nvidia-version-check
Fri Sep 22 18:45:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P1000        On   | 00000000:65:00.0 Off |                  N/A |
| 34%   26C    P8    N/A /  47W |     15MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

$ minikube start --driver kvm --container-runtime nvidia-docker
😄  minikube v1.31.2 on Debian rodete (kvm/amd64)
✨  Using the kvm2 driver based on user configuration

❌  Exiting due to MK_USAGE: The nvidia-docker container-runtime can only be run with the docker driver
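
The last command above shows the restriction this PR enforces: the `nvidia-docker` container runtime is rejected unless the driver is `docker`. A minimal Go sketch of such a check follows. The helper name `validateRuntimeDriver` and its signature are hypothetical and are not taken from minikube's code; only the error text is copied from the output above.

```go
package main

import (
	"errors"
	"fmt"
)

// nvidiaDriverMsg is the message shown in the MK_USAGE output above.
const nvidiaDriverMsg = "The nvidia-docker container-runtime can only be run with the docker driver"

// validateRuntimeDriver is a hypothetical helper (not minikube's actual API).
// It rejects the nvidia-docker runtime for every driver except docker.
func validateRuntimeDriver(driver, runtime string) error {
	if runtime == "nvidia-docker" && driver != "docker" {
		return errors.New(nvidiaDriverMsg)
	}
	return nil
}

func main() {
	for _, driver := range []string{"docker", "kvm2"} {
		if err := validateRuntimeDriver(driver, "nvidia-docker"); err != nil {
			fmt.Printf("driver=%s: MK_USAGE: %v\n", driver, err)
			continue
		}
		fmt.Printf("driver=%s: ok\n", driver)
	}
}
```

Running the check before any cluster resources are created would match the behavior shown above, where the kvm2 run exits immediately after driver selection.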

@k8s-ci-robot added the labels do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) Sep 20, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spowelljr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the labels approved (indicates a PR has been approved by an approver from all required OWNERS files) and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) Sep 20, 2023
@k8s-ci-robot added the label size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the label size/M (denotes a PR that changes 30-99 lines, ignoring generated files) Sep 21, 2023
@spowelljr force-pushed the gpus branch 3 times, most recently from a569fbb to 091ff2d September 25, 2023 17:54
@spowelljr changed the title from "WIP: Automate installing NVIDIA Container Toolkit" to "Automate installing NVIDIA Container Toolkit" Sep 25, 2023
@k8s-ci-robot removed the label do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) Sep 25, 2023
@spowelljr
Member Author

/ok-to-test

@k8s-ci-robot added the label ok-to-test (indicates a non-member PR verified by an org member that is safe to test) Sep 25, 2023
@minikube-pr-bot

This comment has been minimized.

@minikube-pr-bot

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17287) |
+----------------+----------+---------------------+
| minikube start | 50.2s    | 50.3s               |
| enable ingress | 27.0s    | 28.0s               |
+----------------+----------+---------------------+

Times for minikube start: 48.2s 51.3s 50.3s 50.4s 51.0s
Times for minikube (PR 17287) start: 50.1s 51.7s 50.8s 51.8s 47.0s

Times for minikube ingress: 27.1s 25.2s 27.7s 26.7s 28.1s
Times for minikube (PR 17287) ingress: 27.6s 27.7s 27.2s 29.2s 28.6s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17287) |
+----------------+----------+---------------------+
| minikube start | 24.4s    | 23.9s               |
| enable ingress | 21.0s    | 21.0s               |
+----------------+----------+---------------------+

Times for minikube start: 24.7s 25.1s 22.1s 24.1s 25.7s
Times for minikube (PR 17287) start: 25.1s 22.3s 25.1s 21.7s 25.1s

Times for minikube ingress: 20.8s 21.3s 21.3s 20.8s 20.8s
Times for minikube (PR 17287) ingress: 20.8s 20.8s 20.9s 22.8s 19.8s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17287) |
+----------------+----------+---------------------+
| minikube start | 23.3s    | 22.8s               |
| enable ingress | 32.0s    | 29.1s               |
+----------------+----------+---------------------+

Times for minikube start: 23.4s 20.8s 23.8s 24.7s 23.7s
Times for minikube (PR 17287) start: 24.3s 21.4s 24.2s 23.3s 21.0s

Times for minikube ingress: 31.3s 31.3s 47.3s 18.4s 31.4s
Times for minikube (PR 17287) ingress: 31.4s 20.3s 31.3s 31.3s 31.3s
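
The table values above are consistent with the arithmetic mean of the five listed samples. A short Go sketch, using the kvm2 `minikube start` samples copied from this comment, illustrates the arithmetic. That the bot uses a plain mean is an assumption; some other rows differ from the mean of the listed samples by 0.1s, presumably because the samples are rounded for display.

```go
package main

import "fmt"

// mean returns the arithmetic mean of the samples, or 0 for an empty slice.
func mean(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

func main() {
	// Start times in seconds, copied from the kvm2 driver with docker runtime run.
	base := []float64{48.2, 51.3, 50.3, 50.4, 51.0}
	pr := []float64{50.1, 51.7, 50.8, 51.8, 47.0}

	fmt.Printf("minikube start: %.1fs\n", mean(base))
	fmt.Printf("minikube (PR 17287) start: %.1fs\n", mean(pr))
}
```

For these two rows the program prints 50.2s and 50.3s, matching the table.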

@minikube-pr-bot

These are the flake rates of all failed tests.

| Environment | Failed Tests | Flake Rate (%) |
|---|---|---|
| KVM_Linux_crio | TestStartStop/group/no-preload/serial/Pause (gopogh) | n/a |
| KVM_Linux_crio | TestStartStop/group/no-preload/serial/VerifyKubernetesImages (gopogh) | n/a |
| Docker_Linux_containerd | TestKVMDriverInstallOrUpdate (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/CertSync (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/DockerEnv/powershell (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageBuild (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageListJson (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageListShort (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageListTable (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageListYaml (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageLoadDaemon (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageLoadFromFile (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageReloadDaemon (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageSaveToFile (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ImageCommands/ImageTagAndLoadDaemon (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/MySQL (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/NodeLabels (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/PersistentVolumeClaim (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmdConnect (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/DeployApp (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/Format (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/HTTPS (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/JSONOutput (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/List (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/ServiceCmd/URL (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/StatusCmd (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/TunnelCmd/serial/RunSecondTunnel (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/TunnelCmd/serial/WaitService/Setup (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/parallel/UpdateContextCmd/no_changes (gopogh) | 0.00 (chart) |
| Hyper-V_Windows | TestFunctional/serial/CacheCmd/cache/cache_reload (gopogh) | 0.00 (chart) |
More tests... Continued...

Too many tests failed - See test logs for more details.

To see the flake rates of all tests by environment, click here.

Member

@medyagh medyagh left a comment

Let's try to make another PR that keeps the number of container runtimes at three and adds a flag called --enable-nvidia. That way, if we later support AMD GPUs, or containerd and CRI-O, we could enable it for them.
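
The suggestion in this review can be sketched as a separate boolean flag next to an unchanged runtime list. The Go example below uses only the standard `flag` package for illustration; `startConfig`, `parseStartFlags`, and the flag wiring are hypothetical and are not minikube's implementation. The `--enable-nvidia` name comes from the comment, and the three runtime names are the existing `--container-runtime` values.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// supportedRuntimes is the three-runtime set the review wants to keep unchanged.
var supportedRuntimes = map[string]bool{"docker": true, "containerd": true, "cri-o": true}

// startConfig is a hypothetical holder for the two settings discussed in the review.
type startConfig struct {
	ContainerRuntime string
	EnableNvidia     bool
}

// parseStartFlags is a hypothetical parser: GPU support is a boolean flag,
// so it stays independent of which runtime is selected.
func parseStartFlags(args []string) (startConfig, error) {
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	var cfg startConfig
	fs.StringVar(&cfg.ContainerRuntime, "container-runtime", "docker", "container runtime to use")
	fs.BoolVar(&cfg.EnableNvidia, "enable-nvidia", false, "hypothetical: enable NVIDIA GPU support")
	if err := fs.Parse(args); err != nil {
		return startConfig{}, err
	}
	if !supportedRuntimes[cfg.ContainerRuntime] {
		return startConfig{}, fmt.Errorf("unsupported container runtime %q", cfg.ContainerRuntime)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseStartFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("runtime=%s nvidia=%t\n", cfg.ContainerRuntime, cfg.EnableNvidia)
}
```

With this shape, `nvidia-docker` is no longer a fourth runtime value, and a later GPU vendor could get its own flag without touching the runtime list.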

@medyagh changed the title from "Automate installing NVIDIA Container Toolkit" to "Automate installing NVIDIA Container Toolkit --container-runtime" Oct 2, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the label needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD) Oct 4, 2023
@spowelljr closed this Oct 5, 2023
@spowelljr deleted the gpus branch October 24, 2023 18:24
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
4 participants