container-runtime: add nvidia-docker #15927

d4l3k · 2023-02-26T08:08:24Z

This adds a new container-runtime that sets the correct configuration options for using with https://github.com/NVIDIA/k8s-device-plugin#nvidia-device-plugin-for-kubernetes

This requires a custom Dockerfile with nvidia-container-toolkit and a matching libnvidia-ml.so.1 file. The driver version on the host needs to exactly match the version of nvml in the container.

kicbase Dockerfile

FROM gcr.io/k8s-minikube/kicbase:v0.0.37

RUN curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
RUN curl -s -L https://nvidia.github.io/libnvidia-container/$(. /etc/os-release;echo $ID$VERSION_ID)/libnvidia-container.list | tee /etc/apt/sources.list.d/libnvidia-container.list
RUN apt-get update && sudo apt-get install -y nvidia-container-toolkit && rm -rf /var/lib/apt/lists/*

ADD libnvidia-ml.so.1 /usr/lib/libnvidia-ml.so.1

Commands to run:

# Build docker image
cp /usr/lib/libnvidia-ml.so.1 .
docker build -t nvidiakic .

# Example start minikube with nvidia-docker
go run ./cmd/minikube start --container-runtime nvidia-docker --base-image='nvidiakic' --iso-url='https://storage.googleapis.com/minikube/iso/minikube-v1.29.0-amd64.iso,https://github.com/kubernetes/minikube/releases/download/v1.29.0/minikube-v1.29.0-amd64.iso' --driver docker --cpus=max --memory=max --nodes=2

# install the normal k8s-deviced-plugin (not the minikube addon!)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

Related issue #10229

linux-foundation-easycla · 2023-02-26T08:08:30Z

The committers listed above are authorized under a signed CLA.

✅ login: d4l3k / name: Tristan Rice (2c3b698)
✅ login: spowelljr / name: Steven Powell (262f8ce, 2a1f5b9)

k8s-ci-robot · 2023-02-26T08:08:32Z

Welcome @d4l3k!

It looks like this is your first PR to kubernetes/minikube 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/minikube has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2023-02-26T08:08:33Z

Hi @d4l3k. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

minikube-bot · 2023-02-26T08:12:05Z

Can one of the admins verify this patch?

d4l3k · 2023-02-26T08:12:44Z

I just signed the CLA, hasn't updated yet though

d4l3k · 2023-02-26T17:44:58Z

@sharifelgamal when you get a chance could you take a look at this PR?

I'm also wondering if you have any suggestions on how to handle the nvidia dependencies. I assume it doesn't make sense to add them to kicbase? Also the libnvidia-ml needs to match the host. We could try and grab it at runtime and overlay it in the container but that's pretty hacky. For my use case, right now building an custom kicbase is an acceptable step so this PR is sufficient

Thanks!

sazzy4o · 2023-02-28T05:00:04Z

The nvidia oci hook creates the following mounts (on my ubuntu machine), so I don't think it is crazy to do overlays (seems like that is what nvidia is doing for their integrations):

--mount=type=bind,source=/usr/bin/nvidia-smi,destination=/usr/bin/nvidia-smi,ro=true \
--mount=type=bind,source=/usr/bin/nvidia-debugdump,destination=/usr/bin/nvidia-debugdump,ro=true \
--mount=type=bind,source=/usr/bin/nvidia-persistenced,destination=/usr/bin/nvidia-persistenced,ro=true \
--mount=type=bind,source=/usr/bin/nvidia-cuda-mps-control,destination=/usr/bin/nvidia-cuda-mps-control,ro=true \
--mount=type=bind,source=/usr/bin/nvidia-cuda-mps-server,destination=/usr/bin/nvidia-cuda-mps-server,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.78.01,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.78.01,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libcuda.so,destination=/usr/lib/x86_64-linux-gnu/libcuda.so.525.78.01,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.78.01,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.78.01,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libcuda.so,destination=/usr/lib/x86_64-linux-gnu/libcuda.so.1,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libcuda.so,destination=/usr/lib/x86_64-linux-gnu/libcuda.so,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so,ro=true \
--mount=type=bind,source=/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so,destination=/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so,ro=true \
--mount=type=bind,source=/run/nvidia-persistenced/socket,destination=/run/nvidia-persistenced/socket \
--mount=type=bind,source=/dev/nvidiactl,destination=/dev/nvidiactl \
--mount=type=bind,source=/dev/nvidia-uvm,destination=/dev/nvidia-uvm \
--mount=type=bind,source=/dev/nvidia-uvm-tools,destination=/dev/nvidia-uvm-tools \
--mount=type=bind,source=/dev/nvidia0,destination=/dev/nvidia0 \
--mount=type=bind,source=/proc/driver/nvidia/gpus/0000:65:00.0,destination=/proc/driver/nvidia/gpus/0000:65:00.0 \
--mount=type=tmpfs,destination=/proc/driver/nvidia \

k8s-ci-robot · 2023-03-23T00:36:14Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

anthonyalayo · 2023-06-07T07:57:18Z

What's left here? Merging this would be great!

medyagh · 2023-09-06T18:49:49Z

@d4l3k sorry for the long delay in PR review, I would llke to know how this PR is different from nivdia addon ? can nvidia addon be enabled with this PR as well ?
https://minikube.sigs.k8s.io/docs/handbook/addons/nvidia/

also I would like you to contribute the kicbase changes as well so we could test it too

d4l3k · 2023-09-06T19:01:56Z

Kvm requires the GPU to use pcie pass through to the underlying VM. This PR instead makes the gpu device available with the host gpu driver so it can be shared between the host and minikube workers

spowelljr · 2023-09-11T23:01:14Z

Hi @d4l3k, I tried your example and it seems to work. Is there a way I can confirm that the pods have access to the GPUs? I tried using TensorFlow but was getting:

2023-09-11 22:46:18.090397: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

d4l3k · 2023-09-11T23:10:45Z

@spowelljr do you have access to the nvidia-smi command line within the container? That should tell you if you can access the GPUs or not

I was testing this with TorchX https://pytorch.org/torchx/latest/quickstart.html but that requires some familiarity with pytorch to get started

d4l3k · 2023-09-11T23:13:55Z

I wonder if there's other better options here as well via some of the other runtimes that might be easier to integrate https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuration

spowelljr · 2023-09-12T00:05:04Z

I tried getting nvidia-smi in the container but haven't been successful yet.

I see examples like this (https://jacobtomlinson.dev/posts/2022/how-to-check-your-nvidia-driver-and-cuda-version-in-kubernetes/)

But when I try nvidia-smi is not present, maybe they re-pushed the imaged with changes?

I tried installing the nvidia driver in a pod but got Failed to initialize NVML: Unknown Error when running nvidia-smi

I'm new to running GPUs and AI/ML workloads in Kubernetes, so I'm welcome to any tips you may have

d4l3k · 2023-09-12T00:12:17Z

@spowelljr have you tried getting nvidia-smi to work under just nvidia-docker as a first step? Once you have a good repro can then try it in minikube as well

spowelljr · 2023-09-14T16:59:07Z

@d4l3k I was able to get nvidia-smi to run in a container and have Tensorflow pickup the GPU, but I needed to do a couple more steps.

I had to install NVIDIA Container Toolkit on my host machine
I had to add --gpus=all to the command that starts the minikube container

Then I was able to use your PR to start minikube

d4l3k · 2023-09-15T00:51:09Z

Ahh nice! That makes sense, need it installed on the host to mount it and also need to set a Docker flags

spowelljr · 2023-09-19T16:24:21Z

/ok-to-test

spowelljr · 2023-09-19T17:51:35Z

@d4l3k If the tests look good I'll merge this PR. I'm working on a follow up PR that will make this work via a flag when starting minikube. I've discovered that I was able to get it to work without ADD libnvidia-ml.so.1 /usr/lib/libnvidia-ml.so.1 which could simplify the flow. Could you confirm if you're able to get it to work without it or why you needed it.

d4l3k · 2023-09-19T18:22:06Z

@spowelljr thanks for pushing this through! Looks good to me :)

As for the ADD libnvidia-ml.so.1 it's been a while but I believe this was due to the fact that I had a different CUDA/driver version in the container compared to my host machine. I was using Arch Linux which had a newer CUDA and driver version than the one available in the container OS. If you were running the same distribution on host and container it may have avoided that issue?

It may also be nvidia-container-toolkit takes care of it now and it's no longer necessary

spowelljr · 2023-09-20T16:12:49Z

/retest-this-please

minikube-pr-bot · 2023-09-20T17:34:54Z

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 15927) |
+----------------+----------+---------------------+
| minikube start | 50.9s    | 50.4s               |
| enable ingress | 28.1s    | 28.5s               |
+----------------+----------+---------------------+

Times for minikube (PR 15927) start: 50.2s 50.6s 51.7s 49.5s 49.6s
Times for minikube start: 52.2s 50.9s 51.3s 49.9s 50.2s

Times for minikube ingress: 27.7s 28.1s 27.7s 28.6s 28.2s
Times for minikube (PR 15927) ingress: 27.2s 28.1s 29.1s 28.6s 29.6s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 15927) |
+----------------+----------+---------------------+
| minikube start | 22.5s    | 23.0s               |
| enable ingress | 21.0s    | 20.8s               |
+----------------+----------+---------------------+

Times for minikube start: 24.4s 23.3s 21.6s 21.1s 21.8s
Times for minikube (PR 15927) start: 23.9s 21.7s 22.0s 21.7s 25.8s

Times for minikube (PR 15927) ingress: 20.8s 20.8s 20.8s 20.8s 20.8s
Times for minikube ingress: 20.9s 20.8s 21.8s 20.8s 20.8s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 15927) |
+----------------+----------+---------------------+
| minikube start | 22.0s    | 21.5s               |
| enable ingress | 34.1s    | 34.7s               |
+----------------+----------+---------------------+

Times for minikube start: 20.9s 21.1s 20.6s 24.1s 23.2s
Times for minikube (PR 15927) start: 19.6s 20.5s 23.6s 20.4s 23.3s

Times for minikube ingress: 27.4s 49.4s 31.3s 31.3s 31.3s
Times for minikube (PR 15927) ingress: 47.3s 31.3s 32.3s 31.3s 31.3s

minikube-pr-bot · 2023-09-20T19:05:38Z

These are the flake rates of all failed tests.

Environment	Failed Tests	Flake Rate (%)
KVM_Linux_containerd	TestErrorSpam/setup (gopogh)	0.00 (chart)
KVM_Linux	TestCertOptions (gopogh)	0.56 (chart)
KVM_Linux	TestStartStop/group/old-k8s-version/serial/VerifyKubernetesImages (gopogh)	0.56 (chart)
Hyperkit_macOS	TestStartStop/group/old-k8s-version/serial/VerifyKubernetesImages (gopogh)	0.58 (chart)
Hyperkit_macOS	TestAddons/Setup (gopogh)	2.33 (chart)
Hyperkit_macOS	TestJSONOutput/start/parallel/DistinctCurrentSteps (gopogh)	2.33 (chart)
Hyperkit_macOS	TestJSONOutput/start/parallel/IncreasingCurrentSteps (gopogh)	2.33 (chart)
Hyperkit_macOS	TestMinikubeProfile (gopogh)	16.28 (chart)

To see the flake rates of all tests by environment, click here.

spowelljr

Thanks for the PR!

k8s-ci-robot · 2023-09-20T21:09:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: d4l3k, spowelljr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [spowelljr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

container-runtime: add nvidia-docker

2c3b698

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 26, 2023

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 26, 2023

k8s-ci-robot requested review from afbjorklund and medyagh February 26, 2023 08:09

d4l3k mentioned this pull request Feb 26, 2023

NVIDIA support with docker driver #10229

Open

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 26, 2023

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 23, 2023

Merge branch 'master' into master

262f8ce

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 19, 2023

fix rebased test

2a1f5b9

spowelljr removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2023

This comment has been minimized.

Sign in to view

spowelljr approved these changes Sep 20, 2023

View reviewed changes

spowelljr merged commit 075f1a1 into kubernetes:master Sep 20, 2023
13 of 15 checks passed

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

container-runtime: add nvidia-docker #15927

container-runtime: add nvidia-docker #15927

d4l3k commented Feb 26, 2023

linux-foundation-easycla bot commented Feb 26, 2023 •

edited

Loading

k8s-ci-robot commented Feb 26, 2023

k8s-ci-robot commented Feb 26, 2023

minikube-bot commented Feb 26, 2023

d4l3k commented Feb 26, 2023

d4l3k commented Feb 26, 2023

sazzy4o commented Feb 28, 2023 •

edited

Loading

k8s-ci-robot commented Mar 23, 2023

anthonyalayo commented Jun 7, 2023

medyagh commented Sep 6, 2023 •

edited

Loading

d4l3k commented Sep 6, 2023

spowelljr commented Sep 11, 2023

d4l3k commented Sep 11, 2023

d4l3k commented Sep 11, 2023

spowelljr commented Sep 12, 2023

d4l3k commented Sep 12, 2023

spowelljr commented Sep 14, 2023

d4l3k commented Sep 15, 2023

spowelljr commented Sep 19, 2023

This comment has been minimized.

spowelljr commented Sep 19, 2023

d4l3k commented Sep 19, 2023

This comment has been minimized.

spowelljr commented Sep 20, 2023

minikube-pr-bot commented Sep 20, 2023

minikube-pr-bot commented Sep 20, 2023

spowelljr left a comment

k8s-ci-robot commented Sep 20, 2023

container-runtime: add nvidia-docker #15927

container-runtime: add nvidia-docker #15927

Conversation

d4l3k commented Feb 26, 2023

linux-foundation-easycla bot commented Feb 26, 2023 • edited Loading

k8s-ci-robot commented Feb 26, 2023

k8s-ci-robot commented Feb 26, 2023

minikube-bot commented Feb 26, 2023

d4l3k commented Feb 26, 2023

d4l3k commented Feb 26, 2023

sazzy4o commented Feb 28, 2023 • edited Loading

k8s-ci-robot commented Mar 23, 2023

anthonyalayo commented Jun 7, 2023

medyagh commented Sep 6, 2023 • edited Loading

d4l3k commented Sep 6, 2023

spowelljr commented Sep 11, 2023

d4l3k commented Sep 11, 2023

d4l3k commented Sep 11, 2023

spowelljr commented Sep 12, 2023

d4l3k commented Sep 12, 2023

spowelljr commented Sep 14, 2023

d4l3k commented Sep 15, 2023

spowelljr commented Sep 19, 2023

This comment has been minimized.

spowelljr commented Sep 19, 2023

d4l3k commented Sep 19, 2023

This comment has been minimized.

spowelljr commented Sep 20, 2023

minikube-pr-bot commented Sep 20, 2023

minikube-pr-bot commented Sep 20, 2023

spowelljr left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Sep 20, 2023

linux-foundation-easycla bot commented Feb 26, 2023 •

edited

Loading

sazzy4o commented Feb 28, 2023 •

edited

Loading

medyagh commented Sep 6, 2023 •

edited

Loading