Add support for AMD GPUs via --gpus=amd #19749

Merged
merged 1 commit into from
Oct 12, 2024
48 changes: 48 additions & 0 deletions .github/workflows/update-amd-gpu-device-plugin-version.yml
@@ -0,0 +1,48 @@
name: "update-amd-gpu-device-plugin-version"
on:
  workflow_dispatch:
  schedule:
    # every Monday at around 3 am pacific/10 am UTC
    - cron: "0 10 * * 1"
env:
  GOPROXY: https://proxy.golang.org
  GO_VERSION: '1.23.0'
permissions:
  contents: read

jobs:
  bump-amd-gpu-device-plugin-version:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@d632683dd7b4114ad314bca15554477dd762a938
      - uses: actions/setup-go@0a12ed9d6a96ab950c8f026ed9f722fe0da7ef32
        with:
          go-version: ${{env.GO_VERSION}}
      - name: Bump amd-gpu-device-plugin version
        id: bumpAmdDevicePlugin
        run: |
          echo "OLD_VERSION=$(DEP=amd-gpu-device-plugin make get-dependency-version)" >> "$GITHUB_OUTPUT"
          make update-amd-gpu-device-plugin-version
          echo "NEW_VERSION=$(DEP=amd-gpu-device-plugin make get-dependency-version)" >> "$GITHUB_OUTPUT"
          # The following is to support multiline with GITHUB_OUTPUT, see https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#multiline-strings
          echo "changes<<EOF" >> "$GITHUB_OUTPUT"
          echo "$(git status --porcelain)" >> "$GITHUB_OUTPUT"
          echo "EOF" >> "$GITHUB_OUTPUT"
      - name: Create PR
        if: ${{ steps.bumpAmdDevicePlugin.outputs.changes != '' }}
        uses: peter-evans/create-pull-request@5e914681df9dc83aa4e4905692ca88beb2f9e91f
        with:
          token: ${{ secrets.MINIKUBE_BOT_PAT }}
          commit-message: 'Addon amd-gpu-device-plugin: Update amd/k8s-device-plugin image from ${{ steps.bumpAmdDevicePlugin.outputs.OLD_VERSION }} to ${{ steps.bumpAmdDevicePlugin.outputs.NEW_VERSION }}'
          committer: minikube-bot <[email protected]>
          author: minikube-bot <[email protected]>
          branch: auto_bump_amd_device_plugin_version
          push-to-fork: minikube-bot/minikube
          base: master
          delete-branch: true
          title: 'Addon amd-gpu-device-plugin: Update amd/k8s-device-plugin image from ${{ steps.bumpAmdDevicePlugin.outputs.OLD_VERSION }} to ${{ steps.bumpAmdDevicePlugin.outputs.NEW_VERSION }}'
          labels: ok-to-test
          body: |
            The [k8s-device-plugin](https://github.com/ROCm/k8s-device-plugin) project released a new k8s-device-plugin image
            This PR was auto-generated by `make update-amd-gpu-device-plugin-version` using [update-amd-gpu-device-plugin-version.yml](https://github.com/kubernetes/minikube/tree/master/.github/workflows/update-amd-gpu-device-plugin-version.yml) CI Workflow.
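Note on `OLD_VERSION` / `NEW_VERSION`: both come from `make get-dependency-version`, and the pattern this PR registers for `amd-gpu-device-plugin` in `hack/update/get_version/get_version.go` is `rocm/k8s-device-plugin:(.*)@`. The following is a minimal sketch of that extraction, assuming the tool simply returns the first capture group of the first match. The helper name `extractVersion` is hypothetical and is not taken from the minikube source.

```go
package main

import (
	"fmt"
	"regexp"
)

// extractVersion is a hypothetical helper: it returns the first capture
// group of the first match of pattern in content, or false if none matches.
func extractVersion(content, pattern string) (string, bool) {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatch(content)
	if len(m) < 2 {
		return "", false
	}
	return m[1], true
}

func main() {
	// Line shaped like the image entry this PR adds to addons.go
	// (digest shortened here for readability).
	line := `"AmdDevicePlugin": "rocm/k8s-device-plugin:1.25.2.8@sha256:f383",`
	v, ok := extractVersion(line, `rocm/k8s-device-plugin:(.*)@`)
	fmt.Println(v, ok)
}
```

The capture group stops at the `@` that introduces the digest, so only the tag is reported.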
5 changes: 5 additions & 0 deletions Makefile
@@ -1222,6 +1222,11 @@ update-nvidia-device-plugin-version:
(cd hack/update/nvidia_device_plugin_version && \
go run update_nvidia_device_plugin_version.go)

+.PHONY: update-amd-gpu-device-plugin-version
+update-amd-gpu-device-plugin-version:
+	(cd hack/update/amd_device_plugin_version && \
+	 go run update_amd_device_plugin_version.go)
+
.PHONY: update-nerctld-version
update-nerdctld-version:
(cd hack/update/nerdctld_version && \
1 change: 1 addition & 0 deletions README.md
@@ -35,6 +35,7 @@ As well as developer-friendly features:

* [Addons](https://minikube.sigs.k8s.io/docs/handbook/deploying/#addons) - a marketplace for developers to share configurations for running services on minikube
* [NVIDIA GPU support](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) - for machine learning
+* [AMD GPU support](https://minikube.sigs.k8s.io/docs/tutorials/amd/) - for machine learning
* [Filesystem mounts](https://minikube.sigs.k8s.io/docs/handbook/mount/)

**For more information, see the official [minikube website](https://minikube.sigs.k8s.io)**
4 changes: 2 additions & 2 deletions cmd/minikube/cmd/start.go
@@ -1462,8 +1462,8 @@ func validateGPUs(value, drvName, rtime string) error {
if err := validateGPUsArch(); err != nil {
return err
}
-	if value != "nvidia" && value != "all" {
-		return errors.Errorf(`The gpus flag must be passed a value of "nvidia" or "all"`)
+	if value != "nvidia" && value != "all" && value != "amd" {
+		return errors.Errorf(`The gpus flag must be passed a value of "nvidia", "amd" or "all"`)
}
if drvName == constants.Docker && (rtime == constants.Docker || rtime == constants.DefaultContainerRuntime) {
return nil
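For reference, the value check in this hunk can be exercised in isolation. The sketch below is not the real `validateGPUs`: the function name `validateGPUsValue` is hypothetical, and the architecture, driver and container-runtime checks are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// validateGPUsValue is a hypothetical, standalone version of the value check
// in validateGPUs. It only covers the accepted flag values.
func validateGPUsValue(value string) error {
	switch value {
	case "nvidia", "amd", "all":
		return nil
	default:
		return errors.New(`The gpus flag must be passed a value of "nvidia", "amd" or "all"`)
	}
}

func main() {
	for _, v := range []string{"nvidia", "amd", "all", "cat"} {
		fmt.Printf("%q -> %v\n", v, validateGPUsValue(v))
	}
}
```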
2 changes: 1 addition & 1 deletion cmd/minikube/cmd/start_flags.go
@@ -206,7 +206,7 @@ func initMinikubeFlags() {
startCmd.Flags().Bool(disableOptimizations, false, "If set, disables optimizations that are set for local Kubernetes. Including decreasing CoreDNS replicas from 2 to 1. Defaults to false.")
startCmd.Flags().Bool(disableMetrics, false, "If set, disables metrics reporting (CPU and memory usage), this can improve CPU usage. Defaults to false.")
startCmd.Flags().String(staticIP, "", "Set a static IP for the minikube cluster, the IP must be: private, IPv4, and the last octet must be between 2 and 254, for example 192.168.200.200 (Docker and Podman drivers only)")
-	startCmd.Flags().StringP(gpus, "g", "", "Allow pods to use your NVIDIA GPUs. Options include: [all,nvidia] (Docker driver with Docker container-runtime only)")
+	startCmd.Flags().StringP(gpus, "g", "", "Allow pods to use your GPUs. Options include: [all,nvidia,amd] (Docker driver with Docker container-runtime only)")
startCmd.Flags().Duration(autoPauseInterval, time.Minute*1, "Duration of inactivity before the minikube VM is paused (default 1m0s)")
}

5 changes: 4 additions & 1 deletion cmd/minikube/cmd/start_test.go
@@ -814,7 +814,10 @@ func TestValidateGPUs(t *testing.T) {
{"nvidia", "docker", "", ""},
{"all", "kvm", "docker", "The gpus flag can only be used with the docker driver and docker container-runtime"},
{"nvidia", "docker", "containerd", "The gpus flag can only be used with the docker driver and docker container-runtime"},
-		{"cat", "docker", "docker", `The gpus flag must be passed a value of "nvidia" or "all"`},
+		{"cat", "docker", "docker", `The gpus flag must be passed a value of "nvidia", "amd" or "all"`},
+		{"amd", "docker", "docker", ""},
+		{"amd", "docker", "", ""},
+		{"amd", "docker", "containerd", "The gpus flag can only be used with the docker driver and docker container-runtime"},
}

for _, tc := range tests {
4 changes: 4 additions & 0 deletions deploy/addons/assets.go
@@ -107,6 +107,10 @@ var (
//go:embed gpu/nvidia-gpu-device-plugin.yaml.tmpl
NvidiaGpuDevicePluginAssets embed.FS

+	// AmdGpuDevicePluginAssets assets for amd-gpu-device-plugin addon
+	//go:embed gpu/amd-gpu-device-plugin.yaml.tmpl
+	AmdGpuDevicePluginAssets embed.FS
+
// LogviewerAssets assets for logviewer addon
//go:embed logviewer/*.tmpl logviewer/*.yaml
LogviewerAssets embed.FS
60 changes: 60 additions & 0 deletions deploy/addons/gpu/amd-gpu-device-plugin.yaml.tmpl
@@ -0,0 +1,60 @@
# Copyright 2024 The Kubernetes Authors All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: amd-gpu-device-plugin
  namespace: kube-system
  labels:
    k8s-app: amd-gpu-device-plugin
    kubernetes.io/minikube-addons: amd-gpu-device-plugin
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: amd-gpu-device-plugin
  template:
    metadata:
      labels:
        name: amd-gpu-device-plugin
        k8s-app: amd-gpu-device-plugin
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64
      priorityClassName: system-node-critical
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      volumes:
        - name: dp
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: sys
          hostPath:
            path: /sys
      containers:
        - image: {{.CustomRegistries.AmdDevicePlugin | default .ImageRepository | default .Registries.AmdDevicePlugin }}{{.Images.AmdDevicePlugin}}
          name: amd-gpu-device-plugin
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: dp
              mountPath: /var/lib/kubelet/device-plugins
            - name: sys
              mountPath: /sys
  updateStrategy:
    type: RollingUpdate
57 changes: 57 additions & 0 deletions hack/update/amd_device_plugin_version/update_amd_device_plugin_version.go
@@ -0,0 +1,57 @@
/*
Copyright 2024 The Kubernetes Authors All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/klog/v2"
	"k8s.io/minikube/hack/update"
)

var schema = map[string]update.Item{
	"pkg/minikube/assets/addons.go": {
		Replace: map[string]string{
			`rocm/k8s-device-plugin:.*`: `rocm/k8s-device-plugin:{{.Version}}@{{.SHA}}",`,
		},
	},
}

type Data struct {
	Version string
	SHA     string
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	stable, _, _, err := update.GHReleases(ctx, "ROCm", "k8s-device-plugin")
	if err != nil {
		klog.Fatalf("Unable to get stable version: %v", err)
	}
	sha, err := update.GetImageSHA(fmt.Sprintf("rocm/k8s-device-plugin:%s", stable.Tag))
	if err != nil {
		klog.Fatalf("failed to get image SHA: %v", err)
	}

	data := Data{Version: stable.Tag, SHA: sha}

	update.Apply(schema, data)
}
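To illustrate why the replacement string in `schema` ends with `",`: the pattern `rocm/k8s-device-plugin:.*` runs to the end of the line, so it also consumes the closing quote and comma of the Go string literal in `addons.go`, and the replacement has to put them back. The sketch below assumes `update.Apply` renders each replacement as a `text/template` against `data` and then substitutes every regex match. That is an assumption about its behavior, not something shown in this diff, and `applyReplace` plus the version and digest values are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"text/template"
)

// Data mirrors the struct in the update script above.
type Data struct {
	Version string
	SHA     string
}

// applyReplace is a hypothetical stand-in for one schema entry: render the
// replacement as a template, then substitute every match of pattern in src.
func applyReplace(src, pattern, replacement string, data Data) (string, error) {
	tmpl, err := template.New("replace").Parse(replacement)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	// Literal replacement so a "$" in the rendered text is never expanded.
	return re.ReplaceAllLiteralString(src, buf.String()), nil
}

func main() {
	src := "\"AmdDevicePlugin\": \"rocm/k8s-device-plugin:1.25.2.8@sha256:abc\",\n"
	out, err := applyReplace(src,
		`rocm/k8s-device-plugin:.*`,
		`rocm/k8s-device-plugin:{{.Version}}@{{.SHA}}",`,
		Data{Version: "9.9.9", SHA: "sha256:0000"}) // made-up values
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because `.` does not match a newline by default in Go's `regexp`, the match stops at the end of the image line and neighbouring lines are left alone.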
1 change: 1 addition & 0 deletions hack/update/get_version/get_version.go
@@ -33,6 +33,7 @@ type dependency struct {
}

 var dependencies = map[string]dependency{
+	"amd-gpu-device-plugin": {addonsFile, `rocm/k8s-device-plugin:(.*)@`},
"buildkit": {"deploy/iso/minikube-iso/arch/x86_64/package/buildkit-bin/buildkit-bin.mk", `BUILDKIT_BIN_VERSION = (.*)`},
"calico": {"pkg/minikube/bootstrapper/images/images.go", `calicoVersion = "(.*)"`},
"cilium": {"pkg/minikube/cni/cilium.yaml", `quay.io/cilium/cilium:(.*)@`},
5 changes: 5 additions & 0 deletions pkg/addons/config.go
@@ -131,6 +131,11 @@ var Addons = []*Addon{
validations: []setFn{isKVMDriverForNVIDIA},
callbacks: []setFn{EnableOrDisableAddon},
},
+	{
+		name:      "amd-gpu-device-plugin",
+		set:       SetBool,
+		callbacks: []setFn{EnableOrDisableAddon},
+	},
{
name: "olm",
set: SetBool,
8 changes: 7 additions & 1 deletion pkg/drivers/kic/oci/oci.go
@@ -190,8 +190,14 @@ func CreateContainerNode(p CreateParams) error { //nolint to suppress cyclomatic
runArgs = append(runArgs, "--network", p.Network)
runArgs = append(runArgs, "--ip", p.IP)
}
-	if p.GPUs != "" {
+
+	if p.GPUs == "all" || p.GPUs == "nvidia" {
 		runArgs = append(runArgs, "--gpus", "all", "--env", "NVIDIA_DRIVER_CAPABILITIES=all")
+	} else if p.GPUs == "amd" {
+		/* https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html
+		 * "--security-opt seccomp=unconfined" is also required but included above.
+		 */
+		runArgs = append(runArgs, "--device", "/dev/kfd", "--device", "/dev/dri", "--group-add", "video", "--group-add", "render")
 	}

memcgSwap := hasMemorySwapCgroup()
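The mapping from the `--gpus` value to extra container run arguments in this hunk can be read as a small pure function. This is only a sketch: `gpuRunArgs` is a hypothetical helper that does not exist in `oci.go`, where the arguments are appended inline.

```go
package main

import "fmt"

// gpuRunArgs is a hypothetical helper mirroring the branch added to
// CreateContainerNode: it returns the extra run arguments for a --gpus value.
func gpuRunArgs(gpus string) []string {
	switch gpus {
	case "all", "nvidia":
		return []string{"--gpus", "all", "--env", "NVIDIA_DRIVER_CAPABILITIES=all"}
	case "amd":
		// Expose the ROCm device nodes and join the groups that own them.
		return []string{"--device", "/dev/kfd", "--device", "/dev/dri", "--group-add", "video", "--group-add", "render"}
	default:
		return nil
	}
}

func main() {
	for _, g := range []string{"", "nvidia", "all", "amd"} {
		fmt.Printf("%q -> %v\n", g, gpuRunArgs(g))
	}
}
```

Note that in this diff `all` still selects only the NVIDIA arguments; the AMD device flags are added only for `amd`.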
2 changes: 1 addition & 1 deletion pkg/drivers/kic/oci/types.go
@@ -61,7 +61,7 @@ type CreateParams struct {
OCIBinary string // docker or podman
Network string // network name that the container will attach to
IP string // static IP to assign the container in the cluster network
-	GPUs string // add NVIDIA GPU devices to the container
+	GPUs string // add GPU devices to the container
}

// createOpt is an option for Create
2 changes: 1 addition & 1 deletion pkg/drivers/kic/types.go
@@ -69,5 +69,5 @@ type Config struct {
StaticIP string // static IP for the kic cluster
ExtraArgs []string // a list of any extra option to pass to oci binary during creation time, for example --expose 8080...
ListenAddress string // IP Address to listen to
-	GPUs string // add NVIDIA GPU devices to the container
+	GPUs string // add GPU devices to the container
}
11 changes: 11 additions & 0 deletions pkg/minikube/assets/addons.go
@@ -487,6 +487,17 @@ var Addons = map[string]*Addon{
}, map[string]string{
"NvidiaDevicePlugin": "registry.k8s.io",
 	}),
+	"amd-gpu-device-plugin": NewAddon([]*BinAsset{
+		MustBinAsset(addons.AmdGpuDevicePluginAssets,
+			"gpu/amd-gpu-device-plugin.yaml.tmpl",
+			vmpath.GuestAddonsDir,
+			"amd-gpu-device-plugin.yaml",
+			"0640"),
+	}, false, "amd-gpu-device-plugin", "3rd party (AMD)", "", "https://minikube.sigs.k8s.io/docs/tutorials/amd/", map[string]string{
+		"AmdDevicePlugin": "rocm/k8s-device-plugin:1.25.2.8@sha256:f3835498cf2274e0a07c32b38c166c05a876f8eb776d756cc06805e599a3ba5f",
+	}, map[string]string{
+		"AmdDevicePlugin": "docker.io",
+	}),
"logviewer": NewAddon([]*BinAsset{
MustBinAsset(addons.LogviewerAssets,
"logviewer/logviewer-dp-and-svc.yaml.tmpl",
2 changes: 1 addition & 1 deletion pkg/minikube/cruntime/cruntime.go
@@ -156,7 +156,7 @@ type Config struct {
// InsecureRegistry list of insecure registries
InsecureRegistry []string
// GPUs add GPU devices to the container
-	GPUs bool
+	GPUs string
}

// ListContainersOptions are the options to use for listing containers
8 changes: 6 additions & 2 deletions pkg/minikube/cruntime/docker.go
@@ -75,7 +75,7 @@ type Docker struct {
Init sysinit.Manager
UseCRI bool
CRIService string
-	GPUs       bool
+	GPUs       string
}

// Name is a human readable name for Docker
@@ -580,13 +580,17 @@ func (r *Docker) configureDocker(driver string) error {
},
StorageDriver: "overlay2",
}
-	if r.GPUs {
+
+	if r.GPUs == "all" || r.GPUs == "nvidia" {
 		assets.Addons["nvidia-device-plugin"].EnableByDefault()
 		daemonConfig.DefaultRuntime = "nvidia"
 		runtimes := &dockerDaemonRuntimes{}
 		runtimes.Nvidia.Path = "/usr/bin/nvidia-container-runtime"
 		daemonConfig.Runtimes = runtimes
+	} else if r.GPUs == "amd" {
+		assets.Addons["amd-gpu-device-plugin"].EnableByDefault()
 	}
+
daemonConfigBytes, err := json.Marshal(daemonConfig)
if err != nil {
return err
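One consequence of this hunk is that only the NVIDIA path changes the Docker daemon configuration; the AMD branch just enables the addon. The sketch below shows the resulting JSON under stated assumptions: the struct and function names are hypothetical, and the JSON keys are the standard `daemon.json` keys (`storage-driver`, `default-runtime`, `runtimes`), since the real `dockerDaemonConfig` tags are not visible in this diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types; the real dockerDaemonConfig fields and tags may differ.
type runtimeEntry struct {
	Path string `json:"path"`
}

type daemonRuntimes struct {
	Nvidia runtimeEntry `json:"nvidia"`
}

type daemonConfig struct {
	StorageDriver  string          `json:"storage-driver"`
	DefaultRuntime string          `json:"default-runtime,omitempty"`
	Runtimes       *daemonRuntimes `json:"runtimes,omitempty"`
}

// buildDaemonConfig mirrors the branch above: only "all" and "nvidia"
// switch the default runtime; "amd" leaves the daemon config untouched.
func buildDaemonConfig(gpus string) ([]byte, error) {
	cfg := daemonConfig{StorageDriver: "overlay2"}
	if gpus == "all" || gpus == "nvidia" {
		cfg.DefaultRuntime = "nvidia"
		cfg.Runtimes = &daemonRuntimes{
			Nvidia: runtimeEntry{Path: "/usr/bin/nvidia-container-runtime"},
		}
	}
	return json.Marshal(cfg)
}

func main() {
	for _, g := range []string{"nvidia", "amd"} {
		b, err := buildDaemonConfig(g)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", g, b)
	}
}
```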
2 changes: 1 addition & 1 deletion pkg/minikube/node/start.go
@@ -419,7 +419,7 @@ func configureRuntimes(runner cruntime.CommandRunner, cc config.ClusterConfig, k
InsecureRegistry: cc.InsecureRegistry,
}
if cc.GPUs != "" {
-		co.GPUs = true
+		co.GPUs = cc.GPUs
}
cr, err := cruntime.New(co)
if err != nil {
2 changes: 1 addition & 1 deletion site/content/en/docs/commands/start.md
@@ -57,7 +57,7 @@ minikube start [flags]
--feature-gates string A set of key=value pairs that describe feature gates for alpha/experimental features.
--force Force minikube to perform possibly dangerous operations
--force-systemd If set, force the container runtime to use systemd as cgroup manager. Defaults to false.
-  -g, --gpus string Allow pods to use your NVIDIA GPUs. Options include: [all,nvidia] (Docker driver with Docker container-runtime only)
+  -g, --gpus string Allow pods to use your GPUs. Options include: [all,nvidia,amd] (Docker driver with Docker container-runtime only)
--ha Create Highly Available Multi-Control Plane Cluster with a minimum of three control-plane nodes that will also be marked for work.
--host-dns-resolver Enable host resolver for NAT DNS requests (virtualbox driver only) (default true)
--host-only-cidr string The CIDR to be used for the minikube VM (virtualbox driver only) (default "192.168.59.1/24")
3 changes: 3 additions & 0 deletions site/content/en/docs/contrib/tests.en.md
@@ -65,6 +65,9 @@ tests disabling an addon on a non-existing cluster
#### validateNvidiaDevicePlugin
tests the nvidia-device-plugin addon by ensuring the pod comes up and the addon disables

+#### validateAmdGpuDevicePlugin
+tests the amd-gpu-device-plugin addon by ensuring the pod comes up and the addon disables
+
#### validateYakdAddon

## TestCertOptions