Modify readme and deploy

fu7100 · Apr 8, 2021 · bef2a5e
1 parent 61fcd8a

Showing 8 changed files with 58 additions and 66 deletions.
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -3,7 +3,7 @@ name: docker
 on:
   push:
     tags:
-      - v[0-9]+.[0-9]+.[0-9]+.[0-9]
+      - v[0-9]+.[0-9]+.[0-9]+.[0-9]+
 
 jobs:
 

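Reviewer note: the old filter ended in a single `[0-9]`, so a tag whose last component has two or more digits would not have triggered the workflow. The sketch below approximates both filters with Python regular expressions. This is an assumption made for illustration only: GitHub Actions uses its own glob-style filter syntax rather than regex, so the dots are escaped here to stand in for literal dots, and the sample tags are hypothetical.

```python
import re

# Regex approximations of the old and new tag filters (assumption: GitHub
# Actions filters are globs, not regexes; dots are escaped to mean literal dots).
OLD_FILTER = re.compile(r"v[0-9]+\.[0-9]+\.[0-9]+\.[0-9]")
NEW_FILTER = re.compile(r"v[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def matches(pattern: re.Pattern, tag: str) -> bool:
    """Return True when the whole tag name matches the pattern."""
    return pattern.fullmatch(tag) is not None


if __name__ == "__main__":
    # Hypothetical tags used only to compare the two filters.
    for tag in ("v0.9.0.0", "v0.9.0.10", "v0.9.0"):
        print(tag, matches(OLD_FILTER, tag), matches(NEW_FILTER, tag))
```

Under this approximation only the new filter accepts a four-part tag with a multi-digit final component, and both filters still reject three-part tags.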
diff --git a/README.md b/README.md
@@ -2,6 +2,8 @@
 
 [![build status](https://github.com/4paradigm/k8s-device-plugin/actions/workflows/build.yml/badge.svg)](https://github.com/4paradigm/k8s-device-plugin/actions/workflows/build.yml)
 
+[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-device-plugin.svg)](https://hub.docker.com/r/4pdosc/k8s-device-plugin)
+
 * [**Slack Channel**](https://k8s-device-plugin.slack.com/archives/D01S9K5Q04D)
 * [**Discussions**](https://github.com/4paradigm/k8s-device-plugin/discussions)
 
@@ -43,7 +45,7 @@ Test Cases:
 | test id |     case      |   type    |         params          |
 | ------- | :-----------: | :-------: | :---------------------: |
 | 1.1     | Resnet-V2-50  | inference |  batch=50,size=346*346  |
-| 1.2     | Resnet-v2-50  | training  |  batch=20,size=346*346  |
+| 1.2     | Resnet-V2-50  | training  |  batch=20,size=346*346  |
 | 2.1     | Resnet-V2-152 | inference |  batch=10,size=256*256  |
 | 2.2     | Resnet-V2-152 | training  |  batch=10,size=256*256  |
 | 3.1     |    VGG-16     | inference |  batch=20,size=224*224  |
@@ -60,38 +62,16 @@ Test Result: ![img](./imgs/benchmark_inf.png)
 To reproduce:
 
 1. install vGPU-nvidia-device-plugin，and configure properly
-2. kubectl apply -f test.yml，with test.yml as follows:
+2. run benchmark job
 
 ```
- apiVersion: batch/v1
- kind: Job
- metadata:
-   name: ai-benchmark
- spec:
-   template:
-     metadata:
-       name: ai-benchmark
-       labels:
-         qa: test
-     spec:
-       toleration:
-       - key: node.kubernetes.io/disk-pressure
-       containers:
-       - name: testgpu
-         image: m7-ieg-pico-test01:5000/ai-benchmark:latest-gpu
-         command: ["python", "/ai-benchmark/bin/ai-benchmark.py"]
-         resources:
-           requests:
-             nvidia.com/gpu: 1
-           limits:
-             nvidia.com/gpu: 1
-       restartPolicy: Never
+$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
 ```
 
-1. View the result by using kubctl logs
+3. View the result by using kubctl logs
 
 ```
-     kubectl logs [pod id]
+$ kubectl logs [pod id]
 ```
 
 ## Features
@@ -166,7 +146,7 @@ Once you have configured the options above on all the GPU nodes in your
 cluster, remove existing NVIDIA device plugin for Kubernetes if it already exists. Then, you can download our Daemonset yaml file by following command:
 
 ```
-$ wget https://github.com/4paradigm/k8s-device-plugin/blob/vgpu/nvidia-device-plugin.yml
+$ wget https://raw.githubusercontent.com/4paradigm/k8s-device-plugin/master/nvidia-device-plugin.yml
 ```
 
 In this Daemonset file, you can see the container `nvidia-device-plugin-ctr` takes four optional arguments to customize your vGPU support:
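
Reviewer note: the old `wget` target was a `github.com/.../blob/...` URL, which serves the HTML file viewer rather than the YAML itself, so the raw URL is the one that yields a usable manifest. As a minimal sketch, the mapping between the two URL forms can be written as below. The helper name `blob_to_raw` is hypothetical, and it assumes a ref without slashes. The commit also switches the branch from `vgpu` to `master`, which this helper does not do.

```python
from urllib.parse import urlsplit


def blob_to_raw(url: str) -> str:
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    # Expected path layout: /<owner>/<repo>/blob/<ref>/<path...>
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref, *path = segments
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"


if __name__ == "__main__":
    print(blob_to_raw(
        "https://github.com/4paradigm/k8s-device-plugin/blob/vgpu/nvidia-device-plugin.yml"
    ))
```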

diff --git a/README_cn.md b/README_cn.md
@@ -2,6 +2,8 @@
 
 [![build status](https://github.com/4paradigm/k8s-device-plugin/actions/workflows/build.yml/badge.svg)](https://github.com/4paradigm/k8s-device-plugin/actions/workflows/build.yml)
 
+[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-device-plugin.svg)](https://hub.docker.com/r/4pdosc/k8s-device-plugin)
+
 * [**Slack Channel**](https://k8s-device-plugin.slack.com/archives/D01S9K5Q04D)
 * [**Discussions**](https://github.com/4paradigm/k8s-device-plugin/discussions)
 
@@ -25,7 +27,7 @@
 
 ## 关于
 
-**vGPU device plugin** 基于NVIDIA官方插件([NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin))，在保留官方功能的基础上，实现了对物理GPU进行切分，并对显存和计算单元进行限制，从而模拟出多张小的vGPU卡。在k8s集群中，基于这些切分后的vGPU进行调度，使不同的容器可以安全的共享同一张物理GPU，提高GPU的利用率。此外，插件可以对显存做一定的超售处理（使用到的显存可以大于物理上的显存），提高共享任务的数量，进一步提高GPU的利用率，可参考下面的性能测试报告。
+**vGPU device plugin** 基于NVIDIA官方插件([NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin))，在保留官方功能的基础上，实现了对物理GPU进行切分，并对显存和计算单元进行限制，从而模拟出多张小的vGPU卡。在k8s集群中，基于这些切分后的vGPU进行调度，使不同的容器可以安全的共享同一张物理GPU，提高GPU的利用率。此外，插件可以对显存做一定的超卖处理（使用到的显存可以大于物理上的显存），提高共享任务的数量，进一步提高GPU的利用率，可参考下面的性能测试报告。
 
 ## 性能测试
 
@@ -42,7 +44,7 @@
 | test id |     名称      |   类型    |          参数           |
 | ------- | :-----------: | :-------: | :---------------------: |
 | 1.1     | Resnet-V2-50  | inference |  batch=50,size=346*346  |
-| 1.2     | Resnet-v2-50  | training  |  batch=20,size=346*346  |
+| 1.2     | Resnet-V2-50  | training  |  batch=20,size=346*346  |
 | 2.1     | Resnet-V2-152 | inference |  batch=10,size=256*256  |
 | 2.2     | Resnet-V2-152 | training  |  batch=10,size=256*256  |
 | 3.1     |    VGG-16     | inference |  batch=20,size=224*224  |
@@ -58,38 +60,17 @@
 
 测试步骤：
 
-1. 安装nvidia-device-plugin，并配置相应的参数（虚拟比例，显存比例，若虚拟比例*显存比例>1则为超卖）
-2. kubectl apply -f test.yml，其中test.yml如下
+1. 安装nvidia-device-plugin，并配置相应的参数（虚拟比例，显存比例，若显存缩放比例>1则为超卖）
+2. 运行benchmark任务
 
 ```
- kind: Job
- metadata:
-   name: ai-benchmark
- spec:
-   template:
-     metadata:
-       name: ai-benchmark
-       labels:
-         qa: test
-     spec:
-       toleration:
-       - key: node.kubernetes.io/disk-pressure
-       containers:
-       - name: testgpu
-         image: m7-ieg-pico-test01:5000/ai-benchmark:latest-gpu
-         command: ["python", "/ai-benchmark/bin/ai-benchmark.py"]
-         resources:
-           requests:
-             nvidia.com/gpu: 1
-           limits:
-             nvidia.com/gpu: 1
-       restartPolicy: Never
+$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
 ```
 
-1. 通过kubctl logs 查看结果
+3. 通过kubctl logs 查看结果
 
 ```
- kubectl logs [pod id]
+$ kubectl logs [pod id]
 ```
 
 ## 功能
@@ -100,7 +81,7 @@
 
 ## 实验性功能
 
-- 显存超用
+- 显存超卖
 
   vGPU的显存总和可以超过GPU实际的显存，这时候超过的部分会放到内存里，对性能有一定的影响。
 
@@ -153,14 +134,14 @@ $ sudo systemctl restart docker
 }
 ```
 
-> *如果 `runtimes` 字段没有出现, 前往 [nvidia-docker]的安装页面执行安装操作(https://github.com/NVIDIA/nvidia-docker)*
+> *如果 `runtimes` 字段没有出现, 前往的安装页面执行安装操作 [nvidia-docker](https://github.com/NVIDIA/nvidia-docker)*
 
 ### Kubernetes开启vGPU支持
 
 当你在所有GPU节点完成前面提到的准备动作，如果Kubernetes有已经存在的NVIDIA装置插件，需要先将它移除。然后，你能通过下面指令下载我们的Daemonset yaml文件：
 
 ```
-$ wget https://github.com/4paradigm/k8s-device-plugin/blob/vgpu/nvidia-device-plugin.yml
+$ wget https://raw.githubusercontent.com/4paradigm/k8s-device-plugin/master/nvidia-device-plugin.yml
 ```
 
 在这个DaemonSet文件中, 你能发现`nvidia-device-plugin-ctr`容器有一共4个vGPU的客制化参数：
@@ -176,7 +157,7 @@ $ wget https://github.com/4paradigm/k8s-device-plugin/blob/vgpu/nvidia-device-pl
 
 完成这些可选参数的配置后，你能透过下面命令开启vGPU的支持：
 
-```shell
+```
 $ kubectl apply -f nvidia-device-plugin.yml
 ```
 

diff --git a/benchmarks/ai-benchmark/Dockerfile b/benchmarks/ai-benchmark/Dockerfile
@@ -0,0 +1,13 @@
+FROM tensorflow/tensorflow:2.4.1-gpu
+
+RUN apt-get update && apt-get install -y --no-install-recommends apt-utils
+
+RUN pip install --upgrade pip
+
+RUN apt-get -y install git
+RUN git clone -b feat/transformer https://github.com/shiyoubun/ai-benchmark.git
+
+WORKDIR ai-benchmark
+RUN pip install -e .
+
+ENTRYPOINT [ "python", "bin/ai-benchmark.py" ]
diff --git a/benchmarks/ai-benchmark/ai-benchmark.yml b/benchmarks/ai-benchmark/ai-benchmark.yml
@@ -0,0 +1,18 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ai-benchmark
+spec:
+  template:
+    metadata:
+      name: ai-benchmark
+    spec:
+      containers:
+        - name: ai-benchmark
+          image: 4pdosc/ai-benchmark:2.4.1-gpu
+          resources:
+            requests:
+              nvidia.com/gpu: 1
+            limits:
+              nvidia.com/gpu: 1
+      restartPolicy: Never
diff --git a/deployments/helm/nvidia-device-plugin/Chart.yaml b/deployments/helm/nvidia-device-plugin/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: nvidia-device-plugin
 type: application
 description: A Helm chart for the nvidia-device-plugin on Kubernetes
-version: "0.9.0"
-appVersion: "0.9.0"
+version: "0.9.0.0"
+appVersion: "0.9.0.0"
 kubeVersion: ">= 1.10.0-0"
-home: https://github.com/NVIDIA/k8s-device-plugin
+home: https://github.com/4paradigm/k8s-device-plugin
diff --git a/deployments/helm/nvidia-device-plugin/values.yaml b/deployments/helm/nvidia-device-plugin/values.yaml
@@ -17,7 +17,7 @@ namespace: kube-system
 
 imagePullSecrets: []
 image:
-  repository: 4pdosc/k8s-device-plugin
+  repository: 4pdosc/k8s-device-plugin:v0.9.0.0
   pullPolicy: IfNotPresent
   # Overrides the image tag whose default is the chart appVersion.
   tag: "latest"
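
Reviewer note: this hunk puts a tag inside `repository` while `tag: "latest"` remains set. The chart template is not part of this diff, so the following is only a sketch under the assumption that the template joins `image.repository` and `image.tag` with a colon, as many Helm charts do. If that assumption holds, the resulting reference would carry two tags. Both function names are hypothetical.

```python
def compose_image(repository: str, tag: str) -> str:
    """Join repository and tag the way many Helm charts do (assumed here)."""
    return f"{repository}:{tag}"


def has_single_tag(image: str) -> bool:
    """Rough check: the final path segment may contain at most one ':' separator.

    Digest references (name@sha256:...) are deliberately out of scope.
    """
    last_segment = image.rsplit("/", 1)[-1]
    return last_segment.count(":") <= 1


if __name__ == "__main__":
    after_commit = compose_image("4pdosc/k8s-device-plugin:v0.9.0.0", "latest")
    before_commit = compose_image("4pdosc/k8s-device-plugin", "latest")
    print(after_commit, has_single_tag(after_commit))
    print(before_commit, has_single_tag(before_commit))
```

If the template does work this way, keeping `repository` free of a tag and setting `tag` instead would avoid the doubled suffix.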

diff --git a/nvidia-device-plugin.yml b/nvidia-device-plugin.yml
@@ -46,7 +46,7 @@ spec:
       # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
       priorityClassName: "system-node-critical"
       containers:
-      - image: 4pdosc/k8s-device-plugin:v0.9.0.0
+      - image: 4pdosc/k8s-device-plugin
         imagePullPolicy: Always
         name: nvidia-device-plugin-ctr
         args: ["--fail-on-init-error=false", "--device-split-count=4", "--device-memory-scaling=1.2", "--device-cores-scaling=1.2"]