diff --git a/docs/proposals/2170-kubeflow-training-v2/README.md b/docs/proposals/2170-kubeflow-training-v2/README.md
index 0f2c3a2cc1..46c5d1054a 100644
--- a/docs/proposals/2170-kubeflow-training-v2/README.md
+++ b/docs/proposals/2170-kubeflow-training-v2/README.md
@@ -102,9 +102,10 @@ We propose these APIs:
 to configure infrastructure parameters that are required for the **TrainJob**.
 For example, failure policy or gang-scheduling.
 
-The below diagram shows which resources will be created for LLM fine-tuning with PyTorch.
+The below diagram shows that platform engineers manage `TrainingRuntime` and data scientists create
+`TrainJob`:
 
-![trainjob-diagram](./trainjob-diagram.jpg)
+![user-roles](./user-roles.drawio.svg)
 
 ### Worker and Node Definition
 
@@ -409,6 +410,10 @@ spec:
 path: custom-datasets/yelp-review
 ```
 
+The below diagram shows which resources will be created for LLM fine-tuning with PyTorch:
+
+![trainjob-diagram](./trainjob-diagram.drawio.svg)
+
 ### The Trainer Config API
 
 The `TrainerConfig` represents the APIs that data scientists can use to configure trainer settings:
diff --git a/docs/proposals/2170-kubeflow-training-v2/trainjob-diagram.drawio.svg b/docs/proposals/2170-kubeflow-training-v2/trainjob-diagram.drawio.svg
new file mode 100644
index 0000000000..74e041dca0
--- /dev/null
+++ b/docs/proposals/2170-kubeflow-training-v2/trainjob-diagram.drawio.svg
@@ -0,0 +1,4 @@
+
+
+
+
[SVG markup not shown; text labels in trainjob-diagram.drawio.svg: TrainJob, JobSet, Storage, Initializer, PyTorch Nodes, Worker (x9), Headless Service, Download Pre-trained Model, Download Dataset, LLM Training Runtime]
\ No newline at end of file
diff --git a/docs/proposals/2170-kubeflow-training-v2/trainjob-diagram.jpg b/docs/proposals/2170-kubeflow-training-v2/trainjob-diagram.jpg
deleted file mode 100644
index 08ab3508a1..0000000000
Binary files a/docs/proposals/2170-kubeflow-training-v2/trainjob-diagram.jpg and /dev/null differ
diff --git a/docs/proposals/2170-kubeflow-training-v2/user-roles.drawio.svg b/docs/proposals/2170-kubeflow-training-v2/user-roles.drawio.svg
new file mode 100644
index 0000000000..e673c037bc
--- /dev/null
+++ b/docs/proposals/2170-kubeflow-training-v2/user-roles.drawio.svg
@@ -0,0 +1,4 @@
+
+
+
+
[SVG markup not shown; text labels in user-roles.drawio.svg: Platform Engineer, Kubeflow Python SDK, TrainJob, kubectl, Data Scientist, Create TrainJob, JobSet, Training Runtime, Training Nodes, Headless Service, Manage Runtime, Fetch Spec]
\ No newline at end of file