diff --git a/docs/proposals/2170-kubeflow-training-v2/README.md b/docs/proposals/2170-kubeflow-training-v2/README.md
index 983c706db6..f3628ec60c 100644
--- a/docs/proposals/2170-kubeflow-training-v2/README.md
+++ b/docs/proposals/2170-kubeflow-training-v2/README.md
@@ -178,11 +178,11 @@ the following table explains the naming that each framework or technology uses:
 Slot

-(--np)
+(-n)
 
 Node

-(--host)
+(-host)
 
 mpirun
@@ -1604,7 +1604,36 @@ spec:
 
 #### MPI Runtime
 
-For MPI, we can add support for the `DeepSpeed` runtimes.
+We will re-use [the MPI Operator V2](https://github.com/kubeflow/mpi-operator/blob/master/proposals/scalable-robust-operator.md)
+functionality as part of this MPI Runtime, which means we will use the SSH-based approach to
+initialize the MPI Job.
+
+The MPI Plugin in the Training Operator will be responsible for the following:
+
+- Build the Secret with the SSH keys.
+- Build the ConfigMap with the appropriate hostfile for OpenMPI, IntelMPI, or MPICH. We will support
+  only **OpenMPI** in the first implementation.
+
+The Secret and ConfigMap will be added to the corresponding JobSet.
+
+The default hostfile location is `/etc/mpi/hostfile`. For example, for OpenMPI we configure this
+environment variable:
+
+```bash
+OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile
+```
+
+The `numProcPerNode` value is equal to the number of slots in the MPI hostfile.
+
+Example of a hostfile:
+
+```
+deepspeed-trainer-node-0-0.default.svc slots=5
+deepspeed-trainer-node-0-1.default.svc slots=5
+```
+
+Initially, we will introduce support for [distributed MLX](https://ml-explore.github.io/mlx/build/html/usage/distributed.html)
+and [DeepSpeed](https://www.deepspeed.ai/training/) using the MPI Runtime.
 
 Example of simple OpenMPI runtime:
 
@@ -1612,16 +1641,19 @@ Example of simple OpenMPI runtime:
 apiVersion: kubeflow.org/v2alpha1
 kind: ClusterTrainingRuntime
 metadata:
-  name: mpi-simple
+  name: deepspeed
+  namespace: default
 spec:
   mlPolicy:
-    numNodes: 5
+    numNodes: 2
     mpi:
       mpiImplementation: OpenMPI
       numProcPerNode: 5
   template:
+    startupPolicy:
+      startupPolicyOrder: InOrder
     replicatedJobs:
-      - name: Launcher
+      - name: launcher
         template:
           spec:
             template:
@@ -1630,17 +1662,18 @@ spec:
               containers:
                 - name: mpi-launcher
                   image: docker.io/mpi-launch
                   command:
-                    - mpirun -np 5 --host mpi-simple.default.svc
-      - name: Node
+                    - mpirun launch-job
+      - name: trainer-node
         template:
           spec:
             template:
              spec:
                containers:
                  - name: trainer
-                    image: docker.io/mpi-training
-                    command:
-                      - mpirun -np 2 train.py
+                    image: docker.io/deepspeed-trainer
+                    resources:
+                      limits:
+                        nvidia.com/gpu: 5
 ```
 
 #### TensorFlow Runtime
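
For illustration, the Secret and ConfigMap that the MPI plugin builds for the `deepspeed` runtime in this diff might look like the following sketch. The object names and the `kubernetes.io/ssh-auth` Secret layout are assumptions for illustration only; the actual naming is an implementation detail of the plugin:

```yaml
# A minimal sketch of the objects the MPI plugin could generate.
# All names here are hypothetical, not the final naming convention.
apiVersion: v1
kind: Secret
metadata:
  name: deepspeed-ssh-auth # hypothetical name
  namespace: default
type: kubernetes.io/ssh-auth
data:
  # Key pair generated by the plugin so the launcher can SSH into the trainer nodes.
  ssh-privatekey: <base64-encoded private key>
  ssh-publickey: <base64-encoded public key>
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: deepspeed-hostfile # hypothetical name
  namespace: default
data:
  # Mounted into the launcher at /etc/mpi/hostfile; one line per trainer node,
  # with slots equal to numProcPerNode.
  hostfile: |
    deepspeed-trainer-node-0-0.default.svc slots=5
    deepspeed-trainer-node-0-1.default.svc slots=5
```

Both objects would be attached to the corresponding JobSet as volumes, which is how the launcher obtains the SSH identity and the hostfile.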
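Because the hostfile exposes `numNodes × numProcPerNode = 2 × 5 = 10` slots, the launcher can start ten ranks without listing hosts on the command line. A minimal sketch of the resulting invocation, where `launch-job` stands in for the entrypoint from the runtime example and the exact launcher command remains an implementation detail:

```bash
# OpenMPI reads the default hostfile from this MCA environment variable,
# so no -hostfile flag is needed on the command line.
export OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile

# 2 nodes x 5 slots each = 10 ranks in total.
mpirun -n 10 launch-job
```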
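To run training on this runtime, a user would reference it from a `TrainJob`. A minimal sketch, assuming the `TrainJob` API and runtime reference introduced elsewhere in this proposal; the job name is hypothetical and the exact field shape is illustrative:

```yaml
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: deepspeed-train # hypothetical name
  namespace: default
spec:
  # Points at the ClusterTrainingRuntime defined in the diff above.
  runtimeRef:
    name: deepspeed
```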