Add DeepSpeed Example with Pytorch Operator
Signed-off-by: Syulin7 <[email protected]>

# Conflicts:
#	.github/workflows/publish-example-images.yaml
Syulin7 committed Oct 1, 2024
1 parent 12d09d0 commit 7fa309c
Showing 6 changed files with 931 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/publish-example-images.yaml
@@ -73,3 +73,7 @@ jobs:
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/jax/cpu-demo/Dockerfile
            context: examples/jax/cpu-demo
          - component-name: pytorch-deepspeed-demo
            platforms: linux/amd64
            dockerfile: examples/pytorch/deepspeed-demo/Dockerfile
            context: examples/pytorch/deepspeed-demo
11 changes: 11 additions & 0 deletions examples/pytorch/deepspeed-demo/Dockerfile
@@ -0,0 +1,11 @@
FROM deepspeed/deepspeed:v072_torch112_cu117

RUN apt update
RUN apt install -y ninja-build

WORKDIR /
COPY requirements.txt .
COPY train_bert_ds.py .

RUN pip install -r requirements.txt
RUN mkdir -p /root/deepspeed_data
37 changes: 37 additions & 0 deletions examples/pytorch/deepspeed-demo/README.md
@@ -0,0 +1,37 @@
## Training a Masked Language Model with PyTorch and DeepSpeed

This folder contains an example of training a Masked Language Model with PyTorch and DeepSpeed.

The Python script `train_bert_ds.py` trains BERT with PyTorch and DeepSpeed. For more information, see [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md).

DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate.
See [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).

This guide will show you how to deploy DeepSpeed with the `torchrun` launcher.
The simplest way to reproduce this example is to check out the DeepSpeedExamples commit it is based on:
```shell
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples
git checkout efacebb
```

The script `train_bert_ds.py` is located in the `DeepSpeedExamples/HelloDeepSpeed/` directory.
Because the script is launched with `torchrun` rather than the deepspeed launcher, it must read `local_rank` from the environment.
The following line was added at line 670 of the script:
```python
local_rank = int(os.getenv('LOCAL_RANK', '-1'))
```
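
As a minimal sketch of what this line does: `torchrun` exports a `LOCAL_RANK` environment variable to each process it spawns, while the deepspeed launcher typically passes the rank as a `--local_rank` argument instead. The helper name `resolve_local_rank` below is hypothetical and is not part of `train_bert_ds.py`; it only mirrors the added line and keeps `-1` as the fallback when the variable is not set.

```python
import os
from typing import Mapping, Optional


def resolve_local_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the per-node process index, or -1 when LOCAL_RANK is unset.

    Hypothetical helper for illustration only; it mirrors the line
    added to train_bert_ds.py.
    """
    if env is None:
        env = os.environ
    return int(env.get("LOCAL_RANK", "-1"))


if __name__ == "__main__":
    # Prints the value of LOCAL_RANK as an int, or -1 if it is not set.
    print(resolve_local_rank())
```

Passing the environment as an argument is a design choice made only so the behavior can be checked without launching `torchrun`.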

### Build Image

The default image name and tag are `kubeflow/pytorch-deepspeed-demo:latest`.

```shell
docker build -f Dockerfile -t kubeflow/pytorch-deepspeed-demo:latest ./
```

### Create the PyTorchJob with DeepSpeed example

```shell
kubectl create -f pytorch_deepspeed_demo.yaml
```
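
The manifest `pytorch_deepspeed_demo.yaml` starts `torchrun` with `--nnodes=2` and `--nproc_per_node=1` in one Master and one Worker replica, so the job runs two training processes in total. The sketch below illustrates that arithmetic, assuming every node runs the same number of processes and ranks are assigned node by node. `world_size` and `global_rank` are hypothetical helper names, not functions from torchrun or DeepSpeed.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    # Values taken from the manifest: --nnodes=2, --nproc_per_node=1.
    print(world_size(2, 1))
    # The single process on the second node (node_rank=1, local_rank=0).
    print(global_rank(1, 0, 1))
```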
42 changes: 42 additions & 0 deletions examples/pytorch/deepspeed-demo/pytorch_deepspeed_demo.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
              command:
                - torchrun
                - --nnodes=2
                - --nproc_per_node=1
                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
              command:
                - torchrun
                - --nnodes=2
                - --nproc_per_node=1
                - /train_bert_ds.py
                - --checkpoint_dir
                - /root/deepspeed_data
              resources:
                limits:
                  nvidia.com/gpu: 1
8 changes: 8 additions & 0 deletions examples/pytorch/deepspeed-demo/requirements.txt
@@ -0,0 +1,8 @@
datasets==1.13.3
transformers==4.5.1
fire==0.4.0
pytz==2021.1
loguru==0.5.3
sh==1.14.2
pytest==6.2.5
tqdm==4.62.3