
axonn-ai/distrib-dl-tutorial


SC24 - Tutorial on Distributed Training of Deep Neural Networks

Join slack

All the code for the hands-on exercises can be found in this repository.

Table of Contents

- Setup
- Basics of Model Training
- Data Parallelism
- Tensor Parallelism
- Inference

Setup

To request an account on Zaratan, please join slack at the link above, and fill this Google form.

We have pre-built the dependencies required for this tutorial on Zaratan. The environment is activated automatically when you run the provided bash scripts.

Model weights and the training dataset have been pre-downloaded to /scratch/zt1/project/sc24/shared/.

Basics of Model Training

Using PyTorch Lightning

CONFIG_FILE=configs/single_gpu.json sbatch --ntasks-per-node=1  train.sh

Mixed Precision

Open configs/single_gpu.json, change precision to bf16-mixed, and then run:

CONFIG_FILE=configs/single_gpu.json sbatch --ntasks-per-node=1  train.sh
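Under bf16-mixed, Lightning keeps the master weights in float32 and runs matmul-heavy ops in bfloat16 via torch.autocast. A minimal sketch of the same mechanism in plain PyTorch (CPU is used here only so it runs anywhere; on a GPU node this would be device_type="cuda"):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)   # master weights stay in float32
x = torch.randn(8, 16)

# Inside autocast, the linear layer's matmul is executed in bfloat16,
# so the output comes back in bfloat16 while the weights remain float32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)
```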

Data Parallelism

Pytorch Distributed Data Parallel (DDP)

CONFIG_FILE=configs/ddp.json sbatch --ntasks-per-node=4  train.sh
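DDP replicates the full model on every GPU and all-reduces gradients during backward so that all replicas stay in sync. A self-contained single-process sketch of the wrapping pattern (gloo backend on CPU with world size 1, purely illustrative; in the tutorial the launcher sets up one process per GPU):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process stand-in: the job launcher would normally set these per rank
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 1)
ddp_model = DDP(model)  # each rank holds a full replica of the model

# backward() triggers an all-reduce so every replica sees identical gradients
loss = ddp_model(torch.randn(4, 8)).sum()
loss.backward()

dist.destroy_process_group()
```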

Fully Sharded Data Parallelism (FSDP)

CONFIG_FILE=configs/fsdp.json sbatch --ntasks-per-node=4  train.sh
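Unlike DDP, FSDP shards parameters, gradients, and optimizer state across ranks instead of replicating them, all-gathering each layer's weights only around its forward/backward pass. The memory arithmetic can be sketched without any GPUs (hypothetical sizes, plain tensor ops, not the FSDP API):

```python
import torch

# Hypothetical sizes: one 16x16 weight sharded across 4 ranks
world_size = 4
full_param = torch.randn(16, 16)

# Each rank stores only 1/world_size of the flattened parameter...
shards = list(full_param.flatten().chunk(world_size))

# ...and before a layer's forward/backward pass, FSDP all-gathers the
# shards to rebuild the full weight, freeing it again afterwards
rebuilt = torch.cat(shards).view(16, 16)
assert torch.equal(rebuilt, full_param)

# Per-rank parameter memory drops by a factor of world_size
assert shards[0].numel() == full_param.numel() // world_size
```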

Tensor Parallelism

CONFIG_FILE=configs/axonn.json sbatch --ntasks-per-node=4  train.sh
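Tensor parallelism splits individual layers across GPUs rather than replicating or sharding whole models; AxoNN handles this for you based on the config. The core idea for a column-split linear layer can be shown with plain tensor ops (illustration only, not AxoNN's API):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)        # activations, replicated on both "devices"
weight = torch.randn(8, 6)   # a linear layer's weight

# Column parallelism: each of 2 devices owns half of the output columns
w0, w1 = weight.chunk(2, dim=1)
partial0 = x @ w0            # computed on device 0
partial1 = x @ w1            # computed on device 1

# An all-gather along the column dimension reassembles the full output
combined = torch.cat([partial0, partial1], dim=1)
assert torch.allclose(combined, x @ weight)
```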

Inference

Add more prompts to data/inference/prompts.txt if you want. Then run:

CONFIG_FILE=configs/inference_axonn.json sbatch --ntasks-per-node=1  infer.sh

With torch.compile

Open configs/inference_axonn.json and change compile to true. Then run:

CONFIG_FILE=configs/inference_axonn.json sbatch --ntasks-per-node=1  infer.sh
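torch.compile captures the model's computation graph and, with the default inductor backend, generates fused kernels, typically speeding up inference after a one-time warm-up compile. A minimal sketch of the API (backend="eager" is chosen here only so the example runs without a C++ toolchain; results match the uncompiled function either way):

```python
import torch

def f(x):
    # A small elementwise function standing in for a model's forward pass
    return torch.sin(x) + torch.cos(x)

# torch.compile returns a callable that traces and optimizes f on first use
compiled_f = torch.compile(f, backend="eager")

x = torch.randn(8)
assert torch.allclose(compiled_f(x), f(x))
```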

With tensor parallelism

Open configs/inference_axonn.json and change tp_dimensions to [4, 1, 1]. Then run:

CONFIG_FILE=configs/inference_axonn.json sbatch --ntasks-per-node=4  infer.sh
