Distributed PyTorch

Distributed training with PyTorch is enabled when torch_distributed is set in the config.

Example RETURNN setting: Just put torch_distributed = {} into the config. This will use PyTorch DistributedDataParallel. See the PyTorch distributed overview or Getting started with distributed data parallel for details on how this works and which PyTorch, NCCL, or other relevant settings exist.

Or use torch_distributed = {"reduce_type": "param", "param_sync_step": 100} to use parameter averaging instead of gradient synchronization, i.e. to sync the parameters across workers after every 100 steps.
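
For concreteness, here is a minimal config excerpt showing both variants (a sketch, assuming an otherwise standard RETURNN PyTorch-backend config; only the torch_distributed entries are specific to this page):

```python
# Excerpt of a RETURNN config (a Python file).
# Only torch_distributed is specific to distributed training.

backend = "torch"

# Variant 1: standard DistributedDataParallel, i.e. gradients are
# all-reduced (averaged) across workers after every step.
torch_distributed = {}

# Variant 2: parameter averaging instead of gradient synchronization,
# syncing the parameters across workers every 100 steps.
# torch_distributed = {"reduce_type": "param", "param_sync_step": 100}
```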

Consider using DistributeFilesDataset for an efficient dataset that supports large-scale data and distributed training.
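
A rough sketch of how such a dataset could be configured (the file list and the HDF sub-dataset are placeholders; check the DistributeFilesDataset documentation for the exact options):

```python
# Sketch of a DistributeFilesDataset config inside a RETURNN config.

def get_sub_epoch_dataset(files_subepoch: list) -> dict:
    # Return a normal RETURNN dataset dict covering just this subset of files.
    return {
        "class": "HDFDataset",  # placeholder sub-dataset
        "files": files_subepoch,
    }

train = {
    "class": "DistributeFilesDataset",
    "files": ["data/train.part1.hdf", "data/train.part2.hdf"],  # placeholder file list
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
    "partition_epoch": 20,
}
```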

In the i6_core ReturnnTrainingJob, set horovod_num_processes to the number of processes (the name is confusing; it is not specific to Horovod anymore and also applies to other distribution frameworks), and set distributed_launch_cmd = "torchrun".
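
A sketch of how this could look in a Sisyphus setup (the other arguments are just the usual ReturnnTrainingJob arguments; returnn_config, returnn_python_exe and returnn_root are assumed to exist in your setup):

```python
from i6_core.returnn.training import ReturnnTrainingJob

train_job = ReturnnTrainingJob(
    returnn_config=returnn_config,  # your ReturnnConfig with torch_distributed set
    num_epochs=100,
    horovod_num_processes=4,        # number of worker processes, despite the name
    distributed_launch_cmd="torchrun",
    returnn_python_exe=returnn_python_exe,
    returnn_root=returnn_root,
)
```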

We call init_process_group(backend=None), which by default enables both the Gloo backend and the NCCL backend. The Gloo backend is used for CPU tensors and the NCCL backend for GPU tensors. (Presumably, when NCCL fails, it also falls back to Gloo for GPU tensors.)
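
A minimal standalone PyTorch sketch of this backend selection behavior (not RETURNN code; RETURNN performs the init internally):

```python
# Run e.g. with: torchrun --nproc_per_node=2 this_script.py

import os
import torch
import torch.distributed as dist

# backend=None (the default) enables both Gloo and NCCL where available.
dist.init_process_group(backend=None)

local_rank = int(os.environ["LOCAL_RANK"])

# CPU tensor -> the collective runs via the Gloo backend.
x_cpu = torch.ones(3)
dist.all_reduce(x_cpu)

# CUDA tensor -> the collective runs via the NCCL backend.
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    x_gpu = torch.ones(3, device="cuda")
    dist.all_reduce(x_gpu)

dist.destroy_process_group()
```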

See NCCL env vars. E.g. NCCL_DEBUG=INFO could be useful.