Distributed PyTorch
Distributed PyTorch will be enabled when torch_distributed is set in the config.
Example RETURNN setting: just put torch_distributed = {} into the config. This will use PyTorch DistributedDataParallel.
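For illustration, a minimal sketch of the relevant config entries; everything else (model, datasets, optimizer, etc.) is assumed to be defined as usual, and backend = "torch" is only shown for context.

```python
# Minimal RETURNN config sketch (assumption: the rest of the config is set up as usual).
backend = "torch"        # PyTorch backend
torch_distributed = {}   # enable distributed training via DistributedDataParallel
```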
See the PyTorch distributed overview or Getting started with distributed data parallel for details on how this works and what PyTorch, NCCL or other relevant settings there are.
Or use torch_distributed = {"reduce_type": "param", "param_sync_step": 100} to use parameter averaging instead of the default gradient synchronization, i.e. to sync the parameters after every 100 steps.
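Conceptually, parameter averaging replaces the per-step gradient all-reduce with a periodic all-reduce of the parameters themselves. A standalone sketch of the idea (not RETURNN's actual implementation), assuming torch.distributed is already initialized:

```python
import torch
import torch.distributed as dist

def average_parameters(model: torch.nn.Module):
    """Average all parameters across workers, e.g. called every param_sync_step steps."""
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in model.parameters():
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)  # sum over all workers
            param.data /= world_size                           # divide to get the mean
```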
Maybe use DistributeFilesDataset for an efficient dataset supporting large-scale data and training.
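For illustration, a hedged configuration sketch; the option names used here (files, get_sub_epoch_dataset, partition_epoch) and the file list are assumptions, so check the RETURNN DistributeFilesDataset documentation for the actual options.

```python
# Hypothetical dataset config sketch; option names and values are assumptions.
train = {
    "class": "DistributeFilesDataset",
    "files": ["data/part-0001.hdf", "data/part-0002.hdf"],  # hypothetical file list
    "get_sub_epoch_dataset": lambda files_subepoch: {"class": "HDFDataset", "files": files_subepoch},
    "partition_epoch": 20,
}
```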
In the i6_core ReturnnTrainingJob, set horovod_num_processes (the name is confusing; it is not about Horovod anymore but also applies to other distribution frameworks) to the number of processes, and set distributed_launch_cmd = "torchrun".
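A hedged sketch of such a job setup; returnn_config and the other arguments are placeholders assumed to be defined as usual in the Sisyphus setup:

```python
from i6_core.returnn.training import ReturnnTrainingJob

# Sketch under assumptions: returnn_config is an existing ReturnnConfig;
# further arguments (time/mem requirements, etc.) as usual.
train_job = ReturnnTrainingJob(
    returnn_config=returnn_config,
    num_epochs=100,
    horovod_num_processes=4,            # number of distributed processes (despite the name, not Horovod-specific)
    distributed_launch_cmd="torchrun",  # launch the processes via torchrun
)
```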
We call init_process_group(backend=None), which will by default enable both the Gloo backend and the NCCL backend. The Gloo backend will be used for CPU tensors and the NCCL backend for GPU tensors. (I think) when NCCL fails, it will also fall back to Gloo for GPU tensors.
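For illustration, a standalone sketch (plain torch.distributed, not RETURNN code) of what this backend selection means, assuming the processes were started with torchrun:

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend=None)       # initializes both Gloo and NCCL
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

cpu_tensor = torch.ones(1)                  # collectives on this tensor use Gloo
gpu_tensor = torch.ones(1, device="cuda")   # collectives on this tensor use NCCL
dist.all_reduce(cpu_tensor)
dist.all_reduce(gpu_tensor)
```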
See the NCCL env vars. E.g. NCCL_DEBUG=INFO could be useful.
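As an example, one way to enable NCCL debug output, assuming the environment variable is set before NCCL gets initialized (e.g. in the job environment or early in the config):

```python
import os

# Assumption: this runs before torch.distributed initializes NCCL.
os.environ.setdefault("NCCL_DEBUG", "INFO")  # print NCCL version, topology and communication info
```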