❓ The question
Hi, I encountered the following issue:
I trained a 1B model from scratch with the official OLMo_1B.yaml configuration on 4 GPUs (a single node), changing global_batch_size to 256 and micro_batch_size to 2.
Distributed Training Configurations and Results:
DDP Mode
Distributed Strategy: ddp
DDP Settings:
Throughput per Device: 8,034 tokens/device/second
FSDP Mode (run 1)
Distributed Strategy: fsdp
FSDP Settings:
Throughput per Device: 1,747 tokens/device/second
FSDP Mode (run 2)
Distributed Strategy: fsdp
FSDP Settings:
Throughput per Device: 1,790 tokens/device/second
Other settings remained the same across configurations.
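For reference, the fields I varied were along these lines. This is a sketch, not a copy of my files: field names such as `distributed_strategy`, `global_train_batch_size`, and `device_train_microbatch_size` follow allenai/OLMo's TrainConfig and may not match my exact configuration.

```yaml
# Fields varied between runs; everything else was identical.

# DDP run:
distributed_strategy: ddp
global_train_batch_size: 256
device_train_microbatch_size: 2

# FSDP runs (only distributed_strategy changed; sharding_strategy is
# assumed to be the FULL_SHARD default, not confirmed):
# distributed_strategy: fsdp
# fsdp:
#   sharding_strategy: FULL_SHARD
```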
Question:
Why is there such a large throughput gap (roughly 4.5x) between DDP and FSDP modes?
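As context for the question, here is a back-of-envelope estimate of per-step communication volume per GPU. It assumes ring collectives for DDP and FULL_SHARD sharding for FSDP, neither of which is confirmed by my configs above.

```python
# Rough per-step communication per GPU, in parameter elements
# (multiply by bytes/element to get a bandwidth cost).

def ddp_comm_elems(params: int, world_size: int) -> float:
    # DDP: one ring all-reduce of gradients per step,
    # moving 2 * (N-1)/N * P elements per GPU.
    return 2 * (world_size - 1) / world_size * params

def fsdp_full_shard_comm_elems(params: int, world_size: int) -> float:
    # FSDP FULL_SHARD: all-gather params in forward, all-gather again
    # in backward, then reduce-scatter gradients,
    # i.e. 3 * (N-1)/N * P elements per GPU.
    return 3 * (world_size - 1) / world_size * params

P = 1_000_000_000  # ~1B parameters
N = 4              # 4 GPUs on one node

print(ddp_comm_elems(P, N) / 1e9)              # → 1.5  (billion elements)
print(fsdp_full_shard_comm_elems(P, N) / 1e9)  # → 2.25 (billion elements)
```

By this estimate FULL_SHARD moves only about 1.5x the data of DDP per step, which by itself would not seem to explain a roughly 4.5x throughput gap, so I suspect something else (overlap, wrapping granularity, or my settings) is involved.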