Difference Between DDP and FSDP Modes #762

Open
lllabmaster opened this issue Dec 6, 2024 · 0 comments
Labels
type/question (an issue that's a question)

Comments

@lllabmaster

❓ The question

Hi, I encountered the following issue:

I trained a 1B model from scratch using the official OLMo_1B.yaml configuration on 4 GPUs (one node). I modified the global_batch_size to 256 and the micro_batch_size (per-device micro-batch size) to 2.
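For reference, in an OLMo-style YAML config these overrides would look roughly like the sketch below; the exact key names (global_train_batch_size, device_train_microbatch_size) are assumptions here, not quoted from the config file:

```yaml
# Sketch of the batch-size overrides, assuming OLMo-style key names
# (global_train_batch_size / device_train_microbatch_size are assumptions).
# With 4 GPUs and a per-device micro-batch of 2, a global batch of 256
# implies 256 / (4 * 2) = 32 gradient-accumulation steps per optimizer step.
global_train_batch_size: 256
device_train_microbatch_size: 2
```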

Distributed Training Configurations and Results:
DDP Mode
Distributed Strategy: ddp
DDP Settings:

grad_sync_mode: batch  
find_unused_params: false

Throughput per Device: 8,034 tokens/device/second

FSDP Mode (default wrapping)
Distributed Strategy: fsdp
FSDP Settings:

wrapping_strategy: null  
precision: mixed

Throughput per Device: 1,747 tokens/device/second

FSDP Mode (block-level wrapping, SHARD_GRAD_OP)
Distributed Strategy: fsdp
FSDP Settings:

wrapping_strategy: by_block_and_size  
precision: mixed  
sharding_strategy: SHARD_GRAD_OP  

Throughput per Device: 1,790 tokens/device/second

Other settings remained the same across configurations.
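For concreteness, the three runs correspond roughly to the following alternative config fragments (one per run). The key names are taken from the settings listed above; the nesting under distributed_strategy / ddp / fsdp is an assumption about how an OLMo-style YAML lays them out, not a copy of the actual file:

```yaml
# Run 1: DDP (8,034 tokens/device/second)
distributed_strategy: ddp
ddp:
  grad_sync_mode: batch
  find_unused_params: false
---
# Run 2: FSDP, default wrapping (1,747 tokens/device/second)
distributed_strategy: fsdp
fsdp:
  wrapping_strategy: null
  precision: mixed
---
# Run 3: FSDP, block-level wrapping + SHARD_GRAD_OP (1,790 tokens/device/second)
distributed_strategy: fsdp
fsdp:
  wrapping_strategy: by_block_and_size
  precision: mixed
  sharding_strategy: SHARD_GRAD_OP
```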

Question:

Why is there such a significant difference in throughput between DDP and FSDP modes?

lllabmaster added the type/question label on Dec 6, 2024