Difference Between DDP and FSDP Modes #762

Open
lllabmaster opened this issue Dec 6, 2024 · 0 comments
Labels
type/question (an issue that's a question)

Comments

@lllabmaster

❓ The question

Hi, I encountered the following issue:

I trained a 1B model from scratch using the official OLMo_1B.yaml configuration on 4 GPUs (one node). I modified the global_batch_size to 256 and the micro_batch_size (per-device micro-batch size) to 2.
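For reference, in an OLMo-style YAML config these overrides would look roughly like the sketch below; the exact key names (global_train_batch_size, device_train_microbatch_size) are assumptions here, not quoted from the config file:

```yaml
# Sketch of the batch-size overrides, assuming OLMo-style key names
# (global_train_batch_size / device_train_microbatch_size are assumptions).
# With 4 GPUs and a per-device micro-batch of 2, a global batch of 256
# implies 256 / (4 * 2) = 32 gradient-accumulation steps per optimizer step.
global_train_batch_size: 256
device_train_microbatch_size: 2
```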

Distributed Training Configurations and Results:
DDP Mode
Distributed Strategy: ddp
DDP Settings:

grad_sync_mode: batch  
find_unused_params: false

Throughput per Device: 8,034 tokens/device/second

FSDP Mode (default wrapping)
Distributed Strategy: fsdp
FSDP Settings:

wrapping_strategy: null  
precision: mixed

Throughput per Device: 1,747 tokens/device/second

FSDP Mode (block-level wrapping, SHARD_GRAD_OP)
Distributed Strategy: fsdp
FSDP Settings:

wrapping_strategy: by_block_and_size  
precision: mixed  
sharding_strategy: SHARD_GRAD_OP  

Throughput per Device: 1,790 tokens/device/second

Other settings remained the same across configurations.
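For concreteness, the three runs correspond roughly to the following alternative config fragments (one per run). The key names are taken from the settings listed above; the nesting under distributed_strategy / ddp / fsdp is an assumption about how an OLMo-style YAML lays them out, not a copy of the actual file:

```yaml
# Run 1: DDP (8,034 tokens/device/second)
distributed_strategy: ddp
ddp:
  grad_sync_mode: batch
  find_unused_params: false
---
# Run 2: FSDP, default wrapping (1,747 tokens/device/second)
distributed_strategy: fsdp
fsdp:
  wrapping_strategy: null
  precision: mixed
---
# Run 3: FSDP, block-level wrapping + SHARD_GRAD_OP (1,790 tokens/device/second)
distributed_strategy: fsdp
fsdp:
  wrapping_strategy: by_block_and_size
  precision: mixed
  sharding_strategy: SHARD_GRAD_OP
```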

Question:

Why is there such a significant difference in throughput between DDP and FSDP modes?

lllabmaster added the type/question label on Dec 6, 2024