
Multi-node training speed issue #40

Open
kakaxi-liu opened this issue Jun 18, 2024 · 3 comments
@kakaxi-liu

I tried training on multiple machines with multiple GPUs and found that it takes much longer than on a single machine. Under the same training setup it is three times slower than DeepSpeed Ulysses, yet on a single machine there is no such gap. What could be causing this?

@kakaxi-liu
Author

Is it because every communication step crosses machines, and inter-machine communication is slow?

@zhuzilin
Owner

That should be the reason. It is also expected to be somewhat slower than DeepSpeed Ulysses (though 3x slower does feel like a lot...), because DeepSpeed Ulysses has more balanced computation and more regular communication. On the other hand, DeepSpeed Ulysses is limited by the model's num heads, so it cannot scale to sufficiently long context lengths.
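To make the bottleneck concrete, here is a minimal sketch (not from this repo) of why a ring that spans machines suffers: in ring attention, rank i passes its KV block to rank (i + 1) % world_size at every step, and each step can only finish once the slowest link does, so even a couple of cross-node links throttle the whole ring.

```python
def ring_links(world_size, gpus_per_node):
    """Count how many of the ring's send->recv links stay inside a node
    versus cross the node boundary, assuming ranks are packed per node."""
    intra = inter = 0
    for rank in range(world_size):
        nxt = (rank + 1) % world_size
        if rank // gpus_per_node == nxt // gpus_per_node:
            intra += 1
        else:
            inter += 1
    return intra, inter

# Two nodes with 8 GPUs each: only 2 of the 16 links cross nodes,
# but every ring step has to wait for those slow links.
print(ring_links(16, 8))  # -> (14, 2)
```

Since every step of the ring synchronizes on its slowest hop, the inter-node bandwidth effectively sets the pace for all ranks, which matches the slowdown observed above.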

@peiliu0408

Marking this thread.
Does the current version support running zig_zag_ring_attention within a single machine (i.e., per node), with only gradient communication between machines, to mitigate the high cost of inter-node communication? Of course, the maximum supported sequence length would then become shorter.
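A hypothetical sketch of the rank layout this suggestion implies: partition the global ranks into one group per node, run the ring attention only inside each group, and leave cross-node traffic to the usual gradient all-reduce. The partitioning is plain Python; the `torch.distributed.new_group` call mentioned in the comment is only illustrative and not taken from this repo.

```python
def node_groups(world_size, gpus_per_node):
    """Partition global ranks into per-node rank lists, assuming ranks
    are assigned to nodes in contiguous blocks of gpus_per_node."""
    return [list(range(start, start + gpus_per_node))
            for start in range(0, world_size, gpus_per_node)]

# Each sublist would back one intra-node ring, e.g. (illustrative only):
#   for ranks in node_groups(world_size, gpus_per_node):
#       group = torch.distributed.new_group(ranks)
# Ring attention KV passing then stays on fast intra-node links, and the
# only cross-node communication left is the gradient all-reduce.
print(node_groups(16, 8))
```

The trade-off the comment notes follows directly: with rings of size gpus_per_node instead of world_size, the maximum sequence length the ring can shard also shrinks by the same factor.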
