
question about parallelism for embedding #2119

Open
imh966 opened this issue Jun 17, 2024 · 8 comments

Comments

imh966 commented Jun 17, 2024

It seems torchrec does not support combining data parallelism with row-wise parallelism for embeddings. Is there a plan to support this? Or is row-wise parallelism on its own efficient enough for multi-node training?

iamzainhuda (Contributor) commented

If I understand correctly, you want to apply data parallelism on top of row-wise shards of an embedding? AFAIU this seems like a niche case, and I'm not sure it brings gains over the currently supported sharding schemes. Usually RW/CW sharding is efficient for multi-node training.
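For reference, here is a minimal sketch of how sharding can be restricted to RW/CW today via planner constraints; the table name ("product_id"), world size, and the commented-out collective_plan call are illustrative assumptions, not taken from this thread:

```python
# Minimal sketch: constrain the TorchRec planner to row-/column-wise sharding.
# The table name "product_id" and the world size are made up for illustration.
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "product_id": ParameterConstraints(
        sharding_types=[
            ShardingType.ROW_WISE.value,
            ShardingType.COLUMN_WISE.value,
        ],
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=16, compute_device="cuda"),
    constraints=constraints,
)
# plan = planner.collective_plan(model, sharders, pg)  # when sharding the model
```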

imh966 (Author) commented Jul 4, 2024

Thanks for your reply, I see what you mean. But for massive training, say on hundreds of GPUs, RW/CW probably makes the per-GPU embedding shards too small. In that case, could DP+RW/CW be a better approach? Or should I just use TW+RW/CW?
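To put a rough number on the "too small" concern, here is a back-of-the-envelope sketch (all sizes are hypothetical):

```python
# Back-of-the-envelope: per-GPU shard size under plain row-wise sharding.
# All numbers are hypothetical.
num_rows = 10_000_000     # rows in one embedding table
emb_dim = 128             # embedding dimension
world_size = 512          # GPUs in the job

rows_per_gpu = num_rows // world_size             # ~19.5K rows per GPU
mib_per_gpu = rows_per_gpu * emb_dim * 4 / 2**20  # fp32 weights only, ~9.5 MiB
print(rows_per_gpu, round(mib_per_gpu, 1))
```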

iamzainhuda (Contributor) commented Nov 1, 2024

Sorry for the late reply here. We recently added a GRID_SHARD type for massive training, which shards both row-wise and column-wise. Depending on how big the embedding tables are, this can be more efficient for massive training.

For your case I think DP+(RW/CW) seems best. I'm sure by now you've come up with something of your own, but we have official support for this coming in the next month, akin to multi-level parallelism.
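For reference, GRID_SHARD is requested the same way as other sharding types, through planner constraints; a sketch along the lines of the earlier snippet (the table name is hypothetical, and GRID_SHARD availability depends on the torchrec version):

```python
# Sketch: request grid sharding (row- and column-wise combined) from the planner.
# "large_table" is a hypothetical table name; ShardingType.GRID_SHARD requires a
# recent torchrec release.
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "large_table": ParameterConstraints(
        sharding_types=[ShardingType.GRID_SHARD.value],
    ),
}
```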

imh966 (Author) commented Nov 4, 2024

That sounds great! By the way, I found that the GRID_SHARD type can only be applied to EmbeddingBagCollection, not EmbeddingCollection. In my case I mainly use EmbeddingCollection, so I'd like to know whether your new multi-level parallelism works for both EmbeddingCollection and EmbeddingBagCollection.

iamzainhuda (Contributor) commented

Yes, the multi-level parallelism will support both EmbeddingCollection and EmbeddingBagCollection. It is applied at the model level, meaning EC/EBC are supported, as well as all sharding types for the embedding tables.

iamzainhuda (Contributor) commented

@imh966, we've just published our first pass: #2554

take a look and let me know what you think!
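For readers following the thread, here is a rough sketch of how the model-level 2D wrapper from #2554 might be set up; the DMPCollection import path and all argument names are assumptions from a quick read of the PR and may not match the released torchrec API exactly:

```python
# Rough sketch of a 2D (replica groups x sharding groups) setup per #2554.
# The DMPCollection import path and argument names below are assumptions and
# may differ from the released API.
import torch
import torch.distributed as dist
from torchrec.distributed.model_parallel import DMPCollection

WORLD_SIZE = 128           # total ranks in the job
SHARDING_GROUP_SIZE = 16   # ranks that jointly shard the tables (e.g. one node)
# => 128 / 16 = 8 replica groups that data-parallel the sharded embeddings

model = build_model()             # hypothetical: your EC/EBC-based model
plan = make_sharding_plan(model)  # hypothetical: plan from EmbeddingShardingPlanner

dmp_model = DMPCollection(
    module=model,
    device=torch.device("cuda"),
    plan=plan,
    world_size=WORLD_SIZE,
    sharding_group_size=SHARDING_GROUP_SIZE,
    global_pg=dist.GroupMember.WORLD,
)
# Training then proceeds as with DistributedModelParallel; the wrapper
# periodically averages embedding weights across replica groups rather than
# syncing gradients (see the discussion below).
```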

imh966 (Author) commented Nov 13, 2024

It seems to me that this is a good method for massive training, and I believe it's more efficient. However, your method applies all-reduce to the embedding weights rather than to the gradients, and I think the latter is more common. Furthermore, if a stateful optimizer like Adam is used for the embeddings, the two approaches can yield different results. I'm not sure whether this discrepancy would affect model quality. Is there any theory that supports this method?
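To make the Adam discrepancy concrete, here is a tiny standalone illustration (plain PyTorch, no torchrec; numbers are arbitrary): averaging gradients before the step and averaging weights after the step coincide for plain SGD, but generally not for Adam, whose update is a nonlinear function of the gradient:

```python
# Tiny standalone illustration (no torchrec): gradient-averaging vs
# weight-averaging under Adam for two "replicas" seeing different gradients.
import torch

w0 = torch.tensor([1.0])

def adam_step(w, grad, lr=0.1):
    # One Adam step from fresh optimizer state, for illustration only.
    p = w.clone().requires_grad_(True)
    p.grad = grad.clone()
    torch.optim.Adam([p], lr=lr).step()
    return p.detach()

g1, g2 = torch.tensor([-1.0]), torch.tensor([3.0])

# (a) sync gradients first, then take one step with the averaged gradient
w_grad_sync = adam_step(w0, (g1 + g2) / 2)

# (b) let each replica step on its own gradient, then average the weights
w_weight_sync = (adam_step(w0, g1) + adam_step(w0, g2)) / 2

print(w_grad_sync, w_weight_sync)  # ~0.9 vs ~1.0: Adam's update is not linear in the gradient
```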

iamzainhuda (Contributor) commented

Yeah, great catch. We went with syncing the embedding weights instead of the gradients due to an FBGEMM implementation detail: FBGEMM fuses the optimizer update into the backward pass, so syncing gradients instead would lose quite a bit of performance and incur a much larger memory overhead. This means training is not truly "equivalent" to a non-2D scheme; some tuning of the optimizer is required, and I'm hoping to share more once it's ready.
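For readers following along, the weight-sync path amounts to an occasional all-reduce over the embedding weights across replica groups, which is cheap relative to per-step gradient sync when the optimizer is already fused into the backward. A minimal sketch (the replica process group, the helper name, and the sync cadence are assumptions, not torchrec's exact mechanism):

```python
# Minimal sketch of periodic cross-replica weight sync (not torchrec's exact code).
# `replica_pg` groups the ranks that hold the same shard; SYNC_EVERY is an
# assumption about cadence, not a torchrec setting.
import torch.distributed as dist

SYNC_EVERY = 1  # average weights across replica groups every N steps

def maybe_sync_weights(step, sharded_weights, replica_pg):
    if step % SYNC_EVERY != 0:
        return
    for w in sharded_weights:
        # FBGEMM has already applied the fused optimizer update during backward,
        # so the sync averages the resulting weights rather than the gradients.
        dist.all_reduce(w, op=dist.ReduceOp.AVG, group=replica_pg)
```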
