
question about parallelism for embedding #2119

Open
imh966 opened this issue Jun 17, 2024 · 8 comments

Comments

imh966 commented Jun 17, 2024

It seems torchrec does not support combining data parallelism with row-wise parallelism for embeddings. Is there a plan to support this? Or is row-wise parallelism on its own efficient enough for multi-node training?

iamzainhuda (Contributor) commented

If I understand correctly, you want to apply data parallelism on top of row-wise shards of an embedding? AFAIU this seems like a niche case, and I'm not sure it brings gains over the currently supported sharding schemes. Usually RW/CW sharding is efficient for multi-node training.
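For reference, here is a minimal sketch of how sharding can be restricted to RW/CW today via planner constraints; the table name ("product_id"), world size, and the commented-out collective_plan call are illustrative assumptions, not taken from this thread:

```python
# Minimal sketch: constrain the TorchRec planner to row-/column-wise sharding.
# The table name "product_id" and the world size are made up for illustration.
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "product_id": ParameterConstraints(
        sharding_types=[
            ShardingType.ROW_WISE.value,
            ShardingType.COLUMN_WISE.value,
        ],
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=16, compute_device="cuda"),
    constraints=constraints,
)
# plan = planner.collective_plan(model, sharders, pg)  # when sharding the model
```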

imh966 (Author) commented Jul 4, 2024

Thanks for your reply, I see what you mean. But for massive training, say on hundreds of GPUs, RW/CW probably makes the per-GPU embedding shards too small. In that case, could DP+RW/CW be a better approach? Or should I just use TW+RW/CW?
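To put a rough number on the "too small" concern, here is a back-of-the-envelope sketch (all sizes are hypothetical):

```python
# Back-of-the-envelope: per-GPU shard size under plain row-wise sharding.
# All numbers are hypothetical.
num_rows = 10_000_000     # rows in one embedding table
emb_dim = 128             # embedding dimension
world_size = 512          # GPUs in the job

rows_per_gpu = num_rows // world_size             # ~19.5K rows per GPU
mib_per_gpu = rows_per_gpu * emb_dim * 4 / 2**20  # fp32 weights only, ~9.5 MiB
print(rows_per_gpu, round(mib_per_gpu, 1))
```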

iamzainhuda (Contributor) commented Nov 1, 2024

Sorry for the late reply here. We recently added a GRID_SHARD type for massive training, which shards both row-wise and column-wise. Depending on how big the embedding tables are, this can be more efficient for massive training.

For your case I think DP+(RW/CW) seems best. I'm sure by now you've come up with something of your own, but we have official support for this coming in the next month, akin to multi-level parallelism.
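For reference, GRID_SHARD is requested the same way as other sharding types, through planner constraints; a sketch along the lines of the earlier snippet (the table name is hypothetical, and GRID_SHARD availability depends on the torchrec version):

```python
# Sketch: request grid sharding (row- and column-wise combined) from the planner.
# "large_table" is a hypothetical table name; ShardingType.GRID_SHARD requires a
# recent torchrec release.
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "large_table": ParameterConstraints(
        sharding_types=[ShardingType.GRID_SHARD.value],
    ),
}
```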

imh966 (Author) commented Nov 4, 2024

That sounds great! By the way, I found that the GRID_SHARD type can only be applied to EmbeddingBagCollection, not EmbeddingCollection. In my case I mainly use EmbeddingCollection, so I'd like to know whether your new multi-level parallelism works for both EmbeddingCollection and EmbeddingBagCollection.

iamzainhuda (Contributor) commented

Yes, the multi-level parallelism will support both EmbeddingCollection and EmbeddingBagCollection. It is applied at the model level, meaning EC/EBC are supported, as well as all sharding types for the embedding tables.

iamzainhuda (Contributor) commented

@imh966, we've just published our first pass: #2554

take a look and let me know what you think!
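For readers following the thread, here is a rough sketch of how the model-level 2D wrapper from #2554 might be set up; the DMPCollection import path and all argument names are assumptions from a quick read of the PR and may not match the released torchrec API exactly:

```python
# Rough sketch of a 2D (replica groups x sharding groups) setup per #2554.
# The DMPCollection import path and argument names below are assumptions and
# may differ from the released API.
import torch
import torch.distributed as dist
from torchrec.distributed.model_parallel import DMPCollection

WORLD_SIZE = 128           # total ranks in the job
SHARDING_GROUP_SIZE = 16   # ranks that jointly shard the tables (e.g. one node)
# => 128 / 16 = 8 replica groups that data-parallel the sharded embeddings

model = build_model()             # hypothetical: your EC/EBC-based model
plan = make_sharding_plan(model)  # hypothetical: plan from EmbeddingShardingPlanner

dmp_model = DMPCollection(
    module=model,
    device=torch.device("cuda"),
    plan=plan,
    world_size=WORLD_SIZE,
    sharding_group_size=SHARDING_GROUP_SIZE,
    global_pg=dist.GroupMember.WORLD,
)
# Training then proceeds as with DistributedModelParallel; the wrapper
# periodically averages embedding weights across replica groups rather than
# syncing gradients (see the discussion below).
```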

imh966 (Author) commented Nov 13, 2024

It seems to me that this is a good method for massive training, and I believe it's more efficient. However, your method applies all-reduce to the embedding weights rather than to the gradients, and I think the latter is more common. Furthermore, if a stateful optimizer like Adam is used for the embeddings, the two approaches can yield different results. I'm not sure whether this discrepancy would affect model quality. Is there any theory that supports this method?
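To make the Adam discrepancy concrete, here is a tiny standalone illustration (plain PyTorch, no torchrec; numbers are arbitrary): averaging gradients before the step and averaging weights after the step coincide for plain SGD, but generally not for Adam, whose update is a nonlinear function of the gradient:

```python
# Tiny standalone illustration (no torchrec): gradient-averaging vs
# weight-averaging under Adam for two "replicas" seeing different gradients.
import torch

w0 = torch.tensor([1.0])

def adam_step(w, grad, lr=0.1):
    # One Adam step from fresh optimizer state, for illustration only.
    p = w.clone().requires_grad_(True)
    p.grad = grad.clone()
    torch.optim.Adam([p], lr=lr).step()
    return p.detach()

g1, g2 = torch.tensor([-1.0]), torch.tensor([3.0])

# (a) sync gradients first, then take one step with the averaged gradient
w_grad_sync = adam_step(w0, (g1 + g2) / 2)

# (b) let each replica step on its own gradient, then average the weights
w_weight_sync = (adam_step(w0, g1) + adam_step(w0, g2)) / 2

print(w_grad_sync, w_weight_sync)  # ~0.9 vs ~1.0: Adam's update is not linear in the gradient
```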

iamzainhuda (Contributor) commented

Yeah, great catch. We went with syncing the embedding weights instead of the gradients due to an FBGEMM implementation detail: FBGEMM fuses the optimizer update into the backward pass, so syncing gradients instead would lose quite a bit of performance and incur a much larger memory overhead. This means training is not truly "equivalent" to a non-2D scheme; some tuning of the optimizer is required, and I'm hoping to share more once it's ready.
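For readers following along, the weight-sync path amounts to an occasional all-reduce over the embedding weights across replica groups, which is cheap relative to per-step gradient sync when the optimizer is already fused into the backward. A minimal sketch (the replica process group, the helper name, and the sync cadence are assumptions, not torchrec's exact mechanism):

```python
# Minimal sketch of periodic cross-replica weight sync (not torchrec's exact code).
# `replica_pg` groups the ranks that hold the same shard; SYNC_EVERY is an
# assumption about cadence, not a torchrec setting.
import torch.distributed as dist

SYNC_EVERY = 1  # average weights across replica groups every N steps

def maybe_sync_weights(step, sharded_weights, replica_pg):
    if step % SYNC_EVERY != 0:
        return
    for w in sharded_weights:
        # FBGEMM has already applied the fused optimizer update during backward,
        # so the sync averages the resulting weights rather than the gradients.
        dist.all_reduce(w, op=dist.ReduceOp.AVG, group=replica_pg)
```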
