I am trying to apply DATA_PARALLEL to the small embedding tables, and it works with EmbeddingBagCollection. However, with FusedEmbeddingBagCollection it fails with an error on the second backward step. The error looks like this:
```
[rank4]: Traceback (most recent call last):
[rank4]:   File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[rank4]:     return _run_code(code, main_globals, None,
[rank4]:   File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
[rank4]:     exec(code, run_globals)
[rank4]:   File "/workdir/gen_rec/rec/app/main.py", line 41, in <module>
[rank4]:     main()
[rank4]:   File "/workdir/gen_rec/rec/app/main.py", line 24, in main
[rank4]:     train(args)
[rank4]:   File "/workdir/gen_rec/rec/app/task.py", line 17, in train
[rank4]:     trainer.train()
[rank4]:   File "/workdir/gen_rec/rec/training/trainer.py", line 278, in train
[rank4]:     loss.backward()
[rank4]:   File "/usr/local/lib/python3.9/site-packages/torch/_tensor.py", line 525, in backward
[rank4]:     torch.autograd.backward(
[rank4]:   File "/usr/local/lib/python3.9/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank4]:     _engine_run_backward(
[rank4]:   File "/usr/local/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank4]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]: RuntimeError: Detected at least one parameter gradient is not the expected DDP bucket view with gradient_as_bucket_view=True. This may happen (for example) if multiple allreduce hooks were registered onto the same parameter. If you hit this error, please file an issue with a minimal repro.
```
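For context, this is roughly how I request DATA_PARALLEL for a small table through TorchRec's planner constraints. The table name and world size below are placeholders (not my actual model), and the module/sharder setup that the planner's `plan()` call needs is omitted — this is only a sketch of the shape of the request:

```python
# Sketch: requesting DATA_PARALLEL sharding for one table via planner
# constraints. "small_table" and world_size=8 are placeholder values.
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "small_table": ParameterConstraints(
        sharding_types=[ShardingType.DATA_PARALLEL.value],
    ),
}
planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=8, compute_device="cuda"),
    constraints=constraints,
)
# planner.plan(...) (with the model and sharders) then produces the
# sharding plan passed to DistributedModelParallel.
```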
Moreover, I've read the code of the FUSED compute kernel and the DENSE (DATA_PARALLEL) compute kernel. The FUSED kernel contains optimizer-related code, while the DENSE kernel has none, so it seems DATA_PARALLEL is incompatible with FusedEmbeddingBagCollection.
I am not sure what is really happening. Could anyone help me address this problem?
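My understanding of the kernel difference (this is a plain-Python toy sketch, not real TorchRec or PyTorch code, and the class names are made up): a FUSED-style kernel applies the optimizer update inside backward, so no gradient is ever materialized for DDP's bucket views, while a DENSE kernel leaves `.grad` in place for DDP to allreduce and a separate optimizer step to consume:

```python
# Toy sketch (hypothetical classes, no torch): contrasts a FUSED-style
# kernel, which fuses the optimizer update into backward, with a
# DENSE/DATA_PARALLEL-style kernel, which leaves the gradient for DDP.

class DenseTable:
    """DENSE kernel: backward only stores the gradient; an external
    optimizer (after DDP's allreduce) performs the weight update."""
    def __init__(self, weight):
        self.weight = weight
        self.grad = None

    def backward(self, grad):
        self.grad = grad  # left in place for DDP's allreduce hook

    def optimizer_step(self, lr):
        self.weight -= lr * self.grad  # plain SGD update
        self.grad = None

class FusedTable:
    """FUSED kernel: the optimizer update happens inside backward, so
    no gradient is ever exposed for DDP to bucket or allreduce."""
    def __init__(self, weight, lr):
        self.weight = weight
        self.lr = lr
        self.grad = None  # stays None: nothing for DDP's bucket view

    def backward(self, grad):
        self.weight -= self.lr * grad  # SGD update applied immediately

dense = DenseTable(weight=1.0)
dense.backward(grad=0.5)
assert dense.grad == 0.5          # DDP would allreduce this
dense.optimizer_step(lr=0.1)

fused = FusedTable(weight=1.0, lr=0.1)
fused.backward(grad=0.5)
assert fused.grad is None         # no gradient ever materialized

# Same final weight, but reached through different mechanisms.
assert abs(dense.weight - fused.weight) < 1e-12
```

If this picture is right, DDP's `gradient_as_bucket_view=True` machinery expects every data-parallel parameter's gradient to live in its bucket, which the fused path would never populate.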