I am trying to apply DATA_PARALLEL to the small embedding tables, and it works with EmbeddingBagCollection. However, with FusedEmbeddingBagCollection it fails with an error on the second backward step. The error looks like this:
```
[rank4]: Traceback (most recent call last):
[rank4]:   File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[rank4]:     return _run_code(code, main_globals, None,
[rank4]:   File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
[rank4]:     exec(code, run_globals)
[rank4]:   File "/workdir/gen_rec/rec/app/main.py", line 41, in <module>
[rank4]:     main()
[rank4]:   File "/workdir/gen_rec/rec/app/main.py", line 24, in main
[rank4]:     train(args)
[rank4]:   File "/workdir/gen_rec/rec/app/task.py", line 17, in train
[rank4]:     trainer.train()
[rank4]:   File "/workdir/gen_rec/rec/training/trainer.py", line 278, in train
[rank4]:     loss.backward()
[rank4]:   File "/usr/local/lib/python3.9/site-packages/torch/_tensor.py", line 525, in backward
[rank4]:     torch.autograd.backward(
[rank4]:   File "/usr/local/lib/python3.9/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank4]:     _engine_run_backward(
[rank4]:   File "/usr/local/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank4]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]: RuntimeError: Detected at least one parameter gradient is not the expected DDP bucket view with gradient_as_bucket_view=True. This may happen (for example) if multiple allreduce hooks were registered onto the same parameter. If you hit this error, please file an issue with a minimal repro.
```
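For context, this is roughly how I request DATA_PARALLEL for a small table through TorchRec's planner constraints. The table name and world size below are placeholders (not my actual model), and the module/sharder setup that the planner's `plan()` call needs is omitted — this is only a sketch of the shape of the request:

```python
# Sketch: requesting DATA_PARALLEL sharding for one table via planner
# constraints. "small_table" and world_size=8 are placeholder values.
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "small_table": ParameterConstraints(
        sharding_types=[ShardingType.DATA_PARALLEL.value],
    ),
}
planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=8, compute_device="cuda"),
    constraints=constraints,
)
# planner.plan(...) (with the model and sharders) then produces the
# sharding plan passed to DistributedModelParallel.
```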
Moreover, I've read the code of the FUSED compute kernel and the DENSE (DATA_PARALLEL) compute kernel. The FUSED kernel contains optimizer-related code, while the DENSE kernel has none, so it seems DATA_PARALLEL is incompatible with FusedEmbeddingBagCollection.
I am not sure what is really happening. Could anyone help me address this problem?
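My understanding of the kernel difference (this is a plain-Python toy sketch, not real TorchRec or PyTorch code, and the class names are made up): a FUSED-style kernel applies the optimizer update inside backward, so no gradient is ever materialized for DDP's bucket views, while a DENSE kernel leaves `.grad` in place for DDP to allreduce and a separate optimizer step to consume:

```python
# Toy sketch (hypothetical classes, no torch): contrasts a FUSED-style
# kernel, which fuses the optimizer update into backward, with a
# DENSE/DATA_PARALLEL-style kernel, which leaves the gradient for DDP.

class DenseTable:
    """DENSE kernel: backward only stores the gradient; an external
    optimizer (after DDP's allreduce) performs the weight update."""
    def __init__(self, weight):
        self.weight = weight
        self.grad = None

    def backward(self, grad):
        self.grad = grad  # left in place for DDP's allreduce hook

    def optimizer_step(self, lr):
        self.weight -= lr * self.grad  # plain SGD update
        self.grad = None

class FusedTable:
    """FUSED kernel: the optimizer update happens inside backward, so
    no gradient is ever exposed for DDP to bucket or allreduce."""
    def __init__(self, weight, lr):
        self.weight = weight
        self.lr = lr
        self.grad = None  # stays None: nothing for DDP's bucket view

    def backward(self, grad):
        self.weight -= self.lr * grad  # SGD update applied immediately

dense = DenseTable(weight=1.0)
dense.backward(grad=0.5)
assert dense.grad == 0.5          # DDP would allreduce this
dense.optimizer_step(lr=0.1)

fused = FusedTable(weight=1.0, lr=0.1)
fused.backward(grad=0.5)
assert fused.grad is None         # no gradient ever materialized

# Same final weight, but reached through different mechanisms.
assert abs(dense.weight - fused.weight) < 1e-12
```

If this picture is right, DDP's `gradient_as_bucket_view=True` machinery expects every data-parallel parameter's gradient to live in its bucket, which the fused path would never populate.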