Describe the bug
Running single node training with the PyTorch backend fails with the following error:
"Check failed: r == ncclSuccess NCCL error: unhandled cuda error"
I was able to get around this error by 1) swapping the order of the broadcast_parameters and broadcast_optimizer_state calls, OR 2) running train() once before calling these functions.
Once I had a successful run, it was not possible to reproduce the error again.
@anj-s Thanks a lot for the update! I am sorry that we did not help you earlier.
This error often happens during broadcast and is often caused by passing in CPU tensors. It also seems to be hardware/platform dependent, i.e., some machines run it fine while others complain about it. We are looking into how to make this more robust.
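A quick way to see whether CPU tensors are sneaking into the broadcast is something like the sketch below. This is only a diagnostic sketch, assuming the Horovod-style byteps.torch entry points (init, local_rank, broadcast_parameters); the toy model is a placeholder, not your actual code.

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

# Keep the model on the GPU *before* taking the state_dict to broadcast.
model = torch.nn.Linear(10, 10).cuda()
state_dict = model.state_dict()

# Flag any CPU tensors that would otherwise go into the broadcast.
cpu_keys = [k for k, v in state_dict.items()
            if torch.is_tensor(v) and not v.is_cuda]
assert not cpu_keys, f"CPU tensors would be broadcast: {cpu_keys}"

bps.broadcast_parameters(state_dict, root_rank=0)
```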
Thanks @bobzhuyb for responding. I've seen this behavior before, and usually priming the NCCL connection with an allreduce op on a single (1, 1) tensor has helped. I'll probably put that in and see if it helps.
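For concreteness, here is roughly what I mean by priming. A minimal sketch, assuming byteps.torch exposes a Horovod-style push_pull as its allreduce equivalent; the tensor name and shape are just illustrative, not from my actual script.

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

# Warm-up collective on a tiny CUDA tensor to establish the NCCL
# communicators before the real broadcasts run.
# push_pull is BytePS's allreduce-style collective in its Horovod-like
# API; if your build names it differently, substitute accordingly.
warmup = torch.ones(1, 1).cuda()
bps.push_pull(warmup, name="nccl_warmup")

# ...then call broadcast_parameters / broadcast_optimizer_state as usual.
```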
I do have another issue, if you don't mind taking a look. Thanks!
Describe the bug
Running single node training with the PyTorch backend fails with the following error:
"Check failed: r == ncclSuccess NCCL error: unhandled cuda error"
It fails at this point in the code: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L290
To Reproduce
Steps to reproduce the behavior:
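Launch single-node BytePS training with the PyTorch backend and call the two broadcast helpers before the first training step. A minimal sketch of my setup (assuming the standard byteps.torch API; the model and optimizer are placeholders, not my actual training script):

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The "unhandled cuda error" comes out of these broadcast calls, before
# any training step has run (see the link above for the exact line).
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)

# train() would normally start here.
```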
Expected behavior
Successful run without issues.
Environment:
CUDA: 10.1
NCCL: 2.4.7
Additional context
Packages installed: https://gist.github.com/anj-s/fe4ca6bc3630aa2e9ca3ba5344d09106
Output of nvidia-smi: https://gist.github.com/anj-s/2e097432a6ec3962655d476f756bcdc4