
Unable to run training on a single node due to "Check failed: r == ncclSuccess NCCL error: unhandled cuda error" #380

Closed
anj-s opened this issue Mar 31, 2021 · 4 comments

Comments

anj-s commented Mar 31, 2021

Describe the bug
Running single-node training with the PyTorch backend fails with the following error:
"Check failed: r == ncclSuccess NCCL error: unhandled cuda error"

It fails at this point in the code: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L290
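For reference, a minimal sketch of the sequence that reaches that line, assuming the Horovod-style API that byteps.torch exposes (bps.init, DistributedOptimizer, broadcast_parameters, broadcast_optimizer_state; these names are not taken from the repro gist):

```python
# Sketch of the code path that hits the error; API names are assumptions
# based on the Horovod-like byteps.torch interface.
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# The NCCL "unhandled cuda error" is raised while broadcasting state from
# root_rank=0 (byteps/torch/__init__.py around the line linked above).
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)
```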

To Reproduce
Steps to reproduce the behavior:

  1. git clone --recurse-submodules https://github.com/bytedance/byteps
  2. cd byteps && python setup.py install
  3. Run python byteps_launcher.py with the script from https://gist.github.com/anj-s/958a7e444100e762180bf289da8a6cab
  4. You should see an error similar to https://gist.github.com/anj-s/5ff0eafd4309a16fd480cc5662aff448

Expected behavior
Successful run without issues.

Environment (please complete the following information):

  • OS: Ubuntu
  • GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
  • CUDA and NCCL version:
    CUDA: 10.1
    NCCL: 2.4.7
  • Framework (TF, PyTorch, MXNet): PyTorch 1.7

Additional context
Packages installed: https://gist.github.com/anj-s/fe4ca6bc3630aa2e9ca3ba5344d09106
Output of nvidia-smi: https://gist.github.com/anj-s/2e097432a6ec3962655d476f756bcdc4

anj-s commented Apr 2, 2021

Friendly ping about this issue. It would be great if someone could take a look. Thanks!
The run works when I use only one GPU but fails when using two GPUs.

anj-s commented Apr 7, 2021

I was able to get around this error by 1) swapping the order of the broadcast_parameters and broadcast_optimizer_state calls, or 2) running train() once before calling these functions (see the sketch below).
Once I had a successful run, it was not possible to reproduce the error again.
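A rough sketch of the workarounds, assuming the same Horovod-style byteps.torch API as above (the model and optimizer here are stand-ins, not the actual repro script):

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())
model = torch.nn.Linear(10, 10).cuda()  # stand-in for the real model
optimizer = bps.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())

# Workaround 1: broadcast the optimizer state before the parameters
# (swapped relative to the usual parameters-then-optimizer order).
bps.broadcast_optimizer_state(optimizer, root_rank=0)
bps.broadcast_parameters(model.state_dict(), root_rank=0)

# Workaround 2 keeps the usual order but runs one forward/backward/step
# pass (i.e. one call into train()) before the two broadcast calls.
```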

@anj-s anj-s closed this as completed Apr 7, 2021

bobzhuyb commented Apr 7, 2021

@anj-s Thanks a lot for the update! I am sorry that we did not help you earlier.

This error often happens during broadcast, and it is often due to passing in CPU tensors. It sometimes seems to be hardware-platform dependent, i.e., some machines run it fine and others complain about it. We are looking into how we can make it more robust.
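If CPU tensors are indeed the trigger, a minimal defensive check before broadcasting (just a sketch with assumed API names, not official BytePS guidance) could look like:

```python
# Sketch: make sure everything passed to broadcast_parameters lives on the
# local GPU, since CPU tensors in the NCCL broadcast are the suspected cause.
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())
model = torch.nn.Linear(10, 10).cuda()  # stand-in for the real model

for name, tensor in model.state_dict().items():
    assert tensor.is_cuda, f"{name} is still on the CPU"

bps.broadcast_parameters(model.state_dict(), root_rank=0)
```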

anj-s commented Apr 7, 2021

Thanks @bobzhuyb for responding. I've seen this behavior before, and usually priming the NCCL connection with an allreduce op on a single (1, 1) tensor has helped. I'll probably put that in and see if it helps (sketch below).
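A small sketch of that warm-up, assuming byteps.torch exposes push_pull as its allreduce-style primitive (as in its Horovod-like API):

```python
# Tiny warm-up allreduce to prime the NCCL connection before broadcasting;
# push_pull and its keyword argument are assumptions based on the
# Horovod-like byteps.torch API.
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())
model = torch.nn.Linear(10, 10).cuda()  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

warmup = torch.ones(1, 1).cuda()
bps.push_pull(warmup, name="nccl_warmup")

bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)
```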

I do have another issue, if you don't mind taking a look. Thanks!
