
Unable to run training on a single node due to "Check failed: r == ncclSuccess NCCL error: unhandled cuda error" #380

Closed
anj-s opened this issue Mar 31, 2021 · 4 comments

Comments

anj-s commented Mar 31, 2021

Describe the bug
Running single-node training with the PyTorch backend fails with the following error:
"Check failed: r == ncclSuccess NCCL error: unhandled cuda error"

It fails at this point in the code: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L290
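For reference, a minimal sketch of the sequence that reaches that line, assuming the Horovod-style API that byteps.torch exposes (bps.init, DistributedOptimizer, broadcast_parameters, broadcast_optimizer_state; these names are not taken from the repro gist):

```python
# Sketch of the code path that hits the error; API names are assumptions
# based on the Horovod-like byteps.torch interface.
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# The NCCL "unhandled cuda error" is raised while broadcasting state from
# root_rank=0 (byteps/torch/__init__.py around the line linked above).
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)
```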

To Reproduce
Steps to reproduce the behavior:

  1. git clone --recurse-submodules https://github.com/bytedance/byteps
  2. cd byteps && python setup.py install
  3. Run python byteps_launcher.py with the script from https://gist.github.com/anj-s/958a7e444100e762180bf289da8a6cab
  4. You should see an error similar to https://gist.github.com/anj-s/5ff0eafd4309a16fd480cc5662aff448

Expected behavior
Successful run without issues.

Environment (please complete the following information):

  • OS: Ubuntu
  • GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
  • CUDA and NCCL version:
    CUDA: 10.1
    NCCL: 2.4.7
  • Framework (TF, PyTorch, MXNet): PyTorch 1.7

Additional context
Packages installed: https://gist.github.com/anj-s/fe4ca6bc3630aa2e9ca3ba5344d09106
Output of nvidia-smi: https://gist.github.com/anj-s/2e097432a6ec3962655d476f756bcdc4

anj-s commented Apr 2, 2021

Friendly ping about this issue. It would be great if someone could take a look. Thanks!
The run works when I use only one GPU but fails when using two GPUs.

anj-s commented Apr 7, 2021

I was able to get around this error by 1) swapping the order of the broadcast_parameters and broadcast_optimizer_state calls, or 2) running train() once before calling these functions (see the sketch below).
Once I had a successful run, it was not possible to reproduce the error again.
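A rough sketch of the workarounds, assuming the same Horovod-style byteps.torch API as above (the model and optimizer here are stand-ins, not the actual repro script):

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())
model = torch.nn.Linear(10, 10).cuda()  # stand-in for the real model
optimizer = bps.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())

# Workaround 1: broadcast the optimizer state before the parameters
# (swapped relative to the usual parameters-then-optimizer order).
bps.broadcast_optimizer_state(optimizer, root_rank=0)
bps.broadcast_parameters(model.state_dict(), root_rank=0)

# Workaround 2 keeps the usual order but runs one forward/backward/step
# pass (i.e. one call into train()) before the two broadcast calls.
```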

@anj-s anj-s closed this as completed Apr 7, 2021

bobzhuyb commented Apr 7, 2021

@anj-s Thanks a lot for the update! I am sorry that we did not help you earlier.

This error often happens during broadcast, and it is often due to passing in CPU tensors. It sometimes seems to be hardware-platform dependent, i.e., some machines run it fine and others complain about it. We are looking into how we can make it more robust.
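If CPU tensors are indeed the trigger, a minimal defensive check before broadcasting (just a sketch with assumed API names, not official BytePS guidance) could look like:

```python
# Sketch: make sure everything passed to broadcast_parameters lives on the
# local GPU, since CPU tensors in the NCCL broadcast are the suspected cause.
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())
model = torch.nn.Linear(10, 10).cuda()  # stand-in for the real model

for name, tensor in model.state_dict().items():
    assert tensor.is_cuda, f"{name} is still on the CPU"

bps.broadcast_parameters(model.state_dict(), root_rank=0)
```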

anj-s commented Apr 7, 2021

Thanks @bobzhuyb for responding. I've seen this behavior before, and usually priming the NCCL connection with an allreduce op on a single (1, 1) tensor has helped. I'll probably put that in and see if it helps (sketch below).
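A small sketch of that warm-up, assuming byteps.torch exposes push_pull as its allreduce-style primitive (as in its Horovod-like API):

```python
# Tiny warm-up allreduce to prime the NCCL connection before broadcasting;
# push_pull and its keyword argument are assumptions based on the
# Horovod-like byteps.torch API.
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())
model = torch.nn.Linear(10, 10).cuda()  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

warmup = torch.ones(1, 1).cuda()
bps.push_pull(warmup, name="nccl_warmup")

bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)
```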

I do have another issue, if you don't mind taking a look. Thanks!
