Unable to install Pytorch plugin when running python setup.py install #383

anj-s · 2021-04-07T23:51:34Z

Describe the bug
I get the following error when attempting to run python setup.py install.

INFO: Above error indicates that this PyTorch installation does not support CUDA.
building 'byteps.torch.c_lib' extension
creating build/temp.linux-x86_64-3.8/byteps/torch
gcc -pthread -B /private/home/anj/.conda/envs/byteps_env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=0 -DTORCH_VERSION=1007001000 -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_API_INCLUDE_EXTENSION_H=1 -I3rdparty/ps-lite/include -I/public/apps/NCCL/2.7.8-1/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/TH -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/THC -I/private/home/anj/.conda/envs/byteps_env/include/python3.8 -c byteps/common/common.cc -o build/temp.linux-x86_64-3.8/byteps/common/common.o -std=c++14 -fPIC -Ofast -Wall -fopenmp -march=native -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from byteps/common/common.cc:20:
byteps/common/common.h:21:10: fatal error: cuda_runtime.h: No such file or directory
   21 | #include <cuda_runtime.h>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
INFO: Unable to build PyTorch plugin, will skip it.
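
The gcc command in the log above carries -DHAVE_CUDA=0, which is worth noticing when diagnosing this kind of failure. As a minimal sketch (the helper name extract_defines is hypothetical and not part of BytePS), the preprocessor defines can be pulled out of a captured compile command like this:

```python
import shlex


def extract_defines(compile_command):
    """Return {macro: value} for every -D flag in a compiler command line.

    Hypothetical helper for reading build logs; a flag without an explicit
    value (for example -DNDEBUG) is recorded as "1", matching gcc's default.
    """
    defines = {}
    for token in shlex.split(compile_command):
        if token.startswith("-D") and len(token) > 2:
            name, _, value = token[2:].partition("=")
            defines[name] = value or "1"
    return defines


if __name__ == "__main__":
    # Shortened copy of the command from the log above.
    command = (
        "gcc -pthread -DNDEBUG -g -O3 -DHAVE_CUDA=0 "
        "-DTORCH_VERSION=1007001000 -c byteps/common/common.cc"
    )
    print(extract_defines(command))
```

Running this on the full command from the log should show HAVE_CUDA mapped to "0", meaning the build step had already decided that CUDA was unavailable before the missing header was hit.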

The build succeeds if I use a symlink so that CUDA is reachable at /usr/local/cuda. For some reason, pointing BYTEPS_CUDA_HOME at another path does not work. I also did not see build_torch_extension call get_cuda_dirs in setup.py, so how does it know where CUDA is installed?

To Reproduce
Steps to reproduce the behavior:
export BYTEPS_NCCL_HOME=/.../NCCL/2.7.8-1
export BYTEPS_CUDA_HOME=/.../cuda/11.0
git clone --recurse-submodules https://github.com/bytedance/byteps
cd byteps/
python setup.py install

Environment (please complete the following information):
OS: Ubuntu
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
CUDA and NCCL version:
CUDA: 11.0
NCCL: 2.7.8
Framework (TF, PyTorch, MXNet): PyTorch 1.8

bobzhuyb · 2021-04-08T05:45:01Z

Briefly checked the code, @pleasantrabbit I think we should add get_cuda_dirs to build_torch_extension.

pleasantrabbit · 2021-04-08T06:25:00Z

> Briefly checked the code, @pleasantrabbit I think we should add get_cuda_dirs to build_torch_extension.

Indeed. Will update it.
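
For readers following along, here is a rough sketch of the idea being discussed: resolve the CUDA include and library directories from BYTEPS_CUDA_HOME and hand them to the extension build. This is not the actual BytePS code. The function name resolve_cuda_dirs, the /usr/local/cuda fallback, and the include and lib64 subdirectory names are all assumptions made for illustration.

```python
import os


def resolve_cuda_dirs(env=None, default_home="/usr/local/cuda"):
    """Return (include_dirs, library_dirs) for a CUDA installation.

    Hypothetical stand-in for the get_cuda_dirs helper named in the thread.
    Assumes BYTEPS_CUDA_HOME points at the CUDA root and that headers and
    libraries live in its "include" and "lib64" subdirectories.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("BYTEPS_CUDA_HOME") or default_home
    include_dirs = [os.path.join(cuda_home, "include")]
    library_dirs = [os.path.join(cuda_home, "lib64")]
    return include_dirs, library_dirs


if __name__ == "__main__":
    include_dirs, library_dirs = resolve_cuda_dirs()
    # In a setup.py these lists would be appended to the include_dirs and
    # library_dirs arguments of the extension being built.
    print("include_dirs:", include_dirs)
    print("library_dirs:", library_dirs)
```

With the include directory on the compiler's search path, the cuda_runtime.h lookup that failed in the original log would have a chance to succeed for a non-default CUDA location.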

anj-s · 2021-04-11T16:39:13Z

I updated setup.py as shown in the PR, but I am now running into the following error:

2021-04-11 08:39:27.543284: D byteps/common/global.cc:320] Shutdown BytePS: start to clean the resources (rank=1)
Traceback (most recent call last):
  File "byteps/example/pytorch/train_mnist_byteps.py", line 108, in <module>
    bps.broadcast_parameters(model.state_dict(), root_rank=0)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/__init__.py", line 287, in broadcast_parameters
    handle = byteps_push_pull(p, average=False, name=prefix+name)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 174, in push_pull_async_inplace
    return _do_push_pull_async(tensor, tensor, average, name, version, priority)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 71, in _do_push_pull_async
    function = _check_function(_push_pull_function_factory, tensor)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 57, in _check_function
    raise ValueError('Tensor type %s is not supported.' % tensor.type())
ValueError: Tensor type torch.cuda.FloatTensor is not supported.

This goes away if I use /usr/local/cuda. Is there something else I am missing?

anj-s · 2021-04-30T04:27:01Z

Finally figured this out: you need to add the path that you set in BYTEPS_CUDA_HOME to your $PATH environment variable, in addition to the PR changes above.
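
A small sketch of that workaround, for anyone who wants to script it instead of editing their shell profile. The helper name with_cuda_on_path is hypothetical, and prepending (not appending) the directory is an assumption; the comment above only says the path has to be added to $PATH.

```python
import os


def with_cuda_on_path(env):
    """Return a copy of env whose PATH includes the BYTEPS_CUDA_HOME value.

    Hypothetical helper: the directory is prepended only if it is set and
    not already present, so calling it twice does not duplicate entries.
    """
    new_env = dict(env)
    cuda_home = env.get("BYTEPS_CUDA_HOME")
    if not cuda_home:
        return new_env
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if cuda_home not in entries:
        entries.insert(0, cuda_home)
    new_env["PATH"] = os.pathsep.join(entries)
    return new_env


if __name__ == "__main__":
    env = with_cuda_on_path(os.environ)
    # The result can be passed as env= to subprocess.run when launching
    # "python setup.py install" or the training script.
    print(env.get("PATH", ""))
```

The function leaves the environment unchanged when BYTEPS_CUDA_HOME is unset, so it is safe to call unconditionally from a launcher script.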

anj-s mentioned this issue Apr 7, 2021

Unable to run training on a single node due to " Check failed: r == ncclSuccess NCCL error: unhandled cuda error" #380

Closed

anj-s closed this as completed Apr 30, 2021
