Unable to install Pytorch plugin when running python setup.py install #383

Closed
anj-s opened this issue Apr 7, 2021 · 4 comments

@anj-s

anj-s commented Apr 7, 2021

Describe the bug
I get the following error when attempting to run python setup.py install.

INFO: Above error indicates that this PyTorch installation does not support CUDA.
building 'byteps.torch.c_lib' extension
creating build/temp.linux-x86_64-3.8/byteps/torch
gcc -pthread -B /private/home/anj/.conda/envs/byteps_env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=0 -DTORCH_VERSION=1007001000 -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_API_INCLUDE_EXTENSION_H=1 -I3rdparty/ps-lite/include -I/public/apps/NCCL/2.7.8-1/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/TH -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/THC -I/private/home/anj/.conda/envs/byteps_env/include/python3.8 -c byteps/common/common.cc -o build/temp.linux-x86_64-3.8/byteps/common/common.o -std=c++14 -fPIC -Ofast -Wall -fopenmp -march=native -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from byteps/common/common.cc:20:
byteps/common/common.h:21:10: fatal error: cuda_runtime.h: No such file or directory
   21 | #include <cuda_runtime.h>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
INFO: Unable to build PyTorch plugin, will skip it.

The build works if I instead create a symlink at /usr/local/cuda that points to my CUDA installation. Setting any other path through the environment variables does not work. I also did not see build_torch_extension calling get_cuda_dirs in setup.py. How does it know which path CUDA is installed at?

To Reproduce
Steps to reproduce the behavior:
export BYTEPS_NCCL_HOME=/.../NCCL/2.7.8-1
export BYTEPS_CUDA_HOME=/.../cuda/11.0
git clone --recurse-submodules https://github.com/bytedance/byteps
cd byteps/
python setup.py install

Environment (please complete the following information):
OS: Ubuntu
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
CUDA and NCCL version:
CUDA: 11.0
NCCL: 2.7.8
Framework (TF, PyTorch, MXNet): PyTorch 1.8

@bobzhuyb
Member

bobzhuyb commented Apr 8, 2021

Briefly checked the code, @pleasantrabbit I think we should add get_cuda_dirs to build_torch_extension.

@pleasantrabbit
Collaborator

> Briefly checked the code, @pleasantrabbit I think we should add get_cuda_dirs to build_torch_extension.

Indeed. Will update it.
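
For readers following along, here is a minimal sketch of the kind of change being discussed: resolving the CUDA include and library directories from BYTEPS_CUDA_HOME and passing them to the extension build. The helper names `resolve_cuda_dirs` and `torch_extension_kwargs`, the fallback to /usr/local/cuda, and the `lib`/`lib64` layout are assumptions made for illustration. They are not the actual contents of BytePS's setup.py or its get_cuda_dirs function.

```python
import os


def resolve_cuda_dirs(env=None, default_home="/usr/local/cuda"):
    """Return (include_dirs, library_dirs) for the CUDA toolkit.

    Hypothetical helper: prefers BYTEPS_CUDA_HOME and falls back to
    `default_home` when the variable is unset or empty.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("BYTEPS_CUDA_HOME") or default_home
    include_dirs = [os.path.join(cuda_home, "include")]
    # Assumed layout: toolkits commonly ship libraries under lib or lib64.
    library_dirs = [os.path.join(cuda_home, "lib"),
                    os.path.join(cuda_home, "lib64")]
    return include_dirs, library_dirs


def torch_extension_kwargs(base_include_dirs, env=None):
    """Merge the CUDA directories into keyword arguments for an Extension."""
    cuda_include_dirs, cuda_library_dirs = resolve_cuda_dirs(env)
    return {
        "include_dirs": list(base_include_dirs) + cuda_include_dirs,
        "library_dirs": cuda_library_dirs,
    }
```

With the include directory on the compiler command line (as a `-I` flag), a header such as cuda_runtime.h can be found without relying on a symlink at /usr/local/cuda.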

@anj-s
Author

anj-s commented Apr 11, 2021

I updated setup.py as seen in the PR, but I am now running into the following error:

2021-04-11 08:39:27.543284: D byteps/common/global.cc:320] Shutdown BytePS: start to clean the resources (rank=1)
Traceback (most recent call last):
  File "byteps/example/pytorch/train_mnist_byteps.py", line 108, in <module>
    bps.broadcast_parameters(model.state_dict(), root_rank=0)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/__init__.py", line 287, in broadcast_parameters
    handle = byteps_push_pull(p, average=False, name=prefix+name)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 174, in push_pull_async_inplace
    return _do_push_pull_async(tensor, tensor, average, name, version, priority)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 71, in _do_push_pull_async
    function = _check_function(_push_pull_function_factory, tensor)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 57, in _check_function
    raise ValueError('Tensor type %s is not supported.' % tensor.type())
ValueError: Tensor type torch.cuda.FloatTensor is not supported.

This goes away if I use /usr/local/cuda. Is there something else I am missing?

@anj-s
Author

anj-s commented Apr 30, 2021

Finally figured this out: in addition to the PR changes above, you need to add the path that you set in BYTEPS_CUDA_HOME to your $PATH environment variable.
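
A small sketch of that PATH adjustment, written in Python so it can be tried before invoking the build. It assumes the toolkit's `bin` subdirectory (where `nvcc` normally lives) is the entry that needs to be on PATH. The comment above does not say which subdirectory was added, and `prepend_cuda_bin` is a hypothetical helper name. One plausible reason this matters is that build tooling may locate CUDA by searching PATH for `nvcc`.

```python
import os


def prepend_cuda_bin(cuda_home, path, sep=os.pathsep):
    """Return `path` with `<cuda_home>/bin` placed first, without duplicates."""
    cuda_bin = os.path.join(cuda_home, "bin")
    # Drop empty entries and any existing copy of cuda_bin before prepending.
    entries = [p for p in path.split(sep) if p and p != cuda_bin]
    return sep.join([cuda_bin] + entries)


if __name__ == "__main__":
    cuda_home = os.environ.get("BYTEPS_CUDA_HOME")
    if cuda_home:
        # Only affects this process and the children it launches.
        os.environ["PATH"] = prepend_cuda_bin(cuda_home, os.environ.get("PATH", ""))
        print(os.environ["PATH"])
```

The change applies only to the current process and its children, so the build would have to be launched from the same process, or the equivalent export added to the shell profile.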

anj-s closed this as completed Apr 30, 2021