Out of memory errors #2

Open · filipeabperes opened this issue Jun 16, 2020 · 2 comments

Comments


filipeabperes commented Jun 16, 2020

Hi, thanks for making your work openly available!

I have been running into some issues when following the instructions on your readme.

If I try to run the training script on one GPU (with approximately 40 GB of memory available) using `CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data/grandcentral/`, I get an out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 47.46 GiB total capacity; 33.19 GiB already allocated; 32.00 MiB free; 33.38 GiB reserved in total by PyTorch)
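
For anyone diagnosing the same failure: the message shows PyTorch has already reserved roughly 33 GiB with only 32 MiB left free, so the model plus one batch simply does not fit on a single card. A minimal diagnostic sketch, assuming PyTorch >= 1.4 (`report_gpu_memory` is a hypothetical helper, not part of this repo), that can be called just before the failing step:

```python
import torch

def report_gpu_memory(device: int = 0) -> None:
    # Hypothetical helper: dump the caching allocator's statistics to see
    # how much of the usage is live tensors vs. reserved (cached) memory.
    print(torch.cuda.memory_summary(device=device, abbreviated=True))
    gib = 2 ** 30
    print(f"allocated: {torch.cuda.memory_allocated(device) / gib:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(device) / gib:.2f} GiB")
```

If the numbers confirm a genuine capacity problem, the usual mitigation is shrinking the per-GPU batch size or input resolution, assuming the training script exposes such settings.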

Similarly, if I run on multiple GPUs (10 GPUs with approximately 40 GB of memory each) using the same command as before (but with more GPUs specified), I get:

RuntimeError: cuda runtime error (711) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/THC/THCGeneral.cpp:136
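
Error 711 (`cudaErrorTooManyPeers`, "peer mapping resources exhausted") usually reflects a hard CUDA limit rather than a bug in this repo: a device can hold peer-to-peer mappings to at most 8 other GPUs, and `nn.DataParallel` exceeds that when replicating across 10 devices. A quick sketch, using only standard `torch.cuda` calls, to see which device pairs are P2P-capable:

```python
import torch

# List which GPU pairs are P2P-capable. Note that capability does not
# guarantee the mapping can be enabled: CUDA allows at most 8 peer
# connections per device, which is why 10 GPUs fail with error 711.
n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i} can peer with GPUs {peers}")
```

Capping `CUDA_VISIBLE_DEVICES` at 8 or fewer GPUs stays under that limit, which matches the 4-GPU success reported later in this thread.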

Finally, if I run the same command as before but with the `--nocuda` flag, in order to run on the CPU, I get the following error:

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
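
This last traceback is the classic symptom of wrapping a CPU-resident model in `nn.DataParallel`, which requires the parameters to live on `device_ids[0]`. A minimal sketch of the usual guard, assuming `train.py` wraps the model unconditionally (`wrap_model` is a hypothetical helper; the actual script may be structured differently):

```python
import torch
import torch.nn as nn

def wrap_model(model: nn.Module, nocuda: bool) -> nn.Module:
    # Only wrap in DataParallel when CUDA is in use, since DataParallel
    # asserts that parameters and buffers live on device_ids[0].
    if nocuda or not torch.cuda.is_available():
        return model.cpu()              # plain module; no DataParallel on CPU
    model = model.cuda(0)               # DataParallel expects cuda:0 by default
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # replicate across all visible GPUs
    return model
```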

Any idea how to fix these issues?

Thanks

JindongJiang (Owner) commented

Hi, can you try setting `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'` and show me the log? Also, can you try using 4 GPUs, each with 40 GB?
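
For reference, `CUDA_LAUNCH_BLOCKING` makes kernel launches synchronous so the Python traceback points at the actual failing call; it is read when the CUDA context is created, so it must be set before the first CUDA operation. A minimal sketch:

```python
import os
# Must be set before the first CUDA call (safest: before importing torch),
# otherwise the already-created CUDA context ignores it.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # subsequent kernel launches now run synchronously
```

Setting it in the shell instead (`CUDA_LAUNCH_BLOCKING=1 python train.py ...`) has the same effect.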


filipeabperes commented Jun 22, 2020

If I try running with `CUDA_LAUNCH_BLOCKING` set, it just blocks.

However, I tried running with 4 GPUs as you recommended (without the blocking flag), and it works. I had only tried with 1 and 10 GPUs previously, and those don't work. Thanks for the tip!
