Out of memory errors #2

Open · filipeabperes opened this issue Jun 16, 2020 · 2 comments

Comments


filipeabperes commented Jun 16, 2020

Hi, thanks for making your work openly available!

I have been running into some issues when following the instructions on your readme.

If I try to run the training script on one GPU (with approximately 40 GB of memory available) using `CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data/grandcentral/`, I get an out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 47.46 GiB total capacity; 33.19 GiB already allocated; 32.00 MiB free; 33.38 GiB reserved in total by PyTorch)
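
For anyone diagnosing the same failure: the message shows PyTorch has already reserved roughly 33 GiB with only 32 MiB left free, so the model plus one batch simply does not fit on a single card. A minimal diagnostic sketch, assuming PyTorch >= 1.4 (`report_gpu_memory` is a hypothetical helper, not part of this repo), that can be called just before the failing step:

```python
import torch

def report_gpu_memory(device: int = 0) -> None:
    # Hypothetical helper: dump the caching allocator's statistics to see
    # how much of the usage is live tensors vs. reserved (cached) memory.
    print(torch.cuda.memory_summary(device=device, abbreviated=True))
    gib = 2 ** 30
    print(f"allocated: {torch.cuda.memory_allocated(device) / gib:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(device) / gib:.2f} GiB")
```

If the numbers confirm a genuine capacity problem, the usual mitigation is shrinking the per-GPU batch size or input resolution, assuming the training script exposes such settings.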

Similarly, if I run on multiple GPUs (10 GPUs with approximately 40 GB of memory each) using the same command as before (but with more GPUs specified), I get:

RuntimeError: cuda runtime error (711) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/THC/THCGeneral.cpp:136
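
Error 711 (`cudaErrorTooManyPeers`, "peer mapping resources exhausted") usually reflects a hard CUDA limit rather than a bug in this repo: a device can hold peer-to-peer mappings to at most 8 other GPUs, and `nn.DataParallel` exceeds that when replicating across 10 devices. A quick sketch, using only standard `torch.cuda` calls, to see which device pairs are P2P-capable:

```python
import torch

# List which GPU pairs are P2P-capable. Note that capability does not
# guarantee the mapping can be enabled: CUDA allows at most 8 peer
# connections per device, which is why 10 GPUs fail with error 711.
n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i} can peer with GPUs {peers}")
```

Capping `CUDA_VISIBLE_DEVICES` at 8 or fewer GPUs stays under that limit, which matches the 4-GPU success reported later in this thread.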

Finally, if I run the same command as before but with the `--nocuda` flag, in order to run on the CPU, I get the following error:

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
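
This last traceback is the classic symptom of wrapping a CPU-resident model in `nn.DataParallel`, which requires the parameters to live on `device_ids[0]`. A minimal sketch of the usual guard, assuming `train.py` wraps the model unconditionally (`wrap_model` is a hypothetical helper; the actual script may be structured differently):

```python
import torch
import torch.nn as nn

def wrap_model(model: nn.Module, nocuda: bool) -> nn.Module:
    # Only wrap in DataParallel when CUDA is in use, since DataParallel
    # asserts that parameters and buffers live on device_ids[0].
    if nocuda or not torch.cuda.is_available():
        return model.cpu()              # plain module; no DataParallel on CPU
    model = model.cuda(0)               # DataParallel expects cuda:0 by default
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # replicate across all visible GPUs
    return model
```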

Any idea how to fix these issues?

Thanks

JindongJiang (Owner) commented

Hi, can you try setting `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'` and show me the log? Also, can you try using 4 GPUs, each with 40 GB?
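
For reference, `CUDA_LAUNCH_BLOCKING` makes kernel launches synchronous so the Python traceback points at the actual failing call; it is read when the CUDA context is created, so it must be set before the first CUDA operation. A minimal sketch:

```python
import os
# Must be set before the first CUDA call (safest: before importing torch),
# otherwise the already-created CUDA context ignores it.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # subsequent kernel launches now run synchronously
```

Setting it in the shell instead (`CUDA_LAUNCH_BLOCKING=1 python train.py ...`) has the same effect.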


filipeabperes commented Jun 22, 2020

If I try running with `CUDA_LAUNCH_BLOCKING` set, it just blocks.

However, I tried running with 4 GPUs as you recommended (without the blocking flag), and it works. I had only tried with 1 and 10 GPUs previously, and those don't work. Thanks for the tip!
