Hi, thanks for making your work openly available!
I have been running into some issues when following the instructions in your README.
If I try to run the training script on a single GPU (with approx. 40 GB of memory available) with
CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data/grandcentral/
I get an out-of-memory error:
RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 47.46 GiB total capacity; 33.19 GiB already allocated; 32.00 MiB free; 33.38 GiB reserved in total by PyTorch)
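A common mitigation for this kind of single-GPU OOM is to shrink the per-step batch and compensate with gradient accumulation. The sketch below is generic PyTorch rather than this repo's code; the model, sizes, and optimizer are stand-ins, since train.py's internals aren't shown here:

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(128, 10).to(device)              # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

full_batch, accum_steps = 64, 4                    # process 16 samples per step
micro_batch = full_batch // accum_steps

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(micro_batch, 128, device=device)         # dummy inputs
    y = torch.randint(0, 10, (micro_batch,), device=device)  # dummy labels
    loss = criterion(model(x), y) / accum_steps    # scale so the accumulated
    loss.backward()                                # gradient matches one full batch
optimizer.step()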
Similarly, if I run on multiple GPUs (10 GPUs with approx. 40 GB of memory each) using the same command as before, just with more GPUs listed in CUDA_VISIBLE_DEVICES, I get
RuntimeError: cuda runtime error (711) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/THC/THCGeneral.cpp:136
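For context, runtime error 711 points at CUDA's peer-to-peer mapping limit: a GPU can typically map at most 8 peers, so a 10-GPU DataParallel run can exhaust the pool. Assuming train.py simply uses every visible device, a workaround is to expose 8 or fewer GPUs, e.g.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --data-dir data/grandcentral/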
Finally, if I run the same command as before but with the --nocuda flag, in order to run on the CPU, I get the following error
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
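That last error is the usual symptom of a model still wrapped in torch.nn.DataParallel after being moved to the CPU. A minimal sketch of the usual guard, with stand-in names since the script's model-building code isn't shown here:

import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # stand-in for the real model
nocuda = True                # stand-in for the parsed --nocuda flag

if nocuda or not torch.cuda.is_available():
    model = model.to("cpu")                      # plain CPU module, no wrapper
else:
    model = nn.DataParallel(model).to("cuda:0")  # wrap only when GPUs are in use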
Any idea on how to fix these issues?
Thanks
If I try running with CUDA_LAUNCH_BLOCKING set, it just blocks.
However, running with 4 GPUs as you recommended (without the blocking flag) works. I had only tried 1 and 10 GPUs previously, and those don't work. Thanks for the tip!
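For anyone landing here later, the working run used 4 GPUs; the exact device indices aren't given in the thread, but the invocation would look something like

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --data-dir data/grandcentral/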