cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #26
Comments
@PeterAJansen can you try a smaller batch size? Something less than 8? |
@MohitShridhar I forgot to mention this too -- smaller batch sizes produced the same error. The Titan RTX has 24 GB of memory, which should be plenty for moderate batch sizes. |
Ah I see. Have you seen this? This error is being thrown by the PyTorch RNN module, so I am not sure what's happening here. It seems like you need to build PyTorch with the right CUDA version? |
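One way to check the suggestion above is to print which CUDA and cuDNN versions the installed PyTorch binary was built against, alongside the GPUs it can see. The following is a minimal sketch, not part of the ALFRED repo: `collect_env` is a hypothetical helper name, and the code assumes only that `torch` may or may not be importable.

```python
def collect_env():
    """Gather version info relevant to a cuDNN failure.

    Hypothetical helper: safe to call without torch installed or without a GPU.
    """
    info = {"torch": None, "built_with_cuda": None, "cudnn": None, "gpus": []}
    try:
        import torch
    except ImportError:
        return info  # torch is not installed in this environment

    info["torch"] = torch.__version__
    info["built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    try:
        info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    except RuntimeError as err:
        info["cudnn"] = f"error: {err}"

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            capability = torch.cuda.get_device_capability(i)  # e.g. (7, 5)
            info["gpus"].append((name, capability))
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is older than the toolkit that first supported the card, a version mismatch is a plausible (though not confirmed) cause of the error.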
@PeterAJansen did you make any progress on this? I just purchased an RTX 2080S, performed a fresh install of Ubuntu 18.04, downloaded the recommended PyTorch version (1.5.1), and my CUDA version is 10.2. Despite all this effort, I still get the same error as you. |
Unfortunately no luck on my end, I was never able to get this running. If you do figure it out, please post the solution to this thread -- I'd be eager to give it a try. |
Sorry, I wish I could help, but I don't have a RTX 2080S to debug this. |
No worries! I think it might be an OOM issue. I ran it a couple of times on my 8 GB GPU and saw that the training program used nearly all 8 GB. Then, after rerunning the training with absolutely nothing changed, it was able to run (and it has been running for at least 11 hours). I'm betting I just got lucky, and I'll be searching for cloud compute resources for the future. |
@SouLeo I'm working with a Titan RTX with 24gb of memory, and was getting the error even with batch sizes of 1, so I don't think it was an out-of-memory issue in my case -- in case that helps you figure out what the issue ultimately was. |
Potential fix: I was running into the same issue. Ubuntu 18.04, CUDA 10.2, Titan RTX 24GB. I followed the quick install instructions. The error happened almost immediately. Smaller batch sizes didn't help. Running without --gpu worked.
I uninstalled the versions of torch and torchvision specified in |
Well... without |
Sorry if I wasn't clear. I mentioned that it works on the CPU to point out that this is a CUDA/GPU issue. I fixed my issue by upgrading torch to the latest version instead of the version specified by requirements.txt. I want to know if there is another reason requirements.txt uses torch 1.1.0, and whether anything will break if I use torch version 1.6.0. |
Yeah, I figure there might be some API updates in torch 1.6.0 that might break the code. Especially with GPU training. |
Getting the same error with the Docker image on RTX 2080. Could be that this card is not supported by torch==1.1.0? |
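The question above can be partly checked without a GPU: each NVIDIA architecture needs a minimum CUDA toolkit, and Turing cards such as the RTX 2080 and Titan RTX (compute capability 7.5) need CUDA 10.0 or newer. Below is a minimal sketch under stated assumptions: `MIN_CUDA` and `toolkit_supports` are hypothetical names, the table covers only a few architectures, and the toolkit version to pass in is whatever `torch.version.cuda` reports for the installed wheel.

```python
# Minimum CUDA toolkit that first supported each compute capability.
# Illustrative subset only; extend as needed.
MIN_CUDA = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing (RTX 20xx, Titan RTX)
    (8, 0): (11, 0),  # Ampere
}


def toolkit_supports(cuda_version, capability):
    """Return True if a CUDA toolkit version natively supports a compute capability.

    cuda_version: string such as "10.2" (e.g. the value of torch.version.cuda).
    capability:   (major, minor) tuple such as (7, 5).
    """
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    required = MIN_CUDA.get(tuple(capability))
    if required is None:
        raise KeyError(f"compute capability {capability} is not in the table")
    return (major, minor) >= required


if __name__ == "__main__":
    for version in ("9.0", "10.0", "10.2"):
        print(version, toolkit_supports(version, (7, 5)))
```

A `False` result only suggests that the build predates the card; it does not by itself prove that this is what triggers `CUDNN_STATUS_EXECUTION_FAILED`.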
@dnandha the seq2seq baselines are a bit outdated now. Check out the SoTA models that use newer torch versions: https://github.com/askforalfred/alfred#sota-models |
Hi,
I'm seeing the same error as another person posted --
(alfred_env) (base) peter@neutronium:~/github/alfred$ python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1
Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533, 'tests_unseen': 1529, 'train': 21023, 'valid_seen': 820, 'valid_unseen': 821}
Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I have verified that I've followed the installation instructions, and that the correct versions of torch (1.1.0), torchvision (0.3.0 in requirements.txt; the prose says 1.3.0 but the latest version is 0.6.0), AI2THOR (2.1.0), and tensorboardX (1.8) have been installed.
I'm using a Titan RTX and CUDA 10.1 on KUbuntu 18.04.
Model seems to start training without the --gpu option, but it appears slow (so I didn't wait to see how long it would take).
thanks!
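Since the traceback fails inside `flatten_parameters()` while the model is being moved to the GPU, before any ALFRED data is touched, the problem can be isolated with a tiny stand-alone RNN. This is a minimal sketch, assuming nothing about the ALFRED model itself: `try_lstm_on_gpu` is a hypothetical name and the layer sizes are arbitrary.

```python
def try_lstm_on_gpu():
    """Move a small LSTM to the GPU and run one forward pass.

    Returns "no-torch", "no-cuda", "ok", or "failed: <message>".
    """
    try:
        import torch
    except ImportError:
        return "no-torch"
    if not torch.cuda.is_available():
        return "no-cuda"
    try:
        # Arbitrary sizes; batch_first and bidirectional mirror the arguments
        # visible in the traceback's flatten_parameters() call.
        lstm = torch.nn.LSTM(input_size=8, hidden_size=16,
                             batch_first=True, bidirectional=True)
        lstm = lstm.to(torch.device("cuda"))  # same call that fails in the traceback
        x = torch.randn(2, 4, 8, device="cuda")  # (batch, seq_len, features)
        lstm(x)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return "ok"
    except RuntimeError as err:
        return f"failed: {err}"


if __name__ == "__main__":
    print(try_lstm_on_gpu())
```

If this snippet fails with the same cuDNN error, the cause most likely lies in the torch/CUDA/driver combination rather than in the training script.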