cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #26

PeterAJansen · 2020-05-20T07:00:03Z

Hi,

I'm seeing the same error as another person posted --

(alfred_env) (base) peter@neutronium:~/github/alfred$ python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1
Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533, 'tests_unseen': 1529, 'train': 21023, 'valid_seen': 820, 'valid_unseen': 821}
Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I have verified that I followed the installation instructions and that the correct versions of torch (1.1.0), torchvision (0.3.0 in requirements.txt; the prose says 1.3.0, but the latest version is 0.6.0), AI2THOR (2.1.0), and tensorboardX (1.8) are installed.

I'm using a Titan RTX and CUDA 10.1 on Kubuntu 18.04.

Model seems to start training without the --gpu option, but it appears slow (so I didn't wait to see how long it would take).

thanks!

MohitShridhar · 2020-05-21T02:42:41Z

@PeterAJansen can you try a smaller batch size? Something less than 8?

PeterAJansen · 2020-05-21T06:02:30Z

@MohitShridhar I forgot to mention this too -- smaller batch sizes produced the same error. The Titan RTX has 24 GB of memory, which should be plenty for moderate batch sizes.

MohitShridhar · 2020-05-21T18:18:21Z

Ah I see. Have you seen this? This error is being thrown by the PyTorch RNN module, so I am not sure what's happening here.

It seems like you need to build PyTorch with the right CUDA version?
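One way to check the CUDA-version question is to print what the installed PyTorch build was compiled against next to what the GPU reports. The following is a minimal sketch, not something from the ALFRED repo; `describe_torch_env` is a hypothetical helper name, and it falls back gracefully when torch or a GPU is missing.

```python
def describe_torch_env():
    """Collect the version details that are usually relevant to a cuDNN failure.

    Hypothetical helper for illustration; not part of the ALFRED codebase.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA toolkit the wheel was built with (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # cuDNN version bundled with the build (None if unavailable).
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of the first visible GPU.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


for key, value in describe_torch_env().items():
    print(f"{key}: {value}")
```

Note that `built_with_cuda` describes the PyTorch binary, which can differ from the system-wide CUDA toolkit reported by `nvcc --version`.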

SouLeo · 2020-07-20T19:01:43Z

@PeterAJansen did you make any progress on this? I just purchased an RTX 2080S, performed a fresh install of Ubuntu 18.04, downloaded the recommended pytorch version (1.5.1), and my CUDA version is 10.2. Despite all this effort, I still get the same error as you.

PeterAJansen · 2020-07-20T19:20:49Z

Unfortunately no luck on my end, I was never able to get this running. If you do figure it out, please post the solution to this thread -- I'd be eager to give it a try.

MohitShridhar · 2020-07-21T04:29:55Z

Sorry, I wish I could help, but I don't have a RTX 2080S to debug this.

SouLeo · 2020-07-21T04:39:42Z

No worries! I think it might be an OOM issue. I ran it a couple of times on my 8 GB GPU and saw that the training program used nearly all 8 GB.

Then, after rerunning the training without changing anything in the training program, it was able to run (and it has been running for at least 11 hours).

I’m betting I just got lucky, and I’ll be searching for cloud compute resources for the future.
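To test the out-of-memory theory rather than guess, the memory PyTorch has allocated can be compared with the card's total. This is a rough sketch under the assumption that it is called from inside the training process; `to_gib` and `gpu_memory_report` are hypothetical helper names.

```python
def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3


def gpu_memory_report(device_index=0):
    """Return total, currently allocated, and peak allocated GPU memory in GiB.

    Returns None when torch or a CUDA device is not available.
    Hypothetical helper for illustration only.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    total = torch.cuda.get_device_properties(device_index).total_memory
    return {
        "total_gib": to_gib(total),
        # Memory held by tensors in this process right now.
        "allocated_gib": to_gib(torch.cuda.memory_allocated(device_index)),
        # High-water mark of tensor memory in this process.
        "peak_gib": to_gib(torch.cuda.max_memory_allocated(device_index)),
    }


print(gpu_memory_report())
```

These counters only cover tensors allocated by the current process, so memory taken by other programs on the same GPU would need a separate check such as `nvidia-smi`.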

PeterAJansen · 2020-07-21T06:55:57Z

@SouLeo I'm working with a Titan RTX with 24 GB of memory, and was getting the error even with batch sizes of 1, so I don't think it was an out-of-memory issue in my case -- in case that helps you figure out what the issue ultimately was.

kolbytn · 2020-10-07T23:33:19Z

Potential Fix

I was running into the same issue. Ubuntu 18.04, CUDA 10.2, Titan RTX 24GB. I followed the quick install instructions. The error happened almost immediately. Smaller batch sizes didn't help. Running without --gpu worked.
Command:
CUDA_VISIBLE_DEVICES=1 python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 2 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1 --preprocess
Output:

Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I uninstalled the versions of torch and torchvision specified in requirements.txt and instead installed latest versions. Everything seems to be working fine now. Is this a legitimate fix or will I run into issues using the latest pytorch with other parts of the repo?
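After swapping torch versions, a quick check that does not need the ALFRED data is to move a tiny LSTM to the GPU, since the traceback fails inside `flatten_parameters` during `model.to(torch.device('cuda'))`. This is a minimal sketch, assuming that any small `nn.LSTM` exercises the same code path; `lstm_moves_to_gpu` is a hypothetical name and the layer sizes are arbitrary.

```python
def lstm_moves_to_gpu():
    """Try the call path from the traceback on a tiny LSTM.

    Returns True on success, False if a RuntimeError (for example a cuDNN
    error) is raised, and None if torch or CUDA is unavailable.
    Hypothetical helper for illustration only.
    """
    try:
        import torch
        from torch import nn
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    try:
        # .to(cuda) on an RNN module triggers flatten_parameters().
        lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
        lstm = lstm.to(torch.device("cuda"))
        # Run one forward pass so cuDNN kernels actually execute.
        inputs = torch.randn(2, 4, 8, device="cuda")
        outputs, _ = lstm(inputs)
        torch.cuda.synchronize()
        return tuple(outputs.shape) == (2, 4, 16)
    except RuntimeError as err:
        print(f"LSTM on GPU failed: {err}")
        return False


print(lstm_moves_to_gpu())
```

If this fails in a fresh environment, the problem is likely in the torch/CUDA/driver combination rather than in the training script.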

MohitShridhar · 2020-10-08T01:35:58Z

Well... without --gpu you are training on CPU, which would be very slow.

kolbytn · 2020-10-08T01:45:34Z

Sorry if I wasn't clear. I was stating that it does work while running on the cpu to point out that it is a cuda/gpu issue.

I fixed my issue by upgrading torch to the latest version instead of the version specified by requirements.txt. I want to know if there is another reason requirements.txt uses torch 1.1.0 and if anything will break if I use torch version 1.6.0.

MohitShridhar · 2020-10-08T02:41:21Z

Yeah, I figure there might be some API updates in torch 1.6.0 that might break the code. Especially with GPU training.

dnandha · 2021-08-11T00:16:05Z

Getting the same error with the Docker image on RTX 2080. Could be that this card is not supported by torch==1.1.0?
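For context on the card-support question: the RTX 2080 and Titan RTX are Turing GPUs with compute capability 7.5, and CUDA 10.0 was the first toolkit release able to target that architecture. The sketch below checks whether a given CUDA version is new enough for a given capability. The lookup table and the names `parse_cuda_version` and `toolkit_can_target` are illustrative assumptions, and passing this check is necessary but not sufficient, because a particular binary may still ship without kernels for the card.

```python
# Earliest CUDA toolkit release that can compile for each compute capability.
# Small illustrative table, not an exhaustive list.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing (RTX 20xx, Titan RTX)
    (8, 0): (11, 0),  # Ampere (A100)
}


def parse_cuda_version(text):
    """Turn a version string such as '10.1.243' into (10, 1)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def toolkit_can_target(capability, cuda_version):
    """Return True/False if the toolkit version can target the capability,
    or None if the capability is not in the table above."""
    minimum = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if minimum is None:
        return None
    return parse_cuda_version(cuda_version) >= minimum


print(toolkit_can_target((7, 5), "9.0"))
print(toolkit_can_target((7, 5), "10.1"))
```

On a machine with a GPU, the inputs could come from `torch.cuda.get_device_capability(0)` and `torch.version.cuda`.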

MohitShridhar · 2021-08-11T03:32:52Z

@dnandha the seq2seq baselines are a bit outdated now. Check out the SoTA models that use newer torch versions: https://github.com/askforalfred/alfred#sota-models

MohitShridhar mentioned this issue Oct 5, 2020

Torch/torchvision incompatibility in docker when running pretrained model #50

Closed

MohitShridhar added the pytorch issue label Oct 26, 2020
