Happy Chinese New Year!
I tried to train this model on VG. I followed the README to get started and ran into some problems with mixed precision, so I used float32 instead. When training reached iteration 4812 with a batch size of 12, the error below occurred. The full output is as follows:
Traceback (most recent call last):
File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 3215, in <module>
File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 3208, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 2282, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 2289, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydev_imps/_pydev_execfile.py", line 25, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "tools/relation_train_net.py", line 383, in <module>
main()
File "tools/relation_train_net.py", line 376, in main
model = train(cfg, args.local_rank, args.distributed, logger)
File "tools/relation_train_net.py", line 164, in train
scaled_losses.backward()
File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
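The last line of the message suggests CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so that the traceback points at the call that actually failed. A minimal sketch of one way to set it from Python, assuming it runs at the very top of the training script before anything initializes CUDA (the helper name enable_blocking_cuda_launches is hypothetical and not part of this repository):

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Hypothetical helper: request synchronous CUDA kernel launches.

    Assumption: this is called before the process initializes CUDA
    (for example, before the first tensor is moved to the GPU);
    setting the variable later is not expected to have any effect.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Call this first, then import torch and start training as usual.
enable_blocking_cuda_launches()
```

Prefixing the usual launch command with CUDA_LAUNCH_BLOCKING=1 in the shell should have the same effect. Training will run slower with this setting, so it is only meant for reproducing the failure.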
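I noticed the NOTE at line 161 of relation_train_net.py, so I commented out the scaled_losses.backward() call and used losses.backward() instead. That did not work either, and the error changed to: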
Traceback (most recent call last):
File "tools/relation_train_net.py", line 384, in <module>
main()
File "tools/relation_train_net.py", line 377, in main
model = train(cfg, args.local_rank, args.distributed, logger)
File "tools/relation_train_net.py", line 165, in train
losses.backward()
File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Has anyone run into this issue before?