
Question about gradients during training #27

Open
genzhengmiaohong opened this issue Feb 27, 2024 · 4 comments

Comments

@genzhengmiaohong

Hello, when I modified train.py to train the network, the final gradient computation on the loss raised the following error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation. Do you know how to solve this? My CUDA version is 12.2, so the versions in requirements.txt don't fit my setup; I first tried torch 2.1.0 and then switched to 2.2.1+cu118, and the problem appears with both. Looking forward to your reply.

@tangyz213

Have you solved it? I'm running into the same problem.

@ByChelsea
Owner

ByChelsea commented Mar 1, 2024

Can you provide more detailed error information, please? I need to pinpoint the location of the error.

@yangzc0214

yangzc0214 commented Apr 7, 2024

> Can you provide more detailed error information, please? I need to pinpoint the location of the error.

Traceback (most recent call last):
  File "train.py", line 177, in <module>
    train(args)
  File "train.py", line 140, in train
    loss.backward()
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\autograd\__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [8, 1369, 768]], which is output 0 of DivBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
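
As the hint at the end suggests, enabling anomaly detection makes backward() also report the forward operation that performed the in-place write. A minimal sketch of the same class of error (stand-in code, not this repo's):

import torch

# With anomaly detection on, the RuntimeError below also prints a stack
# trace pointing at the forward op that was later modified in place.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(8, 4, requires_grad=True)
w = torch.randn(8, 4, requires_grad=True)
y = x / x.norm(dim=-1, keepdim=True)  # y is "output 0 of DivBackward0"
z = (y * w).sum()                     # this mul saves y for its backward pass
y /= 2                                # in-place write bumps y's version counter
z.backward()                          # RuntimeError: ... at version 1; expected version 0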


My env: Windows 11, torch 2.2.2+cu121.
In my env, I modified line 122 of train.py to the following, and the error disappeared:

patch_tokens[layer] = patch_tokens[layer] / patch_tokens[layer].norm(dim=-1, keepdim=True)
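
For anyone wondering why this helps: the in-place form (presumably "patch_tokens[layer] /= ..." originally, given the DivBackward0 in the trace) mutates a tensor that an earlier op already saved for its backward pass and bumps its version counter, while the out-of-place division allocates a new tensor and only rebinds the list entry, leaving the saved tensor untouched. A hedged illustration with stand-in tensors, not the repo's actual variables:

import torch

x = torch.randn(2, 5, 8, requires_grad=True)
w = torch.randn(8, requires_grad=True)

t = x * w                              # t is an intermediate in the graph
u = (t * w).sum()                      # this mul saves t for its backward pass
t = t / t.norm(dim=-1, keepdim=True)   # out-of-place: only rebinds the name
(u + t.sum()).backward()               # fine: the saved tensor was never mutated

# The in-place variant "t /= t.norm(dim=-1, keepdim=True)" would mutate the
# saved tensor instead, and backward() would raise the RuntimeError above.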

@oylz

oylz commented Apr 15, 2024

fix it here
