
Invalid gradient when finetuning and learning rate with gradient clip setting #65

Open
skill-diver opened this issue Jul 25, 2024 · 6 comments

@skill-diver commented Jul 25, 2024

Hi Author,

Thank you for sharing this project and for your kindness in answering my previous questions. I have a few questions about training:

  1. What is your default learning rate and gradient clip setting when training from scratch?
  2. I tried to replace the DINO part of the encoder with another ViT and the performance got worse, so I decided to finetune your weights. However, I get an invalid gradient if I use the code's learning rate and gradient clip. I worked around this by gradually unfreezing layers over the epochs, but the performance improvement is really slow.
  3. Do you have any better suggestions for avoiding the invalid gradient when finetuning your model with a new ViT? My current idea is to train from scratch, but would that take too much time?

Thank you so much.

@Parskatt (Owner) commented Jul 25, 2024

  1. It's different depending on the encoder and decoder; the settings should be in the train experiment. The grad clip is 0.01, I think. Basically, you can set the grad clip threshold super low so that all gradients are clipped. This helped a bit with stability.

  2. The model is trained with a step LR schedule, so at the end the LR is 1/10 of the original one. If you want to finetune, I suggest that LR.

  3. It's probably difficult to replace the ViT without training from scratch, since the features will be different.

Sorry for the lack of detail; I'm on my phone and can't check things right now.
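
A minimal PyTorch sketch of points 1 and 2 above, assuming AdamW and `clip_grad_norm_`; the model, data, and `base_lr` here are placeholders, not the repo's actual training code:

```python
import torch
import torch.nn as nn

# Stand-ins for the real model and dataloader from the train experiment;
# base_lr is hypothetical, the actual values depend on the encoder/decoder.
model = nn.Linear(16, 1)
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(3)]
base_lr = 1e-4

# Finetune at roughly 1/10 of the from-scratch LR (the end-of-schedule value).
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr / 10)
loss_fn = nn.MSELoss()

for x, y in data:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    # Very low clipping threshold (0.01): effectively every gradient gets
    # clipped, trading gradient magnitude for training stability.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.01)
    optimizer.step()
```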

If you have issues with stability, you could check which params give NaNs and manually use fp32 there.

You might also want to freeze the batchnorm layers of the network; I've found that batchnorm can cause a lot of issues.
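
Hedged helpers for the two tips above; this is generic PyTorch, not RoMa-specific code, and the function names are made up:

```python
import torch
import torch.nn as nn

def report_nonfinite_grads(model: nn.Module) -> None:
    """Call right after backward(): prints parameters whose gradients contain
    NaN/Inf; these are candidates to keep in fp32 under mixed precision."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in: {name}")

def freeze_batchnorm(model: nn.Module) -> None:
    """Put all BatchNorm layers in eval mode (freezing running stats) and
    stop updating their affine parameters."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()
            for p in m.parameters():
                p.requires_grad_(False)
```

Note that `model.train()` switches BatchNorm layers back to training mode, so the freeze would need to be re-applied after each `model.train()` call.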

@skill-diver (Author) commented Jul 27, 2024

How many days did you spend training the RoMa model? I also find that if I replace DINO with another ViT, the training result is bad.

@Devoe-97

> How many days did you spend training the RoMa model? I also find that if I replace DINO with another ViT, the training result is bad.

I'm stuck on the same problem; do you have any ideas on how to solve it?

@Parskatt (Owner)

It was trained for 4 days with 4 A100 GPUs. You can also avoid issues by using bfloat16 instead of float16.
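
A small sketch of the bfloat16 suggestion, assuming a CUDA device that supports bf16 (e.g. Ampere or newer); the model here is a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1).cuda()          # placeholder for the actual matcher
x = torch.randn(8, 16, device="cuda")

# bfloat16 keeps the exponent range of fp32, so overflows that produce
# NaN/Inf under float16 are far less likely, and no GradScaler is needed.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
```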

@Devoe-97

> 3. My current idea is to train from scratch

The gradient is NaN when training from scratch; is there any solution to this?

@Devoe-97

> It was trained for 4 days with 4 A100 GPUs. You can also avoid issues by using bfloat16 instead of float16.

Hello, I compared RoMa and DKM and found that the main differences are the coordinate_decoder implementation and the use of DINO features. I have trouble understanding why RoMa seems harder to converge and more prone to NaNs, while DKM doesn't even need gradient clipping or scaling.
