The updated learning rate is different for every parameter in AdamHD #9
Comments
I suppose that 'p', instead of being a single parameter, represents a tensor containing all the parameters... is that so, @gbaydin?
Hello @gbaydin, when …
If this is the case, the updated learning rate would be different for each of the parameter tensors in every optimization step, I suppose.
@gbaydin Can you please clarify this? Thanks.
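For context, here is a minimal sketch (plain NumPy, hypothetical names, not the repo's actual code) of the pattern being described: the learning rate is a single scalar that gets updated inside the loop over parameter tensors, so each tensor is stepped with a slightly different value. The SGD-HD hypergradient (current gradient dotted with the previous update direction) is used for brevity; AdamHD would use the previous Adam direction instead.

```python
# Sketch only: learning rate updated *inside* the per-tensor loop,
# so each parameter tensor sees a different, successively modified lr.
import numpy as np

def hd_step_per_tensor(params, grads, prev_updates, lr, hypergrad_lr):
    """params, grads, prev_updates: lists of NumPy arrays; lr: scalar."""
    for p, g, u_prev in zip(params, grads, prev_updates):
        h = np.sum(g * u_prev)        # hypergradient from this tensor only
        lr = lr + hypergrad_lr * h    # lr changes before the next tensor is reached
        p -= lr * g                   # this tensor is already stepped with the modified lr
    return lr
```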
Hey,
First, nice work! :)
I'm referring to the Adam version (AdamHD); the SGD version doesn't seem to have this problem.
If I understand the paper correctly, the gradient with respect to all parameters is used to update the learning rate. The learning rate is then updated once per optimization step and afterwards used for the gradient descent update of the parameters.
In your implementation, however, the learning rate is successively updated from each parameter's gradient (inside the optimizer's loop over the parameters) and then directly used for the gradient descent update of that parameter.
This effectively gives every parameter a different learning rate, since the rate is modified step by step within the loop; only the last parameters in the backpropagation order are updated with the learning rate that has received the "full" hypergradient step.
Am I missing something? Thanks for your help :)
Kind regards, Heiner
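By contrast, here is a sketch of the reading described above (same hypothetical NumPy setup, again not the repo's code): accumulate the hypergradient across all parameter tensors, update the learning rate once, and only then apply that single rate to every tensor.

```python
# Sketch only: one learning-rate update per optimization step,
# computed from the hypergradient summed over *all* parameter tensors.
import numpy as np

def hd_step_global(params, grads, prev_updates, lr, hypergrad_lr):
    """params, grads, prev_updates: lists of NumPy arrays; lr: scalar."""
    # full hypergradient: dot product of the (conceptually flattened) gradient
    # with the previous update direction, summed over all tensors
    h = sum(np.sum(g * u_prev) for g, u_prev in zip(grads, prev_updates))
    lr = lr + hypergrad_lr * h        # single lr update for this step
    for p, g in zip(params, grads):
        p -= lr * g                   # every tensor is stepped with the same lr
    return lr
```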