
The updated learning rate is different for every parameter in AdamHD #9

Open
h-spiess opened this issue Jun 25, 2019 · 4 comments

h-spiess commented Jun 25, 2019

Hey,

First, nice work! :)

I'm referring to the Adam version (AdamHD). SGD doesn't seem to have that problem.

If I understand the paper correctly, the gradient w.r.t. all parameters is used to update the learning rate. The learning rate is then updated once per step and used for the gradient descent step on the parameters.

In your implementation, however, the learning rate is updated successively w.r.t. the current parameter's gradient (inside the optimizer's loop over the parameters) and is then used directly for the gradient descent step on that parameter.

This effectively leads to a different learning rate for every parameter, since the rate is modified successively during the loop. Only the last parameter tensors in the loop are updated with a learning rate that has received the "full" hypergradient step.
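To make this concrete, here is a minimal sketch of the two variants I mean (hypothetical code, using the plain SGD-style hypergradient rule for brevity rather than the Adam one, not your actual implementation):

```python
import torch

def step_per_tensor(params, state, lr, beta):
    # Variant I am describing: the learning rate is updated inside the
    # loop over parameter tensors, so each tensor is stepped with a
    # different (partially updated) learning rate.
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        u_prev = state.get(i, torch.zeros_like(p.grad))         # previous update direction
        h = torch.dot(p.grad.reshape(-1), u_prev.reshape(-1))   # partial hypergradient (this tensor only)
        lr = lr - beta * h.item()                               # lr already changes mid-loop
        p.data.add_(p.grad, alpha=-lr)
        state[i] = -p.grad.clone()
    return lr

def step_global(params, state, lr, beta):
    # What I understand from the paper: accumulate the hypergradient over
    # all parameters, update the learning rate once, then apply that single
    # rate to every parameter.
    params = [p for p in params if p.grad is not None]
    h = sum(torch.dot(p.grad.reshape(-1),
                      state.get(i, torch.zeros_like(p.grad)).reshape(-1)).item()
            for i, p in enumerate(params))
    lr = lr - beta * h
    for i, p in enumerate(params):
        p.data.add_(p.grad, alpha=-lr)
        state[i] = -p.grad.clone()
    return lr
```

In both sketches, `state` is just a dict holding the previous update direction per tensor; the only difference is where the learning rate update happens.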

Am I missing something? Thanks for your help :)

Kind regards, Heiner

h-spiess changed the title from "The updated learning rate is different for every parameter" to "The updated learning rate is different for every parameter in AdamHD" on Jun 25, 2019
harshalmittal4 commented

I suppose that 'p', instead of being a single parameter, represents a tensor containing all the parameters... is that so, @gbaydin?


harshalmittal4 commented Jul 9, 2019

Hello @gbaydin, when model.parameters() is passed as an argument to the optimizer, it represents a single parameter group.
In this parameter group, group['params'] contains 2 elements (tensors), i.e. 2 'p's, for the logreg model; so does that mean that all parameters of the logreg model are represented by 2 tensors and both are updated at each optimization step?
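For example, here is a quick check of what I mean by the 2 'p's (assuming the logreg model is a single torch.nn.Linear layer; the shapes are just for illustration):

```python
import torch

model = torch.nn.Linear(784, 10)            # logistic regression: weight + bias
params = list(model.parameters())
print(len(params))                          # 2 tensors: weight (10, 784) and bias (10,)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
print(len(opt.param_groups))                # 1 parameter group
print(len(opt.param_groups[0]['params']))   # 2 'p' tensors inside that group
```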
Thanks!


harshalmittal4 commented Jul 9, 2019

If this is the case, the updated learning rate would be different for the two parameter tensors in each optimization step, I suppose.

harshalmittal4 commented

@gbaydin Can you please clarify this? Thanks.
