Learning rate increases? #4
Comments
Well, I don't know about the equation but practically, it first increases the learning rate if it is too low. @gbaydin I also experienced that the learning rate sometimes becomes negative, especially if hypergrad_lr is high. Should we maybe place a constraint (e.g. clipping) to prevent that from happening?
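A minimal sketch of the clipping idea, assuming the additive hypergradient update from the paper; the function and parameter names here are illustrative, not the repository's API:

```python
import torch

# Hypothetical sketch of the clipping suggestion: perform the additive
# hypergradient update alpha_t = alpha_{t-1} + beta * (grad . prev_grad),
# then clamp alpha so it can never go negative.
def clipped_hypergrad_update(alpha, beta, grad, prev_grad, min_alpha=0.0):
    h = torch.dot(grad.flatten(), prev_grad.flatten()).item()
    return max(alpha + beta * h, min_alpha)  # clip to keep alpha >= 0
```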
By modifying the algorithms described in the original paper, adam-hd worked fine for me. This thesis (multiplicative hypergradient descent) helped. A negative learning rate can be seen in the original experiments, but as I understand it, that is accepted. Some clipping might help, though.
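One way to realize a multiplicative rule is to take the hypergradient step in log-space, which keeps the learning rate positive by construction. This is a sketch of my own, an assumption rather than necessarily the thesis's exact formulation:

```python
import math
import torch

# Sketch of a multiplicative-style rule (an assumption on my part, not
# necessarily the thesis's exact form): update log(alpha) instead of alpha,
# so alpha = exp(log_alpha) stays strictly positive by construction.
def multiplicative_hypergrad_update(alpha, beta, grad, prev_grad):
    h = torch.dot(grad.flatten(), prev_grad.flatten()).item()
    return alpha * math.exp(beta * h)  # alpha > 0 is preserved
```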
@myui What do you mean by fine? What exactly was wrong with the version that is in this repository?
@akaniklaus learning rates monotonically increased under certain conditions, because of the dot-product behavior I described in my question below.
@myui if you look at the results in the paper and in David Martinez's thesis, you can see that the algorithms, as they are formulated in the paper, can both increase and decrease the learning rate according to the loss landscape. I think your interpretation that a monotonically increasing learning rate would be observed is not correct. It is, however, correct that a small initial learning rate is most of the time increased (almost monotonically) up to some limit in the initial part of training, but if you run training long enough, this is almost always followed by a decay (decrease) of the learning rate during the rest of the training. The poster here gives a quick summary: https://github.com/gbaydin/hypergradient-descent/raw/master/poster/iclr_2018_poster.pdf You can of course have your own modifications of this algorithm.
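For reference, the SGD variant of the update from the paper (SGD-HD) can be sketched as follows; the variable names are mine. The dot product is positive when consecutive gradients align (so α grows) and negative when they oppose (so α shrinks), which is exactly the increase-then-decay behavior described above:

```python
import torch

# Sketch of the SGD-HD update from the paper:
#   alpha_t = alpha_{t-1} + beta * grad_f(theta_{t-1}) . grad_f(theta_{t-2})
#   theta_t = theta_{t-1} - alpha_t * grad_f(theta_{t-1})
def sgd_hd_step(theta, alpha, beta, grad, prev_grad):
    alpha = alpha + beta * torch.dot(grad.flatten(), prev_grad.flatten()).item()
    theta = theta - alpha * grad
    return theta, alpha
```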
Negative learning rates sometimes happen, and it's not as catastrophic as it first sounds. It just means that the algorithm decides to backtrack (do gradient ascent instead of descent) under some conditions. In my observation, negative learning rates happen in the late stages of training, where the learning rate has decayed towards a very low positive value and started to fluctuate around it. If the fluctuation is too strong, and if the decayed value is close to zero, the learning rate sometimes becomes negative. I think this in effect means that the algorithm stays in the same region of the loss landscape because it has converged to a (local) optimum. My view is that it is valuable to reason about this behavior and pursue a theoretical understanding of its implications, rather than adding extra heuristics to "fix" or clip this behavior. I haven't had much time to explore this yet, but I hope to do so in the near future.
Backtracking makes sense.
I have a question about the following part of the paper:

α_t = α_{t-1} + β ∇f(θ_{t-1})・∇f(θ_{t-2})

In the dot product ∇f(θ_{t-1})・∇f(θ_{t-2}), sign(∇f(θ_{t-1})) and sign(∇f(θ_{t-2})) would often be the same. Then the learning rate α_t would increase monotonically in the above equation whenever sign(∇f(θ_{t-1})) = sign(∇f(θ_{t-2})). I assume the difference between the gradient at t-1 and the previous gradient at t-2 is usually small.
Am I missing something?
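To make the premise concrete, here is a tiny 1-D check (a hypothetical sketch of my own, not from the repository or the paper): on f(θ) = θ², consecutive gradients agree in sign while α is small, so the dot product stays positive and α grows; but once α grows enough for a step to overshoot the minimum, the gradients oppose, the dot product turns negative, and α decreases again.

```python
# Hypothetical 1-D check: f(theta) = theta**2, so grad = 2 * theta.
# While consecutive gradients agree in sign the dot product is positive and
# alpha grows; once a step overshoots the minimum the gradients oppose,
# the dot product turns negative, and alpha decreases.
theta, alpha, beta = 1.0, 0.1, 0.2
prev_grad = 2.0 * theta
theta -= alpha * prev_grad
for t in range(5):
    grad = 2.0 * theta
    alpha += beta * grad * prev_grad   # hypergradient update on alpha
    theta -= alpha * grad              # parameter update with updated alpha
    print(f"t={t}  grad*prev_grad={grad * prev_grad:+.4f}  alpha={alpha:.4f}")
    prev_grad = grad
```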