
Fixed wrong lr initialization when loading checkpoints #22

Open
wants to merge 3 commits into main
Conversation

AleHD commented Nov 20, 2024

For some reason, the lr in the optimizers' parameter groups is set to zero on the very first iteration. This is problematic when loading checkpoints: the resumed run no longer follows the same training trajectory as an uninterrupted run, but instead effectively skips one optimization iteration. This PR fixes this issue.
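
To make the intended behaviour concrete, here is a minimal sketch of the idea, assuming a PyTorch-style optimizer and LR scheduler; the function name, checkpoint keys, and overall structure are hypothetical and not taken from this repository's code:

```python
# Hypothetical sketch of the idea, not the PR's actual implementation.
import torch

def resume_from_checkpoint(model, optimizer, scheduler, path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])

    # Without the fix, the param-group lr can be left at 0.0 for the first
    # resumed iteration, so one optimizer update is effectively skipped.
    # Re-sync it from the scheduler's last computed value instead.
    for group, lr in zip(optimizer.param_groups, scheduler.get_last_lr()):
        group["lr"] = lr

    return ckpt["iteration"]
```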

Minimal example:
[image: loss curves for uninterrupted vs. resumed runs]
Green follows a 10-iteration training run without interruption. Orange is the current behaviour: loading from iteration 5, it quickly diverges because of the incorrect lr set when loading. Brown (overlapping green) is the fix: it also loads from iteration 5 but follows the same expected route.

AleHD marked this pull request as draft November 29, 2024 12:57
AleHD and others added 2 commits December 2, 2024 15:52

AleHD commented Dec 2, 2024

I have introduced Kyle's changes to finally fix the issue. Resuming now looks to work successfully regardless of the stage of the learning rate schedule (warmup, steady, cooldown, and after cooldown all work fine). Should be good to merge :)

[image: results of resuming at different lr schedule stages]
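
As a purely illustrative sanity check of that claim (hypothetical code, not from this PR, with a made-up warmup/steady/cooldown schedule), a resumed run should trace exactly the same lr values as an uninterrupted one no matter which phase the resume falls in:

```python
# Hypothetical sanity check, not part of the PR. The schedule shape and step
# counts are invented; the point is only that resuming reproduces the lr
# trajectory of an uninterrupted run in every phase of the schedule.
import torch

def build(total=10, warmup=3, cooldown_start=7):
    model = torch.nn.Linear(2, 2)
    opt = torch.optim.SGD(model.parameters(), lr=1.0)

    def factor(step):
        if step < warmup:                 # linear warmup
            return (step + 1) / warmup
        if step < cooldown_start:         # steady phase
            return 1.0
        return max(0.0, (total - step) / (total - cooldown_start))  # cooldown

    sched = torch.optim.lr_scheduler.LambdaLR(opt, factor)
    return opt, sched

def lr_trace(resume_at=None, total=10):
    opt, sched = build(total)
    lrs = []
    for step in range(total):
        if step == resume_at:
            # Simulate the save/load round trip a resumed run goes through.
            saved = {"opt": opt.state_dict(), "sched": sched.state_dict()}
            opt, sched = build(total)
            opt.load_state_dict(saved["opt"])
            sched.load_state_dict(saved["sched"])
        lrs.append(opt.param_groups[0]["lr"])
        opt.step()
        sched.step()
    return lrs

reference = lr_trace()
for resume_at in (2, 5, 8):  # warmup, steady, cooldown
    assert lr_trace(resume_at) == reference
```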

AleHD marked this pull request as ready for review December 2, 2024 15:31