feat: retain only last checkpoint directory #358
base: main
I’m not entirely sure where to add tests. I noticed tests/smoketest.sh, but it doesn’t seem to be triggered in the CI pipeline. If someone could guide me on where to include the tests, I’d be happy to add them. Thanks!
Thanks for the PR @leseb, this looks good overall. Will take a closer look on Monday.
@JamesKunstle PTAL :)
@leseb I like this addition. Your reasons for creating it make sense to me. To describe the behavior more directly, would you consider renaming the option to something like |
I'm in favor of keeping it |
Maybe |
Yeah that makes sense, or |
Force-pushed: 07b4201 to 1e87619
@RobotSail @JamesKunstle thanks for the valuable inputs, I decided to go with |
Looks like the runner ran out of space: |
Introduced a new command-line argument, `--keep_last_checkpoint_only`. When this flag is enabled, only the last checkpoint directory is kept: the epoch directory is named `last_epoch`, and each epoch overwrites the previous one.

This flag is useful for managing disk space during model training. With large models and datasets, each epoch's checkpoint can consume a substantial amount of disk space, so keeping only the most recent one significantly reduces the storage required. The most recent model state is often sufficient for training and evaluation purposes.

Given that we always pick epoch 7 during phase 1 training and do not evaluate each epoch, one might decide that saving every epoch is not worthwhile. Keeping only the last checkpoint also avoids clutter and keeps the file system cleaner and more manageable.

Signed-off-by: Sébastien Han <[email protected]>
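For readers skimming the thread, here is a minimal sketch of the behavior the commit message describes. It is not the PR's actual diff: only the flag name `--keep_last_checkpoint_only` and the directory name `last_epoch` come from the description above. The helper names (`build_parser`, `checkpoint_dir`, `save_checkpoint`), the `epoch_<n>` naming for the default case, and the `model.txt` placeholder file are assumptions made for illustration.

```python
import argparse
import shutil
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the flag named in the commit message."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--keep_last_checkpoint_only",
        action="store_true",
        help="Keep a single checkpoint directory that each epoch overwrites.",
    )
    return parser


def checkpoint_dir(output_dir: Path, epoch: int, keep_last_checkpoint_only: bool) -> Path:
    """Pick the directory for this epoch's checkpoint.

    'last_epoch' is the name given in the commit message; 'epoch_<n>' is an
    assumed naming scheme for the default (keep everything) case.
    """
    name = "last_epoch" if keep_last_checkpoint_only else f"epoch_{epoch}"
    return output_dir / name


def save_checkpoint(
    output_dir: Path, epoch: int, payload: str, keep_last_checkpoint_only: bool
) -> Path:
    """Write a placeholder checkpoint, replacing the previous one if requested."""
    target = checkpoint_dir(output_dir, epoch, keep_last_checkpoint_only)
    if keep_last_checkpoint_only and target.exists():
        # Remove the previous epoch's files so stale artifacts do not linger.
        shutil.rmtree(target)
    target.mkdir(parents=True, exist_ok=True)
    # Hypothetical stand-in for the real model files.
    (target / "model.txt").write_text(payload)
    return target
```

With the flag set, calling `save_checkpoint` once per epoch leaves a single `last_epoch` directory holding the most recent state. Without it, one directory per epoch accumulates, which is the disk-usage pattern the PR aims to avoid.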
Force-pushed: 1e87619 to 0e8605a
@RobotSail @JamesKunstle anything else you want me to change? Thanks
lgtm!
0e8605a feat: retain only last checkpoint directory
commit 0e8605a
Author: Sébastien Han [email protected]
Date: Wed Nov 27 14:21:27 2024 +0100