feat: retain only last checkpoint directory #358

Open
wants to merge 1 commit into
base: main

leseb

@leseb leseb commented Nov 27, 2024

0e8605a feat: retain only last checkpoint directory

commit 0e8605a
Author: Sébastien Han [email protected]
Date: Wed Nov 27 14:21:27 2024 +0100

feat: retain only last checkpoint directory

Introduced a new command-line argument `--keep_last_checkpoint_only`.
This flag determines whether we should only keep the last checkpoint
directory, with the previous epoch directories always being overwritten.
When this flag is enabled, the epoch directory is named `last_epoch`.

This flag is useful for managing disk space efficiently during model
training. By keeping only the last checkpoint directory and overwriting
the previous ones, it helps to significantly reduce the amount of
storage required. This is particularly beneficial when working with
large models and datasets, where each epoch can consume a substantial
amount of disk space. By enabling the --keep_last_checkpoint_only flag,
users can ensure that only the most recent model state is saved,
which is often sufficient for many training and evaluation
purposes. This approach helps to avoid clutter and maintain a
cleaner and more manageable file system.

Given that we always pick epoch 7 during phase 1
training and do not evaluate each epoch, it may not be
worth saving every epoch. Keeping only the last checkpoint
significantly reduces the storage required, avoids clutter,
and keeps the file system cleaner and more manageable.

Signed-off-by: Sébastien Han <[email protected]>
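
The behavior described in the commit message can be illustrated with a minimal sketch. This is not the code from the PR: only the flag name `--keep_last_checkpoint_only` and the directory name `last_epoch` come from the commit message. The `--output_dir` argument, the `epoch_<n>` naming for the default case, the `model.bin` file name, and the helper functions are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flag named in the commit message."""
    parser = argparse.ArgumentParser()
    # Hypothetical argument: where checkpoint directories are written.
    parser.add_argument("--output_dir", type=Path, default=Path("checkpoints"))
    # Flag name taken from the commit message; off by default.
    parser.add_argument("--keep_last_checkpoint_only", action="store_true")
    return parser


def checkpoint_dir(output_dir: Path, epoch: int, keep_last_only: bool) -> Path:
    """Return the directory a checkpoint for `epoch` should be written to.

    With the flag set, every epoch maps to the same `last_epoch` directory,
    so each save overwrites the previous one. The `epoch_<n>` name used
    otherwise is an assumed convention, not the project's actual layout.
    """
    name = "last_epoch" if keep_last_only else f"epoch_{epoch}"
    return output_dir / name


def save_checkpoint(
    output_dir: Path, epoch: int, keep_last_only: bool, payload: bytes
) -> Path:
    """Write a stand-in checkpoint file and return its directory."""
    target = checkpoint_dir(output_dir, epoch, keep_last_only)
    target.mkdir(parents=True, exist_ok=True)
    # Writing to a fixed file name replaces any earlier content in place.
    (target / "model.bin").write_bytes(payload)
    return target
```

In this sketch, saving several epochs with the flag enabled leaves a single `last_epoch` directory holding the most recent payload, whereas the default path produces one directory per epoch.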

@mergify mergify bot added the documentation Improvements or additions to documentation label Nov 27, 2024
@leseb
Author

leseb commented Nov 27, 2024

I’m not entirely sure where to add tests. I noticed tests/smoketest.sh, but it doesn’t seem to be triggered in the CI pipeline. If someone could guide me on where to include the tests, I’d be happy to add them. Thanks!

@RobotSail
Member

Thanks for the PR, @leseb. This looks good overall; I'll take a closer look on Monday.

@leseb
Author

leseb commented Dec 6, 2024

@JamesKunstle PTAL :)

@JamesKunstle
Contributor

@leseb I like this addition. Your reasons for creating it make sense to me. To describe the behavior more directly, would you consider renaming the option to something like --shallow-checkpoint-history? --keep-last-epoch-only is slightly ambiguous; the name should be specific to checkpoints.

@RobotSail
Member

I'm in favor of keeping it --keep-last-epoch-only since the implementation just has it always overwriting the last directory. It's also simple and easy to read.

@JamesKunstle
Contributor

Maybe --keep-last-checkpoint-only as a middle-ground

@RobotSail
Member

Yeah that makes sense, or --single-checkpoint-only is another that could apply here

@leseb leseb force-pushed the opt-rm-checkpoints branch 2 times, most recently from 07b4201 to 1e87619 Compare December 9, 2024 08:31
@leseb
Author

leseb commented Dec 9, 2024

@RobotSail @JamesKunstle thanks for the valuable input. I decided to go with --keep_last_checkpoint_only. PTAL again :)

@leseb leseb changed the title feat: retain only last epoch directory feat: retain only last checkpoint directory Dec 9, 2024
@mergify mergify bot added the ci-failure label Dec 9, 2024
@leseb
Author

leseb commented Dec 9, 2024

Looks like the runner ran out of space: OSError: Not enough free space to write 67108864 bytes.

@leseb leseb force-pushed the opt-rm-checkpoints branch from 1e87619 to 0e8605a Compare December 9, 2024 13:33
@mergify mergify bot removed the ci-failure label Dec 9, 2024
@leseb
Author

leseb commented Dec 16, 2024

@RobotSail @JamesKunstle anything else you want me to change? Thanks

Contributor

@JamesKunstle JamesKunstle left a comment

lgtm!

@mergify mergify bot added the one-approval label Dec 16, 2024
Labels
documentation Improvements or additions to documentation one-approval
3 participants