feat: retain only last checkpoint directory #358

leseb · 2024-11-27T13:22:16Z

0e8605a feat: retain only last checkpoint directory

commit 0e8605a
Author: Sébastien Han [email protected]
Date: Wed Nov 27 14:21:27 2024 +0100

feat: retain only last checkpoint directory

Introduce a new command-line argument, `--keep_last_checkpoint_only`.
When this flag is enabled, only the last checkpoint directory is kept:
each epoch overwrites the previous one, and the checkpoint directory
is named `last_epoch`.

This flag helps manage disk space during model training. With large
models and datasets, each epoch's checkpoint can consume a substantial
amount of disk space. Enabling `--keep_last_checkpoint_only` ensures
that only the most recent model state is saved, which is often
sufficient for training and evaluation purposes.

Since we always pick epoch 7 during phase 1 training and do not
evaluate each epoch, one might decide it is not worth saving every
epoch. Keeping only the last checkpoint reduces the storage required
and leaves a cleaner, more manageable file system.

Signed-off-by: Sébastien Han <[email protected]>
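
For readers who want to see the behavior described in the commit message, here is a minimal sketch, not the code from this PR. Only the flag name `--keep_last_checkpoint_only` and the directory name `last_epoch` come from the commit message; the helper names (`build_parser`, `checkpoint_dir`, `save_checkpoint`), the `epoch_<n>` naming for the default case, and the `state.txt` placeholder file are assumptions made for illustration.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the flag named in the commit message."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--keep_last_checkpoint_only",
        action="store_true",
        help="Keep a single checkpoint directory and overwrite it every epoch.",
    )
    return parser


def checkpoint_dir(output_dir: Path, epoch: int, keep_last_only: bool) -> Path:
    # `last_epoch` is the name given in the commit message.
    # `epoch_<n>` is a hypothetical name for the default, per-epoch case.
    name = "last_epoch" if keep_last_only else f"epoch_{epoch}"
    return output_dir / name


def save_checkpoint(output_dir: Path, epoch: int, keep_last_only: bool) -> Path:
    # Placeholder for a real model save: writes a small marker file instead.
    target = checkpoint_dir(output_dir, epoch, keep_last_only)
    target.mkdir(parents=True, exist_ok=True)
    (target / "state.txt").write_text(f"epoch={epoch}\n")
    return target


if __name__ == "__main__":
    args = build_parser().parse_args(["--keep_last_checkpoint_only"])
    with tempfile.TemporaryDirectory() as tmp:
        for epoch in range(3):
            save_checkpoint(Path(tmp), epoch, args.keep_last_checkpoint_only)
        # Only one directory remains because every epoch reused the same path.
        print(sorted(p.name for p in Path(tmp).iterdir()))  # ['last_epoch']
```

Without the flag, the same loop would leave one directory per epoch, which is the disk usage the flag is meant to avoid.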

leseb · 2024-11-27T13:23:21Z

I’m not entirely sure where to add tests. I noticed tests/smoketest.sh, but it doesn’t seem to be triggered in the CI pipeline. If someone could guide me on where to include the tests, I’d be happy to add them. Thanks!

RobotSail · 2024-11-27T23:28:14Z

Thanks for the PR @leseb, this looks good overall. Will take a closer look on Monday.

leseb · 2024-12-06T08:39:47Z

@JamesKunstle PTAL :)

JamesKunstle · 2024-12-07T18:51:25Z

@leseb I like this addition. Your reasons for creating it make sense to me. To describe the behavior more directly, would you consider renaming the option to something like --shallow-checkpoint-history? --keep-last-epoch-only is slightly ambiguous; the name should be specific to checkpoints.

RobotSail · 2024-12-07T21:07:07Z

I'm in favor of keeping it --keep-last-epoch-only since the implementation just has it always overwriting the last directory. It's also simple and easy to read.

JamesKunstle · 2024-12-07T21:27:56Z

Maybe --keep-last-checkpoint-only as a middle-ground

RobotSail · 2024-12-07T21:31:52Z

Yeah that makes sense, or --single-checkpoint-only is another that could apply here

leseb · 2024-12-09T08:32:26Z

@RobotSail @JamesKunstle thanks for the valuable input, I decided to go with --keep_last_checkpoint_only. PTAL again :)

leseb · 2024-12-09T13:32:20Z

Looks like the runner ran out of space: OSError: Not enough free space to write 67108864 bytes.

leseb · 2024-12-16T10:29:59Z

@RobotSail @JamesKunstle anything else you want me to change? Thanks

JamesKunstle

lgtm!

mergify bot added the documentation label Nov 27, 2024

leseb force-pushed the opt-rm-checkpoints branch 2 times, most recently from 07b4201 to 1e87619 December 9, 2024 08:31

leseb changed the title ~~feat: retain only last epoch directory~~ feat: retain only last checkpoint directory Dec 9, 2024

mergify bot added the ci-failure label Dec 9, 2024

leseb force-pushed the opt-rm-checkpoints branch from 1e87619 to 0e8605a December 9, 2024 13:33

mergify bot removed the ci-failure label Dec 9, 2024

JamesKunstle approved these changes Dec 16, 2024

mergify bot added the one-approval label Dec 16, 2024
