Implement async distributed checkpoint save (#9028) #9203
Closed
* Prevent duplicated checkpoints
* Introduce DistributedCheckpointIO
* Fix DistCkptIO usage
* Use NeMo logger
* [DCIO] Fix save_to dist ckpt path
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add versioning to save_to
* Add versioning logic to all .nemo files
* Add versioning test
* Add dist-ckpt test
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Rename existing ckpts instead of using different name
* Add comment
* Use dist ckpt flag in all methods
* Improve error msg
* Add dist ckpt unit tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix load_checkpoint
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix auto-issues
* Fix ckpt_dir var
* Restore skipping behavior (the fix from prevent-duplicated-checkpoints is required to skip the checkpoints)
* Fix steps on single-GPU machine
* Run dist-ckpt test on GPU
* Add docs
* Apply black
* Prevent saving last for non-equal val intervals
* Move checkpoint on rank 0
* Fix num steps in tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add async ckpt implementation
* Abstract AsyncFinalizableCheckpointIO away
* Change async_save flag location
* Add debug info
* Apply formatting
* Handle multiple async saves
* Apply formatting
* Move finalization calls to a callback
* Avoid deadlock in teardown
* Adjust to MCore implementation
* Add notes and copyrights
* Apply formatting
* Fix async_request attribute
* Add MCore import guards
* Add async test
* Fix finalize_fn arg
* Add docs
* Remove checkpoints from accurate steps
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix MCore class usage
* Update docs
* Fix logger usage
* Fix rebase
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix code scan issues
* Remove unsused import
* Use dist-ckpt for Bert
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix load checkpoint return val
* Use dist-ckpt based on sharded_state_dict
* Add async logging
* Remove deprecated argument
* Use correct checkpoint_io
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix bad merge
* Improve debug msg
* Run async test on GPU
* Fix async ckpt unit test
* Apply isort and black reformatting
* Clarify async logs
* Add schema print
---------
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: mikolajblaz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Closing this since it will only go to main.
What does this PR do?
This is a cherry-pick of the async checkpoint save implementation from #9028.
Collection: NLP
Changelog
Usage
# Add a code snippet demonstrating how to use this
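The PR itself does not include a usage snippet. As a rough, library-agnostic illustration of the pattern the commit list describes (saving in the background, handling multiple in-flight saves, and running finalization from a later call), here is a minimal sketch using only the Python standard library. The class `AsyncCheckpointWriter` and its methods `save_async` and `finalize_completed` are hypothetical names invented for this example; they are not the NeMo or Megatron-Core API, and the real implementation may differ substantially.

```python
import json
import os
import threading
from collections import deque
from typing import Any, Callable, Deque, Dict, Tuple


class AsyncCheckpointWriter:
    """Hypothetical helper (not a NeMo class): writes checkpoints on
    background threads and finalizes them later, in submission order."""

    def __init__(self) -> None:
        # Queue of (writer thread, finalize callable), oldest first.
        self._pending: Deque[Tuple[threading.Thread, Callable[[], None]]] = deque()

    def save_async(
        self, state: Dict[str, Any], path: str, finalize_fn: Callable[[], None]
    ) -> None:
        # Snapshot the state before returning, so the caller can keep
        # mutating `state` while the write is still in flight.
        snapshot = json.dumps(state)
        tmp_path = path + ".tmp"

        def _write() -> None:
            with open(tmp_path, "w") as f:
                f.write(snapshot)

        def _finalize() -> None:
            # Publish the checkpoint only once it is fully written.
            os.replace(tmp_path, path)
            finalize_fn()

        thread = threading.Thread(target=_write)
        thread.start()
        self._pending.append((thread, _finalize))

    def finalize_completed(self, blocking: bool = False) -> int:
        """Finalize finished saves in FIFO order; return how many were finalized.

        With blocking=True, wait for every pending save (e.g. at teardown).
        Errors raised inside a writer thread are not handled in this sketch.
        """
        finalized = 0
        while self._pending:
            thread, finalize = self._pending[0]
            if blocking:
                thread.join()
            elif thread.is_alive():
                break  # Oldest save is still running; try again later.
            self._pending.popleft()
            finalize()
            finalized += 1
        return finalized
```

In a training loop, one would presumably call `finalize_completed()` periodically (for instance from a callback at the end of each step) and `finalize_completed(blocking=True)` once at shutdown, so that no save is left unpublished.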
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information