Implement async distributed checkpoint save (#9028) #9203

Closed
wants to merge 2 commits into from

mikolajblaz
Collaborator

What does this PR do?

This is a cherry-pick of the async checkpoint save implementation from #9028.

Collection: NLP

Changelog

  • Adds async distributed checkpoint save implementation
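The general idea behind an async checkpoint save can be sketched as follows. This is a minimal, stdlib-only illustration and not the NeMo or MCore implementation: the class `AsyncSaver` and its methods `schedule_save` and `maybe_finalize` are hypothetical names, a plain JSON file stands in for a distributed checkpoint, and a background thread stands in for whatever worker mechanism the real code uses. It shows the pattern the commits below refer to: the write is started in the background, and a finalization function runs later, once the write has completed.

```python
import json
import os
import tempfile
import threading


class AsyncSaver:
    """Hypothetical helper: writes checkpoints in background threads."""

    def __init__(self):
        # Pending saves in submission order: (worker thread, finalize function).
        self._pending = []

    def schedule_save(self, state, path, finalize_fn):
        # Snapshot the state so the caller can keep mutating it while we write.
        snapshot = dict(state)
        thread = threading.Thread(target=self._write, args=(snapshot, path))
        thread.start()
        self._pending.append((thread, finalize_fn))

    @staticmethod
    def _write(snapshot, path):
        # Write to a temporary file first, then rename, so a partially
        # written checkpoint never appears under the final name.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp_path, path)

    def maybe_finalize(self, blocking=False):
        # Finalize completed saves in submission order. With blocking=True,
        # wait for every pending save (for example at the end of training).
        while self._pending:
            thread, finalize_fn = self._pending[0]
            if blocking:
                thread.join()
            elif thread.is_alive():
                break
            self._pending.pop(0)
            finalize_fn()


if __name__ == "__main__":
    saver = AsyncSaver()
    ckpt_dir = tempfile.mkdtemp()
    state = {"step": 100}
    saver.schedule_save(
        state,
        os.path.join(ckpt_dir, "step_100.json"),
        finalize_fn=lambda: print("checkpoint for step 100 finalized"),
    )
    state["step"] = 101  # training continues while the save runs
    saver.maybe_finalize(blocking=True)
```

In a training loop, `maybe_finalize()` would typically be called without blocking at step boundaries and with `blocking=True` during teardown, so that no save is left unfinished.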

Usage

  • Set the config option exp_manager.checkpoint_callback_params.async_save=True
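As an illustration of where the flag lives in the config tree, here is a small sketch that applies a dotted override to a nested config. It uses a plain Python dict as a stand-in for NeMo's Hydra/OmegaConf config object, and `set_by_dotted_key` is a hypothetical helper written for this example, not a NeMo function. Only the key path `exp_manager.checkpoint_callback_params.async_save` comes from this PR.

```python
def set_by_dotted_key(cfg, dotted_key, value):
    """Hypothetical helper: set cfg["a"]["b"]["c"] = value from the key "a.b.c"."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        # Create intermediate sections if they do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value


# Plain-dict stand-in for the experiment config (an assumption for illustration).
cfg = {"exp_manager": {"checkpoint_callback_params": {}}}

# Enable async checkpoint save via the key named in this PR.
set_by_dotted_key(cfg, "exp_manager.checkpoint_callback_params.async_save", True)

print(cfg)
```

With a Hydra-based training script, the same key would normally be set in the YAML config or passed as a command-line override in the `key=value` form shown above.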

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

* Prevent duplicated checkpoints

Signed-off-by: Mikołaj Błaż <[email protected]>

* Introduce DistributedCheckpointIO

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix DistCkptIO usage

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use NeMo logger

Signed-off-by: Mikołaj Błaż <[email protected]>

* [DCIO] Fix save_to dist ckpt path

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add versioning to save_to

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add versioning logic to all .nemo files

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add versioning test

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add dist-ckpt test

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <[email protected]>

* Rename existing ckpts instead of using different name

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add comment

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use dist ckpt flag in all methods

Signed-off-by: Mikołaj Błaż <[email protected]>

* Improve error msg

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add dist ckpt unit tests

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix load_checkpoint

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix auto-issues

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix ckpt_dir var

Signed-off-by: Mikołaj Błaż <[email protected]>

* Restore skipping behavior

The fix from prevent-duplicated-checkpoints is required to skip the checkpoints

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix steps on single-GPU machine

Signed-off-by: Mikołaj Błaż <[email protected]>

* Run dist-ckpt test on GPU

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add docs

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply black

Signed-off-by: Mikołaj Błaż <[email protected]>

* Prevent saving last for non-equal val intervals

Signed-off-by: Mikołaj Błaż <[email protected]>

* Move checkpoint on rank 0

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix num steps in tests

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add async ckpt implementation

Signed-off-by: Mikołaj Błaż <[email protected]>

* Abstract AsyncFinalizableCheckpointIO away

Signed-off-by: Mikołaj Błaż <[email protected]>

* Change async_save flag location

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add debug info

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply formatting

Signed-off-by: Mikołaj Błaż <[email protected]>

* Handle multiple async saves

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply formatting

Signed-off-by: Mikołaj Błaż <[email protected]>

* Move finalization calls to a callback

Signed-off-by: Mikołaj Błaż <[email protected]>

* Avoid deadlock in teardown

Signed-off-by: Mikołaj Błaż <[email protected]>

* Adjust to MCore implementation

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add notes and copyrights

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply formatting

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix async_request attribute

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add MCore import guards

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add async test

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix finalize_fn arg

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add docs

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove checkpoints from accurate steps

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix MCore class usage

Signed-off-by: Mikołaj Błaż <[email protected]>

* Update docs

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix logger usage

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix rebase

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix code scan issues

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove unused import

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use dist-ckpt for Bert

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix load checkpoint return val

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use dist-ckpt based on sharded_state_dict

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add async logging

Signed-off-by: Mikołaj Błaż <[email protected]>

* Remove deprecated argument

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use correct checkpoint_io

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bad merge

Signed-off-by: Mikołaj Błaż <[email protected]>

* Improve debug msg

Signed-off-by: Mikołaj Błaż <[email protected]>

* Run async test on GPU

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix async ckpt unit test

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply isort and black reformatting

Signed-off-by: mikolajblaz <[email protected]>

* Clarify async logs

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add schema print

Signed-off-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: mikolajblaz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
mikolajblaz self-assigned this on May 15, 2024
github-actions bot added the core and NLP labels on May 15, 2024
mikolajblaz requested a review from dimapihtar on May 15, 2024, 12:11
@ericharper
Collaborator

Closing this since it will only go to main.

ericharper closed this on May 31, 2024
Labels: core, NLP, Run CICD