perf: optimize training loop #4426
<details>
<summary>📝 Walkthrough</summary>
## Walkthrough
The pull request introduces modifications to the `Trainer` class in `deepmd/pt/train/training.py`, focusing on enhancing the training loop's control flow and error handling. Key changes include the introduction of an error handling mechanism in the gradient clipping logic, improved logging for training and validation metrics, and adjustments to ensure that the training state is managed effectively. Several methods within the class have had their logic updated without changing their signatures.
## Changes
| File Path | Change Summary |
|-------------------------------|----------------------------------------------------------------------------------------------------|
| deepmd/pt/train/training.py | - Modified training loop to include a check for `self.wrapper.train()` after evaluation. |
| | - Updated gradient clipping logic with `error_if_nonfinite=True`. |
| | - Enhanced logging functionality for clearer training and validation results. |
| | - Multiple method logic updates in `Trainer` class without changing method signatures. |
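
The training-mode check listed in the table can be illustrated with a small sketch. This is not the code from `deepmd/pt/train/training.py`; the `wrapper` model and the `ensure_train_mode` helper are hypothetical names used only for illustration. It relies on the standard `torch.nn.Module.training` flag and on `Module.train()` walking every submodule.

```python
from torch import nn

# Hypothetical stand-in for the trainer's wrapped model (an assumption, not DeePMD-kit code).
wrapper = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))


def ensure_train_mode(module: nn.Module) -> bool:
    """Switch to training mode only when needed; return True if a switch happened."""
    if module.training:
        # Already in training mode: skip the recursive walk over all submodules.
        return False
    module.train()  # recursively sets .training = True on every submodule
    return True


wrapper.eval()  # e.g. after a validation pass
switched_after_eval = ensure_train_mode(wrapper)  # a switch is needed here
switched_again = ensure_train_mode(wrapper)  # nothing to do on the next step
```

The saving per step is small, but the check runs on every training iteration, so skipping the recursive call adds up for models with many submodules.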
## Possibly related PRs
- #4212: This PR modifies the logging and management of training steps in `deepmd/pt/train/training.py`, which is directly related to the changes made in the `Trainer` class regarding training state management and logging enhancements.
## Suggested reviewers
- wanghan-iapcm
- iProzd
</details>
## Codecov Report
All modified and coverable lines are covered by tests ✅

Coverage diff:

| Metric | devel | #4426 | +/- |
|----------|--------|--------|-----|
| Coverage | 84.64% | 84.64% | |
| Files | 614 | 614 | |
| Lines | 57135 | 57136 | +1 |
| Branches | 3486 | 3488 | +2 |
| Hits | 48364 | 48365 | +1 |
| Misses | 7645 | 7644 | -1 |
| Partials | 1126 | 1127 | +1 |
Improvements to the training process:

* [`deepmd/pt/train/training.py`](diffhunk://#diff-a90c90dc0e6a17fbe2e930f91182805b83260484c9dc1cfac3331378ffa34935R659): Added a check that skips setting the model to training mode if it is already in training mode. Profiling shows that recursively setting the mode on all models takes some time.
* [`deepmd/pt/train/training.py`](diffhunk://#diff-a90c90dc0e6a17fbe2e930f91182805b83260484c9dc1cfac3331378ffa34935L686-L690): Modified the gradient clipping call to pass the `error_if_nonfinite` parameter, and removed the manual check for non-finite gradients along with the exception it raised.

## Summary by CodeRabbit

- **New Features**
  - Improved training loop with enhanced error handling and control flow.
  - Updated gradient clipping logic for better error detection.
  - Refined logging functionality for training and validation results.
- **Bug Fixes**
  - Prevented redundant training calls by adding conditional checks.
- **Documentation**
  - Clarified method logic in the `Trainer` class without changing method signatures.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

(cherry picked from commit 037cf3f)
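
The gradient clipping change can be sketched as follows. This is a minimal illustration rather than the PR's actual code: the `model`, `max_norm` value, and `clip_or_raise` helper are assumptions. It uses `torch.nn.utils.clip_grad_norm_`, whose `error_if_nonfinite=True` option raises a `RuntimeError` when the total gradient norm is NaN or infinite, so no separate manual check is needed.

```python
import torch
from torch import nn

# Hypothetical model and clipping threshold, chosen only for illustration.
model = nn.Linear(3, 1)
max_norm = 1.0


def clip_or_raise(module: nn.Module, max_norm: float) -> torch.Tensor:
    """Clip gradients in place; raise RuntimeError if their total norm is non-finite."""
    return torch.nn.utils.clip_grad_norm_(
        module.parameters(), max_norm, error_if_nonfinite=True
    )


# Finite gradients: clipping succeeds and returns the norm measured before clipping.
loss = model(torch.ones(2, 3)).sum()
loss.backward()
total_norm = clip_or_raise(model, max_norm)

# Non-finite gradients: poison one entry and observe the built-in error.
model.weight.grad[0, 0] = float("nan")
try:
    clip_or_raise(model, max_norm)
    raised = False
except RuntimeError:
    raised = True
```

Delegating the check to `clip_grad_norm_` avoids a second pass over the gradient norm on every step, which fits the performance goal stated in the PR title.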