-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add customized step collector into dlrover #1352
base: master
Are you sure you want to change the base?
add customized step collector into dlrover #1352
Conversation
--training_log_file specifies which logfile we collect step info --step_rank specifies the rank which will output the step info into logfile --step_pattern specifies the regex pattern of step info matching each log line
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1352 +/- ##
==========================================
- Coverage 81.17% 81.16% -0.01%
==========================================
Files 231 231
Lines 21988 22272 +284
==========================================
+ Hits 17849 18078 +229
- Misses 4139 4194 +55 ☔ View full report in Codecov by Sentry. |
@@ -238,6 +238,13 @@ class GlobalStep(Message): | |||
elapsed_time_per_step: float = 0.0 | |||
|
|||
|
|||
@dataclass | |||
class UserStep(Message): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An advice:
- Use
DiagnosisReportData
instead of the new definition here. - Use 'type' to tag it as 'UserStep' info and do resolving in master.
@@ -177,15 +177,21 @@ class ElasticLaunchConfig(LaunchConfig): | |||
tee: Union[Std, Dict[int, Std]] = Std.NONE | |||
training_log_file: str = "" | |||
failure_node_errors: str = "" | |||
step_rank: int = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a class object here to unify step info. Or u can define a more common class to include these fields(including: 'training_log_file' and 'failure_node_errors')
usage: add such parameters into agent command line, e.g. dlrover-run
--training_log_file specifies which logfile we collect step info
--step_rank specifies the rank which will output the step info into logfile
--step_pattern specifies the regex pattern of step info matching each log line
This patch support user defined step log format, so dlrover can collect training steps info from log file. It is useful for
us to detect training errors, e.g. hang or stepback
What changes were proposed in this pull request?
step pattern and file match
Why are the changes needed?
add customized step collector into dlrover
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
UT