Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add customized step collector into dlrover #1352

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

majieyue
Copy link
Collaborator

usage: add such parameters into agent command line, e.g. dlrover-run

--training_log_file specifies which logfile we collect step info
--step_rank specifies the rank which will output the step info into logfile
--step_pattern specifies the regex pattern of step info matching each log line

This patch support user defined step log format, so dlrover can collect training steps info from log file. It is useful for
us to detect training errors, e.g. hang or stepback

What changes were proposed in this pull request?

step pattern and file match

Why are the changes needed?

add customized step collector into dlrover

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

UT

--training_log_file specifies which logfile we collect step info
--step_rank specifies the rank which will output the step info into logfile
--step_pattern specifies the regex pattern of step info matching each log line
Copy link

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 90.04739% with 21 lines in your changes missing coverage. Please review.

Project coverage is 81.16%. Comparing base (bcb2c44) to head (b2eb12c).
Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
dlrover/python/elastic_agent/monitor/training.py 73.91% 12 Missing ⚠️
dlrover/python/elastic_agent/torch/training.py 71.42% 4 Missing ⚠️
.../python/elastic_agent/diagnosis/diagnosis_agent.py 0.00% 3 Missing ⚠️
dlrover/python/master/monitor/speed_monitor.py 92.85% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1352      +/-   ##
==========================================
- Coverage   81.17%   81.16%   -0.01%     
==========================================
  Files         231      231              
  Lines       21988    22272     +284     
==========================================
+ Hits        17849    18078     +229     
- Misses       4139     4194      +55     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -238,6 +238,13 @@ class GlobalStep(Message):
elapsed_time_per_step: float = 0.0


@dataclass
class UserStep(Message):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An advice:

  1. Use DiagnosisReportData instead of the new definition here.
  2. Use 'type' to tag it as 'UserStep' info and do resolving in master.

@@ -177,15 +177,21 @@ class ElasticLaunchConfig(LaunchConfig):
tee: Union[Std, Dict[int, Std]] = Std.NONE
training_log_file: str = ""
failure_node_errors: str = ""
step_rank: int = -1
Copy link
Collaborator

@BalaBalaYi BalaBalaYi Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a class object here to unify step info. Or u can define a more common class to include these fields(including: 'training_log_file' and 'failure_node_errors')

@majieyue majieyue added the enhancement New feature or request label Nov 27, 2024
@majieyue majieyue modified the milestones: Backlog, v0.4.0 Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants