-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle GPU lost in resource monitor #1335
base: master
Are you sure you want to change the base?
Handle GPU lost in resource monitor #1335
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1335 +/- ##
==========================================
+ Coverage 81.23% 81.31% +0.08%
==========================================
Files 231 233 +2
Lines 22101 22297 +196
==========================================
+ Hits 17953 18131 +178
- Misses 4148 4166 +18 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
) | ||
|
||
|
||
class DiagnoseGPUErrorsOperator(InferenceOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name should not be 'DiagnoseXXX' if it is a resolver.
@@ -25,10 +21,10 @@ | |||
|
|||
class AgentContext(Singleton): | |||
def __init__(self): | |||
self._worker_spec: Optional[WorkerSpec] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the class spec?
observe_problems = self._observe([inf]) | ||
action = self.diagnose_problems(observe_problems) | ||
if isinstance(action, NodeAction): | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 'return' here?
@@ -93,6 +96,7 @@ def __init__(self): | |||
self._gpu_enabled = False | |||
self._gpu_stats: list[GPUStats] = [] | |||
self._master_client = MasterClient.singleton_instance() | |||
self._diagnosis_agent = DiagnosisAgent.singleton_instance() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should create a 'operator' to replace the current ResourceMonitor
instead of giving ResourceMonitor
the DiagnosisAgent
.
if isinstance(action, NoAction): | ||
return | ||
self._process_diagnosis_action(action) | ||
# avoid to execute the same event action too frequently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need more annotations here to explain the logic
time_diff = timestamp_diff_in_seconds( | ||
action.timestamp, datetime.now().timestamp() | ||
) | ||
expired_time_period = action.expired_time_period - time_diff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should take care of 'expired_time_period <= 0'
What changes were proposed in this pull request?
executable_time_period
in DiagnosisAction. This action can only be executed after this time period. Meanwhile, DiagnosisActionQueue does not allow the enqueue of the same action. In this way, DLRover could avoid handling the same error too frequently.Why are the changes needed?
GPU list is a strong hint for potential errors.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test.