Handle GPU lost in resource monitor #1335

Open
wants to merge 13 commits into
base: master

@samplise (Collaborator) commented Nov 15, 2024

What changes were proposed in this pull request?

  1. Handle GPU-lost errors in the resource monitor.
  2. Add a new field, executable_time_period, to DiagnosisAction. An action can only be executed after this time period has elapsed. In addition, DiagnosisActionQueue rejects an action that is already enqueued. Together, these keep DLRover from handling the same error too frequently.
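
The two mechanisms in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Action`, `ActionQueue`, `key`, `add_action`, and `next_action` are hypothetical stand-ins for DiagnosisAction and DiagnosisActionQueue, and the rule that two actions are "the same" when their type and instance match is an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical stand-in for DiagnosisAction."""

    action_type: str
    instance: int
    # Seconds that must pass after creation before the action may run.
    executable_time_period: float = 0.0
    timestamp: float = field(default_factory=time.time)

    def is_executable(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now - self.timestamp >= self.executable_time_period

    def key(self) -> Tuple[str, int]:
        # Assumption: same type + same instance means "the same action".
        return (self.action_type, self.instance)


class ActionQueue:
    """Hypothetical stand-in for DiagnosisActionQueue."""

    def __init__(self) -> None:
        self._actions: List[Action] = []

    def add_action(self, action: Action) -> bool:
        # Reject an action that is already waiting in the queue.
        if any(a.key() == action.key() for a in self._actions):
            return False
        self._actions.append(action)
        return True

    def next_action(self, now: Optional[float] = None) -> Optional[Action]:
        # Return the first action whose waiting period has elapsed.
        for i, a in enumerate(self._actions):
            if a.is_executable(now):
                return self._actions.pop(i)
        return None
```

With this shape, a repeated GPU-lost report produces at most one pending action, and that action is not picked up until its waiting period has passed.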

Why are the changes needed?

A lost GPU is a strong hint of potential errors.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 91.25000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 81.31%. Comparing base (b3bf606) to head (7eb8b72).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...eoperator/resolver/diagnose_gpu_errors_operator.py | 78.26% | 5 Missing ⚠️ |
| ...tor/observer/check_resource_collection_operator.py | 80.95% | 4 Missing ⚠️ |
| .../python/elastic_agent/diagnosis/diagnosis_agent.py | 85.71% | 4 Missing ⚠️ |
| dlrover/python/elastic_agent/torch/training.py | 90.00% | 3 Missing ⚠️ |
| ...lrover/python/diagnosis/common/diagnosis_action.py | 95.91% | 2 Missing ⚠️ |
| dlrover/python/util/time_util.py | 75.00% | 2 Missing ⚠️ |
| ...ver/python/diagnosis/inferencechain/coordinator.py | 85.71% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1335      +/-   ##
==========================================
+ Coverage   81.23%   81.31%   +0.08%     
==========================================
  Files         231      233       +2     
  Lines       22101    22297     +196     
==========================================
+ Hits        17953    18131     +178     
- Misses       4148     4166      +18     

@BalaBalaYi BalaBalaYi added this to the v0.4.0 milestone Nov 18, 2024
@samplise changed the title from "WIP: handle GPU lost in resource monitor" to "Handle GPU lost in resource monitor" Nov 19, 2024
)


class DiagnoseGPUErrorsOperator(InferenceOperator):
Collaborator

The name should not be 'DiagnoseXXX' if it is a resolver.

@@ -25,10 +21,10 @@

class AgentContext(Singleton):
    def __init__(self):
        self._worker_spec: Optional[WorkerSpec] = None
Collaborator

Why remove the class spec?

observe_problems = self._observe([inf])
action = self.diagnose_problems(observe_problems)
if isinstance(action, NodeAction):
    return
Collaborator

Why 'return' here?

@@ -93,6 +96,7 @@ def __init__(self):
self._gpu_enabled = False
self._gpu_stats: list[GPUStats] = []
self._master_client = MasterClient.singleton_instance()
self._diagnosis_agent = DiagnosisAgent.singleton_instance()
Collaborator

We should create an 'operator' to replace the current ResourceMonitor instead of giving ResourceMonitor the DiagnosisAgent.

if isinstance(action, NoAction):
    return
self._process_diagnosis_action(action)
# avoid to execute the same event action too frequently
Collaborator

This needs more comments to explain the logic.

time_diff = timestamp_diff_in_seconds(
    action.timestamp, datetime.now().timestamp()
)
expired_time_period = action.expired_time_period - time_diff
Collaborator

This should handle the case 'expired_time_period <= 0'.
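
One possible way to cover that case is to clamp the remaining period at zero, so an action that has already waited long enough is treated as due now rather than carrying a negative period. The sketch below is an assumption-laden illustration: `remaining_period` is a hypothetical helper, and it uses plain subtraction of epoch seconds in place of `timestamp_diff_in_seconds`, whose exact behavior is not shown in this thread.

```python
def remaining_period(
    action_timestamp: float, expired_time_period: float, now: float
) -> float:
    """Seconds left before the period ends, never negative.

    Hypothetical helper: all arguments are in seconds, and the
    timestamps are epoch seconds.
    """
    # Guard against clock skew producing a negative elapsed time.
    elapsed = max(0.0, now - action_timestamp)
    # Clamp so a fully elapsed period yields 0 instead of a negative value.
    return max(0.0, expired_time_period - elapsed)
```

A caller could then treat a result of 0 as "the period is over" and branch on that explicitly.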

2 participants