Handle GPU lost in resource monitor #1335

samplise · 2024-11-15T00:54:56Z

What changes were proposed in this pull request?

Handle GPU-lost errors in the resource monitor.
Add a new field, executable_time_period, to DiagnosisAction; an action can be executed only after this period has elapsed. In addition, DiagnosisActionQueue rejects an action that is already queued. Together, these keep DLRover from handling the same error too frequently.
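
A minimal sketch of the delay-plus-deduplication idea described above. This is not the DLRover implementation: `SketchAction`, `SketchActionQueue`, `add_action`, `next_action` and `is_executable` are hypothetical names, and treating "the same action" as "same action type" is an assumption made only for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchAction:
    """Hypothetical stand-in for DiagnosisAction (not the DLRover class)."""

    action_type: str
    # Seconds that must elapse after creation before the action may run.
    executable_time_period: float = 0.0
    timestamp: float = field(default_factory=time.time)

    def is_executable(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now - self.timestamp >= self.executable_time_period


class SketchActionQueue:
    """Hypothetical stand-in for DiagnosisActionQueue."""

    def __init__(self) -> None:
        self._actions: List[SketchAction] = []

    def add_action(self, action: SketchAction) -> bool:
        # Assumption: two actions are "the same" if their types match.
        if any(a.action_type == action.action_type for a in self._actions):
            return False
        self._actions.append(action)
        return True

    def next_action(self, now: Optional[float] = None) -> Optional[SketchAction]:
        # Pop the first queued action whose waiting period has elapsed.
        for i, action in enumerate(self._actions):
            if action.is_executable(now):
                return self._actions.pop(i)
        return None
```

With this shape, a second "GPU lost" report that arrives while the first is still queued is dropped, and the queued one is only handed out once its waiting period has passed.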

Why are the changes needed?

A lost GPU is a strong hint of potential errors.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

codecov · 2024-11-15T01:27:32Z

Codecov Report

Attention: Patch coverage is 91.25000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 81.31%. Comparing base (b3bf606) to head (7eb8b72).

Files with missing lines	Patch %	Lines
...eoperator/resolver/diagnose_gpu_errors_operator.py	78.26%	5 Missing ⚠️
...tor/observer/check_resource_collection_operator.py	80.95%	4 Missing ⚠️
.../python/elastic_agent/diagnosis/diagnosis_agent.py	85.71%	4 Missing ⚠️
dlrover/python/elastic_agent/torch/training.py	90.00%	3 Missing ⚠️
...lrover/python/diagnosis/common/diagnosis_action.py	95.91%	2 Missing ⚠️
dlrover/python/util/time_util.py	75.00%	2 Missing ⚠️
...ver/python/diagnosis/inferencechain/coordinator.py	85.71%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1335      +/-   ##
==========================================
+ Coverage   81.23%   81.31%   +0.08%     
==========================================
  Files         231      233       +2     
  Lines       22101    22297     +196     
==========================================
+ Hits        17953    18131     +178     
- Misses       4148     4166      +18

BalaBalaYi · 2024-11-27T06:25:17Z

...r/python/diagnosis/inferencechain/inferenceoperator/resolver/diagnose_gpu_errors_operator.py

+)
+
+
+class DiagnoseGPUErrorsOperator(InferenceOperator):


The name should not be 'DiagnoseXXX' if it is a resolver.

BalaBalaYi · 2024-11-27T06:26:20Z

dlrover/python/elastic_agent/context.py

@@ -25,10 +21,10 @@

 class AgentContext(Singleton):
    def __init__(self):
-        self._worker_spec: Optional[WorkerSpec] = None


Why remove the class spec?

BalaBalaYi · 2024-11-27T06:31:16Z

dlrover/python/elastic_agent/diagnosis/diagnosis_agent.py

+        observe_problems = self._observe([inf])
+        action = self.diagnose_problems(observe_problems)
+        if isinstance(action, NodeAction):
+            return


Why 'return' here?

BalaBalaYi · 2024-11-27T06:34:30Z

dlrover/python/elastic_agent/monitor/resource.py

@@ -93,6 +96,7 @@ def __init__(self):
        self._gpu_enabled = False
        self._gpu_stats: list[GPUStats] = []
        self._master_client = MasterClient.singleton_instance()
+        self._diagnosis_agent = DiagnosisAgent.singleton_instance()


We should create an 'operator' to replace the current ResourceMonitor instead of giving ResourceMonitor the DiagnosisAgent.

BalaBalaYi · 2024-11-27T06:49:11Z

dlrover/python/elastic_agent/torch/training.py

+        if isinstance(action, NoAction):
+            return
+        self._process_diagnosis_action(action)
+        # avoid to execute the same event action too frequently


Need more comments here to explain the logic.

BalaBalaYi · 2024-11-27T06:51:41Z

dlrover/python/elastic_agent/torch/training.py

+            time_diff = timestamp_diff_in_seconds(
+                action.timestamp, datetime.now().timestamp()
+            )
+            expired_time_period = action.expired_time_period - time_diff


Should handle the case 'expired_time_period <= 0'.
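
One possible way to guard that case, as a sketch only. `remaining_period` is a hypothetical helper, all values are assumed to be in seconds, and returning `None` for an already-expired action is just one choice; the PR may prefer to clamp to zero or drop the action elsewhere.

```python
from typing import Optional


def remaining_period(
    created_at: float, expired_time_period: float, now: float
) -> Optional[float]:
    """Return the seconds left before the action expires, or None if none are left.

    Hypothetical helper; all arguments are assumed to be in seconds.
    """
    remaining = expired_time_period - (now - created_at)
    if remaining <= 0:
        # Already expired: signal the caller to skip instead of re-queueing
        # with a zero or negative period.
        return None
    return remaining
```

The caller can then re-queue the action only when the result is not `None`.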

BO SANG added 2 commits November 14, 2024 16:52

handle gpu lost in resource monitor

076e5aa

handle gpu lost in resource monitor

6dde15a

samplise requested review from workingloong, BalaBalaYi and majieyue as code owners November 15, 2024 00:54

BO SANG added 2 commits November 14, 2024 16:57

fix pre-commit

de42c72

resovle conflict

9c9014a

BO SANG added 2 commits November 14, 2024 18:58

make comptiable with tf

be5c230

make comptiable with tf

a52be14

BalaBalaYi added this to the v0.4.0 milestone Nov 18, 2024

BO SANG added 2 commits November 19, 2024 11:48

Merge remote-tracking branch 'origin/master' into report-gpu-lost

d582ff2

handle resource monitor error

42d86bb

samplise changed the title ~~WIP: handle GPU lost in resource monitor~~ Handle GPU lost in resource monitor Nov 19, 2024

BO SANG added 5 commits November 20, 2024 12:02

handle resource monitor error

eb5eb31

handle resource monitor error

cfe428b

handle resource monitor error

0a02d76

merge master

8dca3df

merge master

7eb8b72

BalaBalaYi reviewed Nov 27, 2024

BalaBalaYi requested changes Nov 27, 2024