[Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric #4558

vitorguidi · 2024-12-26T14:31:47Z

This merges #4547 and #4516 to the chrome branch.

…stuck in analyze (#4547) ### Motivation We currently have no way to tell if analyze task was successfully executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would only track duration for tasks that did finish. An analyze_pending field is added to the Testcase entity in datastore, which is set to False by default, to True for manually uploaded testcases, and to False once analyze task postprocess runs. It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a status label, so we can know at what step the testcase is stuck, thus allowing us to alert if analyze is taking longer to finish than expected. The alert itself could be, for instance, P50 age of untriaged testcase (status=analyze_pending) > 3h. Also, this retroactively addresses comments from #4481: * Fixes docstring for emit_testcase_triage_duration_metric * Removes assertions * Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION, since it now accounts for fuzzer generated testcases * Use a boolean "from_fuzzer" field, instead of "origin" string, in TESTCASE_TRIAGE_DURATION

#4516) ### Motivation #4458 implemented an error rate for utasks, only considering exceptions. In #4499 , outcomes were split between success, failure and maybe_retry conditions. There we learned that the volume of retryable outcomes is negligible, so it makes sense to count them as failures. Listing out all the success conditions under _MetricRecorder is not desirable. However, we are consciously taking this technical debt so we can deliver #4271 . A refactor of uworker main will be later performed, so we can split the success and failure conditions, both of which are mixed in uworker_output.ErrorType. Reference for tech debt acknowledgement: #4517

jonathanmetzman · 2024-12-26T15:04:02Z

src/clusterfuzz/_internal/datastore/data_types.py

@@ -583,6 +583,10 @@ class Testcase(Model):
  # Tracks if a testcase is stuck during triage.
  stuck_in_triage = ndb.BooleanProperty(default=False)

+  # Tracks if analyze task is pending.


Do we want to track if just analyze is pending and not analyze/minimize/regressoin?

Minimize/regression/impact are tracked together under critical tasks. The idea for spliting analyze, is so we would be able to catch the infamous pubsub incident. Also, there is no reason analyze would fail: the testcase either reproduces or not, if it does not complete, then there is trouble, as opposed to minimize, where there are legitimate reasons for it to get stuck.

Since we already have quite a few stuck testcases under critical tasks, we do not gain much by splitting until we clean up the old stuff.

jonathanmetzman · 2024-12-26T15:16:20Z

lgtm

vitorguidi added 2 commits December 26, 2024 14:19

vitorguidi requested review from alhijazi and jonathanmetzman December 26, 2024 14:31

vitorguidi changed the title ~~Split utasks properly into success and error, add analyze step to testcase triage metric~~ [Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric Dec 26, 2024

vitorguidi merged commit 5655b43 into chrome Dec 26, 2024
3 checks passed

vitorguidi deleted the chore/chrome-monitoring-final branch December 26, 2024 14:56

jonathanmetzman reviewed Dec 26, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric #4558

[Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric #4558

vitorguidi commented Dec 26, 2024

jonathanmetzman Dec 26, 2024

vitorguidi Dec 26, 2024 •

edited

Loading

jonathanmetzman commented Dec 26, 2024

[Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric #4558

[Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric #4558

Conversation

vitorguidi commented Dec 26, 2024

jonathanmetzman Dec 26, 2024

Choose a reason for hiding this comment

vitorguidi Dec 26, 2024 • edited Loading

Choose a reason for hiding this comment

jonathanmetzman commented Dec 26, 2024

vitorguidi Dec 26, 2024 •

edited

Loading