Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric #4558

Merged
merged 2 commits into from
Dec 26, 2024

Conversation

vitorguidi
Copy link
Collaborator

This merges #4547 and #4516 to the chrome branch.

…stuck in analyze (#4547)

### Motivation

We currently have no way to tell if analyze task was successfully
executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would
only track duration for tasks that did finish.

An analyze_pending field is added to the Testcase entity in datastore,
which is set to False by default, to True for manually uploaded
testcases, and to False once analyze task postprocess runs.

It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a
status label, so we can know at what step the testcase is stuck, thus
allowing us to alert if analyze is taking longer to finish than
expected.

The alert itself could be, for instance, P50 age of untriaged testcase
(status=analyze_pending) > 3h.

Also, this retroactively addresses comments from #4481:

* Fixes docstring for emit_testcase_triage_duration_metric
* Removes assertions
* Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION,
since it now accounts for fuzzer generated testcases
* Use a boolean "from_fuzzer" field, instead of "origin" string, in
TESTCASE_TRIAGE_DURATION
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499 , outcomes were split between success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking this technical debt so we
can deliver #4271 .

A refactor of uworker main will be later performed, so we can split the
success and failure conditions, both of which are mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
@vitorguidi vitorguidi changed the title Split utasks properly into success and error, add analyze step to testcase triage metric [Chrome deployment] Split utasks properly into success and error, add analyze step to testcase triage metric Dec 26, 2024
@vitorguidi vitorguidi merged commit 5655b43 into chrome Dec 26, 2024
3 checks passed
@vitorguidi vitorguidi deleted the chore/chrome-monitoring-final branch December 26, 2024 14:56
@@ -583,6 +583,10 @@ class Testcase(Model):
# Tracks if a testcase is stuck during triage.
stuck_in_triage = ndb.BooleanProperty(default=False)

# Tracks if analyze task is pending.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to track if just analyze is pending and not analyze/minimize/regressoin?

Copy link
Collaborator Author

@vitorguidi vitorguidi Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimize/regression/impact are tracked together under critical tasks. The idea for spliting analyze, is so we would be able to catch the infamous pubsub incident. Also, there is no reason analyze would fail: the testcase either reproduces or not, if it does not complete, then there is trouble, as opposed to minimize, where there are legitimate reasons for it to get stuck.

Since we already have quite a few stuck testcases under critical tasks, we do not gain much by splitting until we clean up the old stuff.

@jonathanmetzman
Copy link
Collaborator

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants