[Monitoring] Adding a metric for task outcome #4458

Merged: 13 commits merged into master on Dec 10, 2024

Conversation

@vitorguidi (Collaborator) commented Nov 27, 2024

Motivation

We currently have no metric that tracks the error rate for each task. This PR adds one: the error rate can be obtained by taking the sum of the metric with outcome=failure and dividing it by the overall sum.

This is useful for SLI alerting.

Part of #4271
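
For illustration, a minimal sketch of how that ratio falls out of a labeled counter (plain Python with made-up sample counts; the real metric is exported to the monitoring backend and the ratio would be computed in a query there):

```python
from collections import Counter

# Hypothetical sample of TASK_OUTCOME_COUNT increments, keyed by label tuples.
# The real metric carries more labels (job, subtask, mode, platform); only
# task/outcome are shown to keep the example small.
counts = Counter({
    ('fuzz', 'success'): 940,
    ('fuzz', 'failure'): 60,
    ('progression', 'success'): 480,
    ('progression', 'failure'): 20,
})

def error_rate(task):
  """Error rate = sum(outcome=failure) / sum(all outcomes) for one task."""
  failures = sum(v for (t, o), v in counts.items() if t == task and o == 'failure')
  total = sum(v for (t, o), v in counts.items() if t == task)
  return failures / total if total else 0.0

print(error_rate('fuzz'))         # 0.06
print(error_rate('progression'))  # 0.04
```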

@jonathanmetzman (Collaborator) left a comment

utask_main is run with utasks.uworker_bot_main. You might want to catch this too.

Review thread on src/clusterfuzz/_internal/bot/tasks/commands.py (outdated, resolved)
@vitorguidi force-pushed the feature/task-error-rate branch from 872273a to 53cdc9e on December 9, 2024 04:45
@jonathanmetzman (Collaborator) left a comment

lgtm

# failure.
outcome = 'error' if _exc_type or self.saw_failure else 'success'
monitoring_metrics.TASK_OUTCOME_COUNT.increment({
    **self._labels, 'outcome': outcome
})
Collaborator:

Since we have a nice enum of possible errors, I'd much rather we be more specific here and record it as the outcome: e.g. result: 'NO_ERROR' or result: 'BUILD_SETUP_FAILED'. It would greatly help in debugging problems.

Exceptions can be bucketed as their own result: 'UNHANDLED_EXCEPTION'.

Do you think it will pose a problem with metric cardinality?
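
A minimal sketch of that suggestion, assuming the proto module is importable as clusterfuzz._internal.protos.uworker_msg_pb2 (the helper name is hypothetical, not the merged code):

```python
from clusterfuzz._internal.protos import uworker_msg_pb2

def outcome_label(error_type, exc_type=None):
  """Hypothetical helper: turn a task result into a metric label value."""
  if exc_type is not None:
    # Unhandled exceptions get their own bucket, as suggested above.
    return 'UNHANDLED_EXCEPTION'
  # Otherwise record the enum name, e.g. 'NO_ERROR' or 'BUILD_SETUP_FAILED'.
  return uworker_msg_pb2.ErrorType.Name(error_type)
```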

@vitorguidi (Collaborator, Author):

Hmm, now that I think of it, we should get rid of job.

For OSS-Fuzz:

  • 15 tasks
  • ~1300 projects, so ~1300 jobs
  • 3 subtasks (pre/main/post)
  • 2 modes (batch/queue)
  • 3 platforms (linux/mac/win)
  • 2 outcomes (fail/success)

This comes out to around 700k possible label combinations, which will bite us. We could swap the job label out for the full set of outcomes; that gives around 16k possible label combinations, which should be fine. Wdyt?
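
Spelling the estimate out: 15 × 1300 × 3 × 2 × 3 × 2 = 702,000 possible label combinations. Dropping job while keeping everything else gives 15 × 3 × 2 × 3 × 2 = 540 with a binary outcome, or 15 × 3 × 2 × 3 × 40 = 10,800 with the ~40 detailed outcomes, which is the same order of magnitude as the ~16k quoted here.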

@vitorguidi (Collaborator, Author):

@jonathanmetzman the current state of the subtask duration metric allows for around 350k different label combinations; GCP's recommendation is at most 30k. Should we get rid of job in this context manager altogether?

Collaborator:

What's the limit? I think there are other things that can be removed:

  1. I think modes is definitely not needed; the two modes won't live side by side much longer.
  2. I think platform may be unneeded; it can be obtained through the job name.

Would reducing cardinality by 6X help?

Collaborator:

Also, there are currently about 1000 platforms in OSS-Fuzz, but that is going away. Will that help?

@vitorguidi (Collaborator, Author) commented Dec 9, 2024

It is actually 40 distinct values under that enum.
15 tasks * 40 outcomes * 1300 jobs * 3 subtasks > 1M, so there is no chance at all of keeping the discriminated outcomes while also keeping job.

Another option is to spin up a separate metric, TASK_OUTCOME_BY_ERROR_TYPE, which tracks all the labels except job and manages to stay under 30k distinct label combinations.

Wdyt @letitz?

As far as converting the proto value to a name goes, it can be done like this:

>>> uworker_msg_pb2.ErrorType.Name(uworker_msg_pb2.ErrorType.NO_ERROR)
'NO_ERROR'
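
A sketch of what that separate metric could look like, assuming the monitor.CounterMetric / monitor.StringField helpers used for other ClusterFuzz metrics (the metric name and field set here are illustrative, not the merged definition):

```python
from clusterfuzz._internal.metrics import monitor

# Illustrative: same shape as TASK_OUTCOME_COUNT, but without the
# high-cardinality 'job' label, so the ~40 error_condition values stay
# well under the ~30k label-combination budget.
TASK_OUTCOME_COUNT_BY_ERROR_TYPE = monitor.CounterMetric(
    'task/outcome_by_error_type',
    description='Task outcome, broken down by uworker error type.',
    field_spec=[
        monitor.StringField('task'),
        monitor.StringField('subtask'),
        monitor.StringField('mode'),
        monitor.StringField('platform'),
        monitor.StringField('error_condition'),
    ])
```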

Collaborator:

Sorry, why 40 outcomes?

@vitorguidi (Collaborator, Author):

38 ErrorTypes from the utask_main output, 'N/A' (to indicate no error), and 'UNHANDLED_EXCEPTION', as suggested by titouan.

Collaborator:

Fine with me to have a separate metric for error types that does not group by job.

> I think modes is definitely not needed, the two modes won't live side by side much longer.
> I think platform may be unneeded, it can be obtained through the job name.
> Would reducing cardinality by 6X help?

Mode does not contribute much to cardinality, since we have a mostly static mapping from (task, platform) to mode. Platform is actually free since, as you point out, it is implied by job, so there are not 3 copies of each job, one per platform. IIRC what actually matters is the set of label combinations actually observed, not the potential cardinality if we generated all possible labels.

Review thread on src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py (outdated, resolved)
@vitorguidi requested a review from letitz on December 10, 2024 00:12
@vitorguidi (Collaborator, Author):

Merging this for the sake of quick iteration; we can revisit it if folks feel it is necessary.

@vitorguidi merged commit 514cec0 into master on Dec 10, 2024
7 checks passed
@vitorguidi deleted the feature/task-error-rate branch on December 10, 2024 23:44
vitorguidi added a commit that referenced this pull request on Dec 11, 2024 (commit message repeats the PR description above).
vitorguidi added a commit that referenced this pull request on Dec 11, 2024:

### Motivation

This merges #4489, #4458 and #4483 into the chrome temporary deployment branch.

The purpose is to have task error rate metrics, and to log which old testcases are polluting the testcase upload metrics, so we can figure out whether a purge is necessary.

Co-authored-by: jonathanmetzman <[email protected]>
@jonathanmetzman (Collaborator):

I think this metric needs some temporary reworking.

[image]

As is, it's not very useful: the error rate sits too close to 100% because ClusterFuzz treats expected misbehavior (bad builds) as exceptional when it is not.
I think until ClusterFuzz is refactored, this metric would be more useful if it only measured exceptions, not handled errors.

@letitz (Collaborator) commented Dec 12, 2024

It may not be useful for alerting, but I for one find this data interesting. I would not have assumed that most tasks would end in "temporary failure".

Maybe that's the breakdown we want here, though: "success", "temporary error"/"retry", and "failure"?

@vitorguidi (Collaborator, Author) commented Dec 12, 2024

This is too complex IMO; I agree with Jonathan's approach of only considering UNHANDLED_EXCEPTION as failure.
We can probably get rid of outcome and only emit this 'error_condition' type, which will be there for analysis when we are troubleshooting.

Breaking down by success/retry/failure is too janky, as we would have to map these 38 ErrorType enums to 3 sets, polluting the codebase and making things hard to understand. Also, for every new ErrorType added, this metric would have to be updated, adding further friction to development.

I would much rather treat only UNHANDLED_EXCEPTION as an error, since it will not lead to false positives, and simplify things.

@letitz (Collaborator) commented Dec 12, 2024

> Breaking down by success/retry/failure is too janky, as we would have to map these 38 ErrorType enums to 3 sets, which would lead to polluting the codebase and make things hard to understand. Also, for every new ErrorType added, this metric would have to be updated, adding further friction to development.

I would have thought the opposite. Right now, if I want to know what it means for a regression task to fail with REGRESSION_NO_CRASH, I have to go read the code. If I could easily see that a certain error just means we'll retry the task, I would find the codebase easier to reason through. That said, this reminds me that errors don't map cleanly onto retrying or not, such as when we notice a flake and might or might not retry depending on whether it's a first offense.

Ultimately, I would like to know when there are jobs whose tasks are failing too often, so I can go investigate.

@vitorguidi (Collaborator, Author) commented Dec 12, 2024

> Breaking down by success/retry/failure is too janky, as we would have to map these 38 ErrorType enums to 3 sets, which would lead to polluting the codebase and make things hard to understand. Also, for every new ErrorType added, this metric would have to be updated, adding further friction to development.
>
> I would have thought the opposite. Right now, if I want to know what it means for regression task to fail with REGRESSION_NO_CRASH, then I have to go read the code. If I could see easily that a certain error just means we'll retry the task, I would find the codebase easier to reason through. That said, this reminds me that errors don't map cleanly to retrying or not, such as when we notice a flake and might or might not retry depending on whether it's a first offense.
>
> Ultimately I would like to know when there are jobs whose tasks are failing too often, so I can go investigate.

We can partition these errors on a best-effort basis into three sets:

  • things that unequivocally imply success (e.g. NO_ERROR and ANALYZE_DOES_NOT_REPRODUCE),
  • things for which there MIGHT BE a retry (e.g. PROGRESSION_TIMEOUT, MINIMIZE_DEADLINE_EXCEEDED),
  • things that unequivocally imply failure (e.g. FUZZ_DATA_BUNDLE_SETUP_FAILURE, unhandled exceptions).

The above would go into TASK_OUTCOME_COUNT, under the 'outcome' label:

  • success
  • potential_retry
  • failure

Wdyt? This solves the problem: it will at least be possible to filter for the unambiguous failures, and those will be the jobs to drill down on further.
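
An illustrative sketch of that bucketing, operating on the enum names produced by uworker_msg_pb2.ErrorType.Name() as shown earlier (the set membership below uses only the values named in this thread and is an example, not the merged mapping):

```python
# Example members only, taken from the names mentioned in this thread;
# a real mapping has to place all ~38 uworker_msg_pb2.ErrorType values.
_SUCCESS = {'NO_ERROR', 'ANALYZE_DOES_NOT_REPRODUCE'}
_POTENTIAL_RETRY = {'PROGRESSION_TIMEOUT', 'MINIMIZE_DEADLINE_EXCEEDED'}

def outcome(error_type_name, unhandled_exception=False):
  """Bucket an ErrorType name into success / potential_retry / failure."""
  if unhandled_exception:
    return 'failure'
  if error_type_name in _SUCCESS:
    return 'success'
  if error_type_name in _POTENTIAL_RETRY:
    return 'potential_retry'
  return 'failure'
```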

@letitz (Collaborator) commented Dec 12, 2024

Fine with me! @alhijazi might still want to have a say

@vitorguidi (Collaborator, Author) commented Dec 13, 2024

> Fine with me! @alhijazi might still want to have a say

Given the 30k distinct-label limitation, we will not be able to drill down by ErrorType and job simultaneously:

  • 38 error types + 'N/A' + 'UNHANDLED_EXCEPTION' = 40 values for the error label
  • 1300+ jobs in oss-fuzz

40 × 1300 ≈ 52k > 30k, so this is as far as we can go. Restating: drilling down by job and ErrorType at the same time is not feasible.

vitorguidi added a commit that referenced this pull request Dec 16, 2024
…ss, maybe_retry and failure outcomes (#4499)

### Motivation

#4458 implemented a task outcome metric, so we can track error rates in
utasks, by job/task/subtask.

As failures are expected for ClusterFuzz, initially only unhandled
exceptions would be considered as actual errors. Chrome folks asked for
a better partitioning of error codes, which is implemented here as the
following outcomes:

* success: the task has unequivocally succeeded, producing a sane result
* maybe_retry: some transient error happened, and the task is
potentially being retried. This might capture some unretriable failure
condition, but it is a compromise we are willing to make in order to
decrease false positives.
* failure: the task has unequivocally failed.

Part of #4271
vitorguidi added a commit that referenced this pull request on Dec 16, 2024 (same commit message as above).
@alhijazi (Collaborator):

> Fine with me! @alhijazi might still want to have a say

Fine with me also!

vitorguidi added a commit that referenced this pull request on Dec 23, 2024:
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499, outcomes were split between success, failure and maybe_retry conditions. There we learned that the volume of retryable outcomes is negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not desirable. However, we are consciously taking on this technical debt so we can deliver #4271.

A refactor of uworker main will be performed later, so we can split the success and failure conditions, both of which are mixed in uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
vitorguidi added a commit that referenced this pull request on Dec 26, 2024 (same commit message as above).
vitorguidi added a commit that referenced this pull request on Dec 27, 2024 (same commit message as above).