Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] Temporary chrome deployment #4440

Open
wants to merge 47 commits into
base: master
Choose a base branch
from
Open

[DO NOT MERGE] Temporary chrome deployment #4440

wants to merge 47 commits into from

Conversation

vitorguidi
Copy link
Collaborator

No description provided.

paulsemel and others added 30 commits November 8, 2024 09:37
We are currently logging progression when unpacking, but we're missing
the last log. This can be useful to monitor what's going on in CF.
### Motivation

We currently lack metrics for build retrieval and unpacking times. This
PR adds that, with granularity by fuzz target and job type.

There are two different implementations for build downloading/unpacking:

- In the Build class, from which RegularBuild, SplitTargetBuild,
FuchsiaBuild and SymbolizedBuild inherit the downloading/unpacking
behavior
- In the CustomBuild class, which implements its own logic

There are two possible cases for downloading/unpacking: clusterfuzz
either downloads the whole build and unpacks it locally, or unpacks it
remotely. This is the case for all build types except CustomBuild, which
does not perform remote unpacking.

For build retrieval over http, we do not track download time. For all
the other cases, it suffices to keep track of start/finish time for
download and unpacking.

Finally, a _build_type is added to the constructor of the Build class,
from which all other inherit. This is used to track the build type
(debug or release), and is only mutated by SymbolizedBuild when
attempting to fetch a debug build.

Part of #4271
For non engine fuzzers, we do not need to have the full list of fuzz
targets, since those builds are only containing the APP_NAME. For that
reason, and when the build is fully unpacked, lazily fetch the fuzzing
targets only when requested by the user of the class.

This change will tremendously speed up the unpacking step for some of
our fuzzers, see
https://docs.google.com/document/d/1OepfVcuG2XNXLxgZIXVwgE-uOhX9RO5RfH0TSg_wmPU/.

Co-authored-by: jonathanmetzman <[email protected]>
### Motivation

We currently lack awareness on how old builds are during fuzz task. This
PR implements that, by making the assumption that the Last Update Time
metadata field in GCS is a good proxy for build age. [Documentation
reference](https://cloud.google.com/storage/docs/json_api/v1/objects#resource)

### Approach

Symbolized and custom builds do not matter, thus all builds of interest
will be fetched from ```build_manager.setup_regular_build```. Logic for
collecting all bucket paths and the latest revision was refactored, so
that ```setup_regular_build``` can also figure out the latest revision
for a given build and conditionally emit the proposed metric.

### Testing strategy

!Todo: test this for fuzz, analyze, progression

Locally ran tasks, with instructions from #4343 and #4345 , and verified
the _emmit_build_age_metric function gets invoked and produces sane
output.

Commands used:
```
fuzz libFuzzer libfuzzer_asan_log4j2
```

![image](https://github.com/user-attachments/assets/66937297-20ec-44cf-925e-0004a905c92e)

```
progression 4992158360403968 libfuzzer_asan_qt
```

![image](https://github.com/user-attachments/assets/0e1f1199-d1d8-4da5-814e-8d8409d1f806)

```
analyze 4992158360403968 libfuzzer_asan_qt (disclaimer: build revision was overriden mid flight to force a trunk build, since this testcase was already tied to a crash revision)
```

![image](https://github.com/user-attachments/assets/dd3d5a60-36a1-4a9e-a21b-b72177ffdecd)


Part of #4271
### Motivation

Small changes for the build retrieval metric, as requested per Chrome
folks after initially using the feature:

* Adding the time for listing fuzz targets as a step
* Added the total retrieval duration
* Tracking metric as minutes, for readability

Part of #4271
### Motivation

The Chrome team has no easy visibility into how many manually uploaded
test cases flake or successfully reproduce. This PR implements a counter
metric to track that.

There are three possible outcomes, each represented by a string label:
'reproduces', 'one_timer' and 'does_not_reproduce'

Part of #4271
…4381)

### Motivation


Once a testcase is generated (or manually uploaded), followup tasks
(analyze/progression) are started. This happens by publishing to a
pubsub queue, both for the manually uploaded case, and for the fuzzer
generated case.

If for any reason the messages are not processed, the testcase gets
stuck. To get better visibility into these stuck testcases, the
UNTRIAGED_TESTCASE_AGE metric is introduced, to pinpoint how old these
testcases that have not yet been triaged are(more precisely, gone
through analyze/regression/impact/progression tasks).


### Attention points

Testcase.timestamp mutates in analyze task:


https://github.com/google/clusterfuzz/blob/6ed80851ad0f6f624c5b232b0460c405f0a018b5/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py#L589

This makes it unreliable as a source of truth for testcase creation
time. To circumvent that, a new ```created``` field is added to the
Testcase entity, from which we can derive the correct creation time.

Since this new field will only apply for testcases created after this PR
is merged, Testcase.timestamp will be used instead to calculate the
testcase age when the new field is missing.

### Testing strategy

Ran the triage cron locally, and verified the codepath for the metric is
hit and produces sane output (reference testcase: 4505741036158976).

![image](https://github.com/user-attachments/assets/6281b44f-768a-417e-8ec1-763f132c8181)


Part of #4271
Chrome security shepherds manually upload testcases through appengine,
triggering analyze task and, in case of a legitimate crash, the followup
progression tasks:
* Minimize
* Analyze
* Impact
* Regression
* Cleanup cronjob, when updating a bug to inform the user that all above
stages were finished

This PR adds instrumentation to track the time elapsed between the user
upload, and the completion of the above events.

* TestcaseUploadMetadata.timestamp was being mutated on the preprocess
stage for analyze task. This mutation was removed, so that this entity
can be the source of truth for when a testcase was in fact uploaded by
the user.

* The job name could be retrieved from the JOB_NAME env var within the
uworker, however this does not work for the cleanup use case. For this
reason, the job name is fetched from datastore instead.

* The ```query_testcase_upload_metadata``` method was moved from
analyze_task.py to a helpers file, so it could be reused across tasks
and on the cleanup cronjob

Every task mentioned was executed locally, with a valid uploaded
testcase. The codepath for the metric emission was hit and produced the
desired output, both for the tasks and the cronjob.

Part of #4271
### Motivation

Some cumulative distribution metrics (build age, retrieval, testcase
age, testcase triage duration) are misbehaving and capping at 1. This PR
intends to aid in debugging that.
### Motivation

Cumulative distribution metrics from the monitoring initiative were
incorrectly set to use the fixed width bucketer, and/or width=0.05 and
max_buckets=20. This caused percentile metrics to cap at 1, which was
wrong behavior.

This PR attempts to fix that by moving them all to Geometric Bucketer,
without the aforementioned limits. It also reverts #4429 , since it
apparently broke triage.py in chrome.

Part of #4271
…#4414)

### Motivation

Chrome folks need to know how long on average a fuzzer takes to generate
a testcase. This PR implements that.

Part of #4271
…g filing (#4415)

### Motivation

As per Chrome request, it is desirable to know how long it takes for an
issue to be opened, from the moment a testcase is created.

Part of #4271
https://crbug.com/380707237 is caused by this discrepency between master
and chrome branches.
Cherry picking #4441 onto chrome temp branch

Co-authored-by: Ali HIJAZI <[email protected]>
…ng everything (#4486)

Now that we are lazily checking for fuzzing targets, it makes sense to
allow remote unpacking even when unpacking the full archive.
Furthermore, it seems that remote unpacking performances are much higher
than local unpacking on CF bots, so this might improve overall
performances of the build_manager.
### Motivation

This merges #4489, #4458 and #4483 to the chrome temporary deployment
branch

The purpose is to have task error rate metrics, and log what old
testcases are polluting the testcase upload metrics, so we can figure
out if a purge is necessary

---------

Co-authored-by: jonathanmetzman <[email protected]>
This merges #4494 into the temporary chrome deployment
…4498)

The metric for untriaged testcae age was not considering bugs that were
being filed legitimately, so there was no metric emission at all.

Also, removes granularity in the stuck testcase count metric.
Merging #4500 into the chrome branch, doing CI checks first
#4503)

Fixes the following error:

```
AttributeError
'ProtoType' object has no attribute 'DESCRIPTOR'
Failed to flush metrics: 'ProtoType' object has no attribute 'DESCRIPTOR' Traceback (most recent call last): File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 118, in _flush_metrics metric.monitoring_v3_time_series(series, labels, start_time, end_time, File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 350, in monitoring_v3_time_series self._set_value(point.value, value) File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 404, in _set_value point.int64_value = value ^^^^^^^^^^^^^^^^^ File "/mnt/scratch0/clusterfuzz/src/third_party/proto/message.py", line 935, in __setattr__ pb_value = marshal.to_proto(pb_type, value) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/scratch0/clusterfuzz/src/third_party/proto/marshal/marshal.py", line 229, in to_proto proto_type.DESCRIPTOR.has_options ^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'ProtoType' object has no attribute 'DESCRIPTOR'
```
Running CI checks with a PR prior to deployment
This adds 9 archives from Fuzzilli to the general test-input archives
used by fuzzers on Clusterfuzz. The Fuzzilli-side archives are refreshed
every few days. We'll add a freshness metric in a follow up.

This was tested locally with butler.

See bug: https://crbug.com/362963886
mi-ac and others added 17 commits December 20, 2024 10:21
There is a clear way to partition ErrorType enums in the criteria
proposed, reverting this one.

Part of #4271

Co-authored-by: Vitor Guidi <[email protected]>
Otherwise new issues may error out because it's a required field.

Cherry pick: #4379

Co-authored-by: Oliver Chang <[email protected]>
They are unsupported on the new tracker and aren't needed anyway.
Example: https://issues.oss-fuzz.com/issues/379066972#comment1
Cherrypick of #4409

Co-authored-by: maflcko <[email protected]>
In application logs for debugging and in testcase comments for user
understanding.

Bug: https://crbug.com/380264190

Cherry pick of #4467

Co-authored-by: Titouan Rigoudy <[email protected]>
Custom binaries are often used emulators (i.e. the binary being used by
the host).

This code follows the same strategy used for the ADB binary above in
`get_adb_path()`.

Cherry pick: #4518

Co-authored-by: Mark Teffeteller <[email protected]>
It seems like some projects are not getting pruning tasks. This might be
because we were not querying jobs correctly until this patch.
Cherry pick of #4423
This allows manipulating the weight of individual FuzzerJob entries in
the database.

Complements the `fuzz-target set` command.
Cherry pick: #4392

Co-authored-by: Titouan Rigoudy <[email protected]>
…#4537)

### Motivation

oss_fuzz_build_status and oss_fuzz_apply_ccs are broken due to buganizer
rejecting emails without associated accounts. This serves as a
workaround so at least some notifications can suceeed.
(google/oss-fuzz#12536)

### Why this works

Buganizer throws 400 errors listing out all rejected emails in the
exception, so we can just parse the non gaia emails from that and
complete the execution by ignoring those.

### Testing strategy

Ran both cronjobs locally with the proposed new code, and verified they
ran to completion.

Cherry pick of #4406

Co-authored-by: Vitor Guidi <[email protected]>
Allow instances to specify the GCS bucket location for data bundle
buckets in `project.yaml` as a new key: `data_bundle_bucket_location`.

This will allow creating regional buckets instead of using the default
`US` multi-region which results in high data transfer costs in Chrome's
instance.

Cherry pick of: #4479

Co-authored-by: Titouan Rigoudy <[email protected]>
There were some hardcoded constants that needed to be updated for the
Google Issue Tracker migration.

We'll also stop adding the 'Project' custom field for now, because it's
hard to propagate custom field values and it's not essential for build
failure issues.

This should hopefully fix
google/oss-fuzz#12536 (comment).

Cherry-pick of #4378

Co-authored-by: Oliver Chang <[email protected]>
Introduces regex-based filtering directly in the API to improve
performance and reduce number of calls to ab api server.

Cherry pick: #4297

Co-authored-by: aditya-wazir <[email protected]>
The device check is updated to use find instead of a literal match so
that sanitized version of the devices (e.g: cheetah_hwasan) can also be
used

---------

Cherry pick: #4256

Co-authored-by: svasudevprasad <[email protected]>
… analyze step to testcase triage metric (#4558)

This merges #4547 and #4516 to the chrome branch.
Merging #4530 into the Chrome
branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants