[Monitoring] Create a dashboard for overall clusterfuzz health (#4497)
### Motivation

As a final step for the monitoring project, this PR creates a dashboard for overall system health. The reasoning behind it is to have:

* Business level metrics (fuzzing hours, generated testcases, issues filed, issues closed, testcases for which processing is pending)
* Testcase metrics (untriaged testcase age and count)
* SQS metrics (queue size and published messages, per topic)
* Datastore/GCS metrics (number of requests, error rate, and latencies)
* Utask level metrics (duration, number of executions, error rate, latency)

These are sufficient to apply the [RED methodology](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/) (rate, errors, and duration), provide high-level metrics we can alert on, and aid in troubleshooting outages with a well-defined methodology.

There were two options for committing this to version control: terraform or butler definitions. The first was chosen, since it is the preferred long-term solution and is also simpler to implement, because it supports copy-pasting the JSON definition from GCP.

### Attention points

This should be automatically imported from main.tf, so it should be sufficient to place the .tf file in the same folder and have butler deploy handle the terraform apply step.

### How to review

Head over to go/cf-chrome-health-beta, internally. The actual dashboard definition is not expected to be reviewed; it is just a dump of what was manually created in GCP.

Part of #4271
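
For readers unfamiliar with the terraform approach mentioned under "Attention points", a dashboard committed this way might look roughly like the sketch below. It is illustrative only and not the definition from this PR: the resource label `clusterfuzz_health`, the variable `project_id`, and the file name `clusterfuzz_health_dashboard.json` are hypothetical, and the real .tf file may embed the JSON inline instead of reading it from a separate file.

```hcl
# Hypothetical sketch; names below are assumptions, not taken from the PR.
variable "project_id" {
  type        = string
  description = "GCP project that hosts the dashboard (assumed variable name)."
}

resource "google_monitoring_dashboard" "clusterfuzz_health" {
  project = var.project_id

  # JSON definition copied from the dashboard that was created manually in GCP.
  dashboard_json = file("${path.module}/clusterfuzz_health_dashboard.json")
}
```

Because terraform loads every .tf file in a module directory, a file like this placed next to main.tf would be picked up by the same `terraform apply` run without further wiring.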