generated from giantswarm/template-app
add alerts for alloy-metrics #1417
Merged
30 commits
96364d1 add sensible alerts for alloy (QuentinBisson)
9c2f655 wip - add ongoing alerts (QuentinBisson)
2f24521 Merge branch 'main' into alloy-monitoring (QuentinBisson)
55078dd add dashboard annotation (QuentinBisson)
c111250 Merge branch 'main' into alloy-monitoring (QuentinBisson)
759ae7f Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/… (QuentinBisson)
9cee93a Update prometheus.rules.yml (QuentinBisson)
14d67c3 Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/… (QuentinBisson)
b9c1dea Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/… (QuentinBisson)
09929b0 Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/… (QuentinBisson)
40452a5 add missing tests (QuentinBisson)
fbc9c8d change based on ops-recipes (QuentinBisson)
a9713a5 Merge branch 'main' into alloy-monitoring (QuentinBisson)
9e72664 Clean up some rules a bit (QuentinBisson)
b7d53b3 Update CHANGELOG.md (QuentinBisson)
1d49161 Update helm-operations.rules.yml (QuentinBisson)
868779e Update systemd.rules.yml (QuentinBisson)
c15aab7 Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/… (QuentinBisson)
393738d Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/… (QuentinBisson)
54f9f72 Update helm/prometheus-rules/templates/platform/honeybadger/alerting-… (QuentinBisson)
bb9abda Update helm/prometheus-rules/templates/platform/honeybadger/alerting-… (QuentinBisson)
068d45d Update test/tests/providers/global/platform/honeybadger/alerting-rule… (QuentinBisson)
79dbfda Merge branch 'alloy-monitoring' into alloy-monitoring-monitoring (QuentinBisson)
2f9c07c add alerts for alloy-metrics (QuentinBisson)
263f93b Merge branch 'alloy-monitoring-monitoring' into alerts-for-alloy-metrics (QuentinBisson)
70095ef Merge branch 'main' into alerts-for-alloy-metrics (QuentinBisson)
4f9e241 improve monitoring agent down tests (QuentinBisson)
553a1a4 improve monitoring agent shards not satisfied tests (QuentinBisson)
c7b460b Update test/tests/providers/global/platform/atlas/alerting-rules/allo… (QuentinBisson)
b476eac Update test/tests/providers/global/platform/atlas/alerting-rules/allo… (QuentinBisson)
80 changes: 80 additions & 0 deletions
helm/prometheus-rules/templates/platform/atlas/alerting-rules/monitoring-pipeline.rules.yml
@@ -0,0 +1,80 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    {{- include "labels.common" . | nindent 4 }}
  name: monitoring-pipeline.rules
  namespace: {{ .Values.namespace }}
spec:
  groups:
  - name: monitoring-pipeline
    rules:
    - alert: MetricForwardingErrors
      annotations:
        description: '{{`Monitoring agent can''t communicate with Remote Storage API at {{ $labels.url }}.`}}'
        opsrecipe: monitoring-pipeline/
        dashboard: promRW001/prometheus-remote-write
      expr: |-
        rate(prometheus_remote_storage_samples_failed_total[10m]) > 0.1
        or rate(prometheus_remote_storage_samples_total[10m]) == 0
        or rate(prometheus_remote_storage_metadata_retried_total[10m]) > 0
      for: 1h
      labels:
        area: platform
        cancel_if_outside_working_hours: "true"
        severity: page
        team: atlas
        topic: observability
    - alert: JobScrapingFailure
      annotations:
        dashboard: servicemonitors-details/servicemonitors-details
        description: '{{`Monitoring agents for cluster {{$labels.installation}}/{{$labels.cluster_id}} has failed to scrape all targets in {{$labels.job}} job.`}}'
        summary: Monitoring agent failed to scrape all targets in a job.
        opsrecipe: monitoring-job-scraping-failure/
hervenicol marked this conversation as resolved.
      expr: |-
        (
          count(up == 0) by (job, installation, cluster_id, provider, pipeline)
          /
          count(up) by (job, installation, cluster_id, provider, pipeline)
        ) >= 1
      for: 1d
      labels:
        area: platform
        severity: notify
        team: atlas
        topic: observability
        cancel_if_outside_working_hours: "true"
    - alert: CriticalJobScrapingFailure
      annotations:
        dashboard: servicemonitors-details/servicemonitors-details
        description: '{{`Monitoring agents for cluster {{$labels.installation}}/{{$labels.cluster_id}} has failed to scrape all targets in {{$labels.job}} job.`}}'
        summary: Monitoring agent failed to scrape all targets in a job.
        opsrecipe: monitoring-job-scraping-failure/
      ## We ignore bastion hosts node exporters
      expr: |-
        (
          count(
            (
              up{job=~".*(apiserver|kube-controller-manager|kube-scheduler|node-exporter|kube-state-metrics).*"}
              or
              up{job="kubelet", metrics_path="/metrics"}
            ) == 0
          ) by (job, installation, cluster_id, provider, pipeline)
          /
          count(
            up{job=~".*(apiserver|kube-controller-manager|kube-scheduler|node-exporter|kube-state-metrics).*"}
            or
            up{job="kubelet", metrics_path="/metrics"}
          ) by (job, installation, cluster_id, provider, pipeline)
        ) >= 1
      for: 3d
      labels:
        area: platform
        severity: page
        team: atlas
        topic: observability
        cancel_if_outside_working_hours: "true"
        cancel_if_cluster_is_not_running_monitoring_agent: "true"
        cancel_if_cluster_status_creating: "true"
        cancel_if_cluster_status_deleting: "true"
There's no unit test for this one and its associated inhibition?
I might have forgotten them
I added alerts for those :)
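For reference, a unit test of the kind asked about above could look roughly like the following. This is only a sketch in promtool's rule-test format for the JobScrapingFailure rule from this diff, not the tests actually added in this PR. The rule file name, the job name, and the installation, cluster_id, provider and pipeline values are all hypothetical, and it assumes the Helm template has already been rendered to a plain rules file (a top-level `groups:` list without the `{{ ... }}` escaping).

```yaml
# Hypothetical promtool rule test; run with: promtool test rules <this-file>
rule_files:
  # Assumption: rendered output of monitoring-pipeline.rules.yml (groups only).
  - monitoring-pipeline.rules.rendered.yml

tests:
  - interval: 1m
    input_series:
      # A single target that is down for the whole test window (about 26h),
      # so count(up == 0) / count(up) is 1 for this job.
      - series: 'up{job="example", installation="myinstall", cluster_id="mycluster", provider="aws", pipeline="testing"}'
        values: "0+0x1600"
    alert_rule_test:
      # Before the 1d "for" duration has elapsed the alert is only pending,
      # so no firing alert is expected.
      - alertname: JobScrapingFailure
        eval_time: 23h
      # After more than 1d of continuous failure the alert should fire.
      - alertname: JobScrapingFailure
        eval_time: 25h
        exp_alerts:
          - exp_labels:
              area: platform
              severity: notify
              team: atlas
              topic: observability
              cancel_if_outside_working_hours: "true"
              job: example
              installation: myinstall
              cluster_id: mycluster
              provider: aws
              pipeline: testing
            exp_annotations:
              dashboard: servicemonitors-details/servicemonitors-details
              description: "Monitoring agents for cluster myinstall/mycluster has failed to scrape all targets in example job."
              summary: Monitoring agent failed to scrape all targets in a job.
              opsrecipe: monitoring-job-scraping-failure/
```

The expected labels combine the static labels on the rule with the grouping labels kept by `count(...) by (...)`. The expected annotations have to match the rendered rule exactly, so they would need adjusting if the repository's test harness rewrites any annotation before evaluation.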