fix dashboard links in alertmanager and mimir rules #1367

Merged (1 commit) on Sep 19, 2024
CHANGELOG.md (4 additions, 0 deletions)
@@ -11,6 +11,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Upgrade Alloy to 0.5.2 which brings no value to this repo.

+### Fixed
+
+- Dashboard links in alertmanager and mimir rules
+
## [4.15.2] - 2024-09-17

### Fixed
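For orientation, the shape of the change is sketched below as a minimal PrometheusRule, assembled from values visible in this diff; the apiVersion, kind, metadata, and group name are illustrative assumptions, not copied from the repository. The dashboard key moves out of the rule's labels and into its annotations, so it travels as alert metadata that notification templates can turn into a Grafana link, instead of being part of the alert's label set.

# Illustrative sketch only, not a file from this PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alertmanager.rules            # assumed name
spec:
  groups:
    - name: alertmanager              # assumed group name
      rules:
        - alert: AlertmanagerPageNotificationsFailing
          annotations:
            dashboard: alertmanager-overview/alertmanager-overview   # moved here by this PR
            description: 'AlertManager {{ $labels.integration }} notifications are failing.'   # repo wraps this in Helm escaping
            opsrecipe: alert-manager-notifications-failing/
          expr: rate(alertmanager_notifications_failed_total{integration="opsgenie", cluster_type="management_cluster"}[20m]) > 0
          for: 30m
          labels:
            area: platform
            # dashboard: ...          <- previously lived here as a label
            severity: notify
            team: atlas
            topic: monitoring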
Alertmanager alerting rules (2 additions, 2 deletions)
@@ -12,6 +12,7 @@ spec:
rules:
- alert: AlertmanagerNotifyNotificationsFailing
annotations:
+dashboard: alertmanager-overview/alertmanager-overview
description: '{{`AlertManager {{ $labels.integration }} notifications are failing.`}}'
opsrecipe: alert-manager-notifications-failing/
# Interval = 20m because currently AlertManager config set `group_interval=15m` that means that if a notification fails, it will be retried after 15m
@@ -21,21 +22,20 @@ spec:
for: 45m
labels:
area: platform
-dashboard: alertmanager-overview/alertmanager-overview
severity: page
team: atlas
topic: monitoring
cancel_if_outside_working_hours: "true"
- alert: AlertmanagerPageNotificationsFailing
annotations:
+dashboard: alertmanager-overview/alertmanager-overview
description: '{{`AlertManager {{ $labels.integration }} notifications are failing.`}}'
opsrecipe: alert-manager-notifications-failing/
# Here, we decide to notify after 2 successive failures (opsgenie notification), so we need to wait 2*15m = 30m before notifying.
expr: rate(alertmanager_notifications_failed_total{integration="opsgenie", cluster_type="management_cluster"}[20m]) > 0
for: 30m
labels:
area: platform
-dashboard: alertmanager-overview/alertmanager-overview
severity: notify
team: atlas
topic: monitoring
Mimir alerting rules (4 additions, 4 deletions)
@@ -29,6 +29,7 @@ spec:
# This alert will not page for the prometheus-buddy.
- alert: MimirRestartingTooOften
annotations:
+dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
description: '{{`Mimir containers are restarting too often.`}}'
opsrecipe: mimir/
expr: |
@@ -41,12 +42,12 @@
# This label is used to ensure the alert go through even for non-stable installations
all_pipelines: "true"
cancel_if_outside_working_hours: "true"
-dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
severity: page
team: atlas
topic: observability
- alert: MimirComponentDown
annotations:
+dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
description: '{{`Mimir component : {{ $labels.service }} is down.`}}'
opsrecipe: mimir/
expr: count(up{job=~"mimir/.*", container!="prometheus"} == 0) by (cluster_id, installation, provider, pipeline, service) > 0
@@ -57,7 +58,6 @@
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_outside_working_hours: "true"
-dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
severity: page
team: atlas
topic: observability
@@ -78,6 +78,7 @@ spec:
topic: observability
- alert: MimirRulerEventsFailed
annotations:
+dashboard: 631e15d5d85afb2ca8e35d62984eeaa0/mimir-ruler
description: 'Mimir ruler is failing to process PrometheusRules.'
opsrecipe: mimir/
expr: rate(mimir_rules_events_failed_total{cluster_type="management_cluster", namespace="mimir"}[5m]) > 0
@@ -88,7 +89,6 @@
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_outside_working_hours: "true"
-dashboard: 631e15d5d85afb2ca8e35d62984eeaa0/mimir-ruler
severity: page
team: atlas
topic: observability
@@ -168,6 +168,7 @@ spec:
topic: observability
- alert: MimirCompactorFailedCompaction
annotations:
+dashboard: 09a5c49e9cdb2f2b24c6d184574a07fd/mimir-compactor-resources
description: 'Mimir compactor has been failing its compactions for 2 hours.'
opsrecipe: mimir/
# Query is based on the following upstream mixin alerting rule : https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml#L858
@@ -178,7 +179,6 @@
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_outside_working_hours: "true"
-dashboard: 09a5c49e9cdb2f2b24c6d184574a07fd/mimir-compactor-resources
severity: page
team: atlas
topic: observability
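The dashboard values follow Grafana's URL scheme, where a dashboard lives at https://<grafana-host>/d/<uid>/<slug>, so an annotation such as ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview can be appended to a /d/ base URL when the notification is rendered. As an illustration of how a notification template could consume the annotation (the receiver name, channel, and Grafana host below are hypothetical, not taken from this repository or this PR), an Alertmanager receiver could look like:

# Hypothetical Alertmanager configuration snippet, for illustration only.
receivers:
  - name: slack-atlas                  # hypothetical receiver name
    slack_configs:
      - channel: '#alerts'             # hypothetical channel
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.description }}
          {{ if .Annotations.dashboard }}Dashboard: https://grafana.example.net/d/{{ .Annotations.dashboard }}{{ end }}
          {{ end }}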
Mimir alerting rules unit tests (4 additions, 4 deletions)
@@ -79,11 +79,11 @@ tests:
cancel_if_cluster_status_updating: "true"
cancel_if_outside_working_hours: "true"
cluster_id: gauss
-dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
installation: gauss
provider: aws
pipeline: testing
exp_annotations:
+dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
description: "Mimir component : mimir-ingester is down."
opsrecipe: "mimir/"
- interval: 1m
@@ -134,13 +134,13 @@ tests:
cancel_if_cluster_status_updating: "true"
cluster_id: golem
cluster_type: management_cluster
-dashboard: 631e15d5d85afb2ca8e35d62984eeaa0/mimir-ruler
installation: golem
namespace: mimir
severity: page
team: atlas
topic: observability
exp_annotations:
+dashboard: 631e15d5d85afb2ca8e35d62984eeaa0/mimir-ruler
description: "Mimir ruler is failing to process PrometheusRules."
opsrecipe: "mimir/"
- alertname: MimirRulerEventsFailed
@@ -163,12 +163,12 @@ tests:
cancel_if_outside_working_hours: "true"
cluster_type: management_cluster
container: mimir-ingester
-dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
namespace: mimir
severity: page
team: atlas
topic: observability
exp_annotations:
+dashboard: ffcd83628d7d4b5a03d1cafd159e6c9c/mimir-overview
description: Mimir containers are restarting too often.
opsrecipe: "mimir/"
- alertname: MimirRestartingTooOften
@@ -405,7 +405,6 @@ tests:
cancel_if_cluster_status_updating: "true"
cancel_if_outside_working_hours: "true"
cluster_id: golem
-dashboard: 09a5c49e9cdb2f2b24c6d184574a07fd/mimir-compactor-resources
installation: "golem"
pipeline: "testing"
provider: "capa"
@@ -414,6 +413,7 @@
team: atlas
topic: observability
exp_annotations:
+dashboard: 09a5c49e9cdb2f2b24c6d184574a07fd/mimir-compactor-resources
description: Mimir compactor has been failing its compactions for 2 hours.
opsrecipe: "mimir/"
- alertname: MimirCompactorFailedCompaction
Alertmanager alerting rules unit tests (2 additions, 2 deletions)
@@ -25,12 +25,12 @@ tests:
area: platform
cancel_if_outside_working_hours: "true"
cluster_type: management_cluster
-dashboard: alertmanager-overview/alertmanager-overview
integration: slack
severity: page
team: atlas
topic: monitoring
exp_annotations:
+dashboard: alertmanager-overview/alertmanager-overview
description: "AlertManager slack notifications are failing."
opsrecipe: alert-manager-notifications-failing/
- alertname: AlertmanagerNotifyNotificationsFailing
@@ -52,12 +52,12 @@
- exp_labels:
area: platform
cluster_type: management_cluster
-dashboard: alertmanager-overview/alertmanager-overview
integration: opsgenie
severity: notify
team: atlas
topic: monitoring
exp_annotations:
+dashboard: alertmanager-overview/alertmanager-overview
description: "AlertManager opsgenie notifications are failing."
opsrecipe: alert-manager-notifications-failing/
- alertname: AlertmanagerPageNotificationsFailing