
Make Atlas rules compatible with Mimir #1102

Merged: 5 commits merged into master on Apr 8, 2024
Conversation

@marieroque (Contributor) commented on Apr 4, 2024

Before adding a new alerting rule to this repository, you should consider creating SLO rules instead.
SLOs help you both increase the quality of your monitoring and reduce alert noise.


Towards: giantswarm/roadmap#3318

This PR makes Atlas rules compatible with Mimir by:

  • adding the labels cluster_id, installation, provider, and pipeline to aggregation functions (see the sketch below)
  • rewriting some of the absent() functions
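
For illustration, here is a minimal sketch of the first point using a placeholder metric (the real changes are visible in the diffs below); the extra labels are kept through the aggregations, presumably so that series coming from different installations are not mixed together once everything lives in Mimir:

# Before: the aggregation drops installation, provider and pipeline.
sum(rate(some_requests_total[2m])) by (cluster_id, namespace)

# After: the external labels are carried through the aggregation.
sum(rate(some_requests_total[2m])) by (cluster_id, installation, provider, pipeline, namespace)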


@marieroque requested a review from a team as a code owner on April 4, 2024 14:11
@marieroque marked this pull request as draft on April 4, 2024 14:13
@marieroque marked this pull request as ready for review on April 4, 2024 15:07
@@ -35,9 +35,9 @@ spec:
description: This alert checks that we have less than 10% errors on Loki requests.
opsrecipe: loki/
expr: |
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (cluster_id, namespace, job, route)
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route)

@marieroque (Contributor, Author) commented:
Currently the Loki ServiceMonitor scrapes every 15s, so this change is not strictly required.
But IMO we should change the scrapeInterval to 1m, so it's safer to make this change now...
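
As a general Prometheus note (not stated in the PR itself): rate() needs at least two samples inside its window, so the window should be at least roughly twice the scrape interval. With a 1m scrapeInterval, a [1m] window can end up with a single sample and return nothing, while [2m] stays safe:

# With scrapeInterval: 15s a 1m window holds several samples, so this still works today.
# With scrapeInterval: 1m it may hold only one sample, and rate() then returns no result.
rate(loki_request_duration_seconds_count[1m])

# A 2m window always covers at least two samples at a 1m scrape interval.
rate(loki_request_duration_seconds_count[2m])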

label_replace(
kube_pod_container_status_running{container="prometheus", namespace!="{{ .Values.managementCluster.name }}-prometheus", namespace=~".*-prometheus"},
"cluster_id", "$2", "pod", "(prometheus-)(.+)(-.+)"
)
)
) + (
sum by (cluster_name) (
) or (

@marieroque (Contributor, Author) commented:

That query should not work, as cluster_name does not exist.

Contributor commented:

Why would it not work? cluster_name came from the label_replace, so it did work, and we tested it with Herve. Also, why was this changed from a + to an or?

@marieroque (Contributor, Author) replied:

label_replace is missing in
sum by (cluster_name) ( capi_cluster_status_phase{phase!="Deleting"} )
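
For illustration only, a sketch of the kind of wrapping this comment asks for, assuming the CAPI cluster name is exposed in a "name" label on capi_cluster_status_phase (the actual source label may differ):

sum by (cluster_name) (
  label_replace(
    capi_cluster_status_phase{phase!="Deleting"},
    "cluster_name", "$1", "name", "(.*)"
  )
)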

@marieroque (Contributor, Author) commented on Apr 5, 2024:

+ does not behave like or: binary arithmetic only keeps series that match on both sides, so if the first part returns nothing and the second part returns something, the final result is empty.
That's not what we want: the first part is for vintage and the second for CAPI, so we need to use or.
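
As an illustrative sketch of the pattern with placeholder metric names (not the exact rule from this PR):

# With "+": the result is empty unless both sides return a matching series.
sum(vintage_cluster_metric) + sum(capi_cluster_metric)

# With "or": the union of both sides, so the result survives when only one side exists.
sum(vintage_cluster_metric) or sum(capi_cluster_metric)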

description: This alert checks that the amount of failed requests is below 10% for promtail
opsrecipe: promtail-requests-are-failing/
expr: |
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (cluster_id, namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (cluster_id, namespace, job, route, instance) > 10
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route, instance) > 10

@marieroque (Contributor, Author) commented:

Fix the rate interval, as the Promtail ServiceMonitor scrapes every minute.

@@ -370,15 +370,15 @@ spec:

# -- Managed Prometheus
# Set SLO request to always be 1 when a managed prometheus target is present.
- expr: (up{app="prometheus-operator-app-prometheus",container="prometheus"}*0)+1
- expr: (up{app=~"kube-prometheus-stack-prometheus-operator|prometheus-operator-app-prometheus",container=~"kube-prometheus-stack|prometheus"}*0)+1

@marieroque (Contributor, Author) commented:

Fix rules since prometheus-operator has changed its name.

@@ -388,15 +388,15 @@ spec:

# -- Managed Alertmanager
# Set SLO request to always be 1 when a managed alertmanager target is present.
- expr: (up{app="prometheus-operator-app-alertmanager", container="alertmanager"}*0)+1
- expr: (up{app=~"alertmanager|prometheus-operator-app-alertmanager",container="alertmanager"}*0)+1

@marieroque (Contributor, Author) commented:

Fix rules since alertmanager has changed its name.

@QuentinBisson (Contributor) left a review comment:

LGTM if tested

@marieroque (Contributor, Author) replied:

Yes, tested as much as possible.
Let's merge it on Monday or Tuesday...

@QuentinBisson merged commit 953aa3a into master on Apr 8, 2024
5 checks passed
@QuentinBisson deleted the review-rules-1 branch on April 8, 2024 08:08