
Make Atlas rules compatible with Mimir #1102

Merged: 5 commits merged into master on Apr 8, 2024
Conversation

@marieroque (Contributor) commented on Apr 4, 2024

Before adding a new alerting rule to this repository, you should consider creating SLO rules instead.
SLOs help you both increase the quality of your monitoring and reduce alert noise.


Towards: giantswarm/roadmap#3318

This PR makes Atlas rules compatible with Mimir by:

  • adding the labels cluster_id, installation, provider, and pipeline to aggregation functions (see the sketch below)
  • rewriting some of the absent() functions
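
For illustration, here is a minimal sketch of the first point using a placeholder metric (the real changes are visible in the diffs below); the extra labels are kept through the aggregations, presumably so that series coming from different installations are not mixed together once everything lives in Mimir:

# Before: the aggregation drops installation, provider and pipeline.
sum(rate(some_requests_total[2m])) by (cluster_id, namespace)

# After: the external labels are carried through the aggregation.
sum(rate(some_requests_total[2m])) by (cluster_id, installation, provider, pipeline, namespace)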


@marieroque requested a review from a team as a code owner on April 4, 2024 14:11
@marieroque marked this pull request as draft on April 4, 2024 14:13
@marieroque marked this pull request as ready for review on April 4, 2024 15:07
@@ -35,9 +35,9 @@ spec:
description: This alert checks that we have less than 10% errors on Loki requests.
opsrecipe: loki/
expr: |
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (cluster_id, namespace, job, route)
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route)

@marieroque (Contributor, Author) commented:
Currently the Loki ServiceMonitor scrapes every 15s, so this change is not strictly required.
But IMO we should change the scrapeInterval to 1m, so it's safer to make this change now...
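
As a general Prometheus note (not stated in the PR itself): rate() needs at least two samples inside its window, so the window should be at least roughly twice the scrape interval. With a 1m scrapeInterval, a [1m] window can end up with a single sample and return nothing, while [2m] stays safe:

# With scrapeInterval: 15s a 1m window holds several samples, so this still works today.
# With scrapeInterval: 1m it may hold only one sample, and rate() then returns no result.
rate(loki_request_duration_seconds_count[1m])

# A 2m window always covers at least two samples at a 1m scrape interval.
rate(loki_request_duration_seconds_count[2m])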

label_replace(
kube_pod_container_status_running{container="prometheus", namespace!="{{ .Values.managementCluster.name }}-prometheus", namespace=~".*-prometheus"},
"cluster_id", "$2", "pod", "(prometheus-)(.+)(-.+)"
)
)
) + (
sum by (cluster_name) (
) or (

@marieroque (Contributor, Author) commented:

That query should not work, as cluster_name does not exist.

Contributor commented:

Why would it not work? cluster_name came from the label_replace, so it did work, and we tested it with Herve. Also, why was this changed from a + to an or?

@marieroque (Contributor, Author) replied:

label_replace is missing in
sum by (cluster_name) ( capi_cluster_status_phase{phase!="Deleting"} )
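
For illustration only, a sketch of the kind of wrapping this comment asks for, assuming the CAPI cluster name is exposed in a "name" label on capi_cluster_status_phase (the actual source label may differ):

sum by (cluster_name) (
  label_replace(
    capi_cluster_status_phase{phase!="Deleting"},
    "cluster_name", "$1", "name", "(.*)"
  )
)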

@marieroque (Contributor, Author) commented on Apr 5, 2024:

+ does not behave like or: binary arithmetic only keeps series that match on both sides, so if the first part returns nothing and the second part returns something, the final result is empty.
That's not what we want: the first part is for vintage and the second for CAPI, so we need to use or.
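
As an illustrative sketch of the pattern with placeholder metric names (not the exact rule from this PR):

# With "+": the result is empty unless both sides return a matching series.
sum(vintage_cluster_metric) + sum(capi_cluster_metric)

# With "or": the union of both sides, so the result survives when only one side exists.
sum(vintage_cluster_metric) or sum(capi_cluster_metric)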

description: This alert checks that the amount of failed requests is below 10% for promtail
opsrecipe: promtail-requests-are-failing/
expr: |
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (cluster_id, namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (cluster_id, namespace, job, route, instance) > 10
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route, instance) > 10

@marieroque (Contributor, Author) commented:

Fix the rate interval, as the Promtail ServiceMonitor scrapes every minute.

@@ -370,15 +370,15 @@ spec:

# -- Managed Prometheus
# Set SLO request to always be 1 when a managed prometheus target is present.
- expr: (up{app="prometheus-operator-app-prometheus",container="prometheus"}*0)+1
- expr: (up{app=~"kube-prometheus-stack-prometheus-operator|prometheus-operator-app-prometheus",container=~"kube-prometheus-stack|prometheus"}*0)+1

@marieroque (Contributor, Author) commented:

Fix rules since prometheus-operator has changed its name.

@@ -388,15 +388,15 @@ spec:

# -- Managed Alertmanager
# Set SLO request to always be 1 when a managed alertmanager target is present.
- expr: (up{app="prometheus-operator-app-alertmanager", container="alertmanager"}*0)+1
- expr: (up{app=~"alertmanager|prometheus-operator-app-alertmanager",container="alertmanager"}*0)+1

@marieroque (Contributor, Author) commented:

Fix rules since alertmanager has changed its name.

@QuentinBisson (Contributor) left a review comment:

LGTM if tested

@marieroque (Contributor, Author) replied:

Yes, tested as much as possible.
Let's merge it on Monday or Tuesday...

@QuentinBisson merged commit 953aa3a into master on Apr 8, 2024
5 checks passed
@QuentinBisson deleted the review-rules-1 branch on April 8, 2024 08:08