Make Atlas rules compatible with Mimir #1102
Conversation
helm/prometheus-rules/templates/alerting-rules/prometheus-meta-operator.rules.yml
helm/prometheus-rules/templates/alerting-rules/prometheus.rules.yml
@@ -35,9 +35,9 @@ spec:
description: This alert checks that we have less than 10% errors on Loki requests.
opsrecipe: loki/
expr: |
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (cluster_id, namespace, job, route)
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route)
Currently the Loki ServiceMonitor scrapes every 15s, so this change is not mandatory.
But IMO we should change the scrapeInterval to 1m, so it's safer to have this change in place now...
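As a minimal sketch of why the wider range matters once the scrape interval grows (using the Loki metric from the diff above; the scrape timings are the assumption discussed in this thread): `rate()` needs at least two samples inside its range window.

```promql
# With a 1m scrape interval, a [1m] range often contains only a single sample,
# so rate() returns nothing and the alert expression silently goes empty.
rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])

# A [2m] range always spans at least two scrapes at a 1m interval,
# so the error-rate query keeps returning data.
rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])
```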
label_replace(
kube_pod_container_status_running{container="prometheus", namespace!="{{ .Values.managementCluster.name }}-prometheus", namespace=~".*-prometheus"},
"cluster_id", "$2", "pod", "(prometheus-)(.+)(-.+)"
)
)
) + (
sum by (cluster_name) (
) or (
That query should not work, as cluster_name does not exist.
Why would it not work? cluster_name comes from the label_replace, so it did work, and we tested it with herve. Also, why was this changed from a + to an or?
label_replace is missing in
sum by (cluster_name) ( capi_cluster_status_phase{phase!="Deleting"} )
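A rough sketch of what the missing label_replace could look like (this is an assumption, not the PR's actual query; it presumes capi_cluster_status_phase exposes the cluster name in a `name` label):

```promql
sum by (cluster_name) (
  # Hypothetical: copy the CAPI `name` label into `cluster_name` so this branch
  # can be combined with the vintage branch, which groups by the same label.
  label_replace(capi_cluster_status_phase{phase!="Deleting"}, "cluster_name", "$1", "name", "(.*)")
)
```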
`+` does not behave like `or`: if the first part returns nothing and the second part returns something, the final result will be empty.
That is not what we want; the first case is for vintage and the second for CAPI, so we need to use `or`.
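A minimal sketch of the difference, with placeholder metric names (`vintage_metric` and `capi_metric` are not real metrics):

```promql
# `+` uses one-to-one vector matching: any series without a partner on the other
# side is dropped, so if either operand is empty the whole result is empty.
sum by (cluster_id) (vintage_metric) + sum by (cluster_id) (capi_metric)

# `or` keeps every series from the left side plus the right-side series that have
# no matching left-side label set, so whichever flavor (vintage or CAPI) has data wins.
sum by (cluster_id) (vintage_metric) or sum by (cluster_id) (capi_metric)
```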
description: This alert checks that the amount of failed requests is below 10% for promtail
opsrecipe: promtail-requests-are-failing/
expr: |
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (cluster_id, namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (cluster_id, namespace, job, route, instance) > 10
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[2m])) by (cluster_id, installation, provider, pipeline, namespace, job, route, instance) > 10
Fix the rate interval, as the Promtail ServiceMonitor is scraping every minute.
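For reference, the scrape interval this comment refers to would be set on the ServiceMonitor roughly like this (a sketch, not the actual manifest from this repo; the name, selector, and port are made up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: promtail                           # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: promtail     # hypothetical selector
  endpoints:
    - port: http-metrics                   # hypothetical port name
      interval: 60s                        # 1m scrape interval, hence the [2m] rate window above
```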
@@ -370,15 +370,15 @@ spec:
# -- Managed Prometheus
# Set SLO request to always be 1 when a managed prometheus target is present.
- expr: (up{app="prometheus-operator-app-prometheus",container="prometheus"}*0)+1
- expr: (up{app=~"kube-prometheus-stack-prometheus-operator|prometheus-operator-app-prometheus",container=~"kube-prometheus-stack|prometheus"}*0)+1
Fix rules since prometheus-operator has changed its name.
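For readers unfamiliar with the `*0)+1` idiom in the diff above: multiplying `up` by 0 and adding 1 turns every series that exists into a constant 1, so the SLO request counter is 1 whenever a managed prometheus target is present, whether it is up or down. A toy sketch (the `job` selector is a placeholder):

```promql
# up is 1 for healthy targets and 0 for failing ones; *0 flattens both to 0,
# and +1 then yields a constant 1 per present target.
(up{job="example"} * 0) + 1
```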
@@ -388,15 +388,15 @@ spec:
# -- Managed Alertmanager
# Set SLO request to always be 1 when a managed alertmanager target is present.
- expr: (up{app="prometheus-operator-app-alertmanager", container="alertmanager"}*0)+1
- expr: (up{app=~"alertmanager|prometheus-operator-app-alertmanager",container="alertmanager"}*0)+1
Fix rules since alertmanager has changed its name.
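The fix relies on the `=~` regex matcher so the rule matches both the old and the new app label during the rename (values taken from the diff above):

```promql
# Matches targets labelled either with the new name `alertmanager`
# or the old `prometheus-operator-app-alertmanager`.
up{app=~"alertmanager|prometheus-operator-app-alertmanager", container="alertmanager"}
```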
LGTM if tested
Yes, tested as much as possible.
Before adding a new alerting rule to this repository, you should consider creating SLO rules instead.
SLOs help you both increase the quality of your monitoring and reduce alert noise.
Towards: giantswarm/roadmap#3318
This PR makes Atlas rules compatible with Mimir by:
- adding the `cluster_id`, `installation`, `provider`, `pipeline` labels
- adjusting the `absent` functions

Checklist
- oncall-kaas-cloud GitHub group