Fix monitoring alerts #1050
@@ -31,7 +31,7 @@ spec:
    - alert: PrometheusOperatorListErrors
      annotations:
        description: Errors while performing List operations in controller {{`{{`}}$labels.controller{{`}}`}} in {{`{{`}}$labels.namespace{{`}}`}} namespace.
      expr: (sum by (controller,namespace) (rate(prometheus_operator_list_operations_failed_total{app="prometheus-operator",namespace="{{ .Values.namespace }}"}[10m])) / sum by (controller,namespace) (rate(prometheus_operator_list_operations_total{app="prometheus-operator",namespace="{{ .Values.namespace }}"}[10m]))) > 0.4
Why did we have this namespace filter in the first place?
My best guess is that it was there to prevent monitoring customer operators. If that's the case, what prevents us from doing that now?
Two things: first, we own this app, so technically we should have been notified anyway. Second, I think we added that back in the day from the mixins, but we can adjust them if they page too much.
Actually I think I remember adding some similar exclusions because customer-managed operators were failing too much.
But all right, fine with me, let's try.
Oh, maybe you did, but still, the namespace is not monitoring anymore :D
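For reference, a rough sketch of what the rule discussed in this thread might look like with the namespace matcher dropped, so that the error ratio is evaluated for every prometheus-operator instance regardless of namespace. This is an assumption drawn from the discussion above, not the actual change in this PR; the alert name, metric names, labels, 10m window and 0.4 threshold are copied from the hunk under review, and only the removal of the matcher and the multi-line layout are hypothetical.

```yaml
# Sketch only: assumes the fix is to drop the namespace matcher from both selectors.
# Everything else is copied from the hunk under review.
- alert: PrometheusOperatorListErrors
  annotations:
    description: Errors while performing List operations in controller {{`{{`}}$labels.controller{{`}}`}} in {{`{{`}}$labels.namespace{{`}}`}} namespace.
  expr: |
    (
      sum by (controller, namespace) (rate(prometheus_operator_list_operations_failed_total{app="prometheus-operator"}[10m]))
      /
      sum by (controller, namespace) (rate(prometheus_operator_list_operations_total{app="prometheus-operator"}[10m]))
    ) > 0.4
```

Because the result is still grouped by controller and namespace, the `$labels.namespace` value in the description would continue to identify which operator instance is failing, including customer-managed ones if they carry the same `app` label.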
Before adding a new alerting rule to this repository, you should consider creating an SLO rule instead.
SLOs help you both increase the quality of your monitoring and reduce alert noise.
Towards: giantswarm/roadmap#3157
This PR fixes remote write and prometheus-operator alerts for mimir
Checklist
oncall-kaas-cloud
GitHub group).