Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix monitoring alerts #1050

Merged
merged 3 commits into from
Mar 5, 2024
Merged

Fix monitoring alerts #1050

merged 3 commits into from
Mar 5, 2024

Conversation

QuentinBisson
Copy link
Contributor

Before adding a new alerting rule into this repository you should consider creating an SLO rules instead.
SLO helps you both increase the quality of your monitoring and reduce the alert noise.


Towards: giantswarm/roadmap#3157

This PR fixes remote write and prometheus-operator alerts for mimir

Checklist

@QuentinBisson QuentinBisson self-assigned this Mar 5, 2024
@QuentinBisson QuentinBisson requested a review from a team as a code owner March 5, 2024 14:02
@QuentinBisson QuentinBisson enabled auto-merge (squash) March 5, 2024 14:02
@@ -31,7 +31,7 @@ spec:
- alert: PrometheusOperatorListErrors
annotations:
description: Errors while performing List operations in controller {{`{{`}}$labels.controller{{`}}`}} in {{`{{`}}$labels.namespace{{`}}`}} namespace.
expr: (sum by (controller,namespace) (rate(prometheus_operator_list_operations_failed_total{app="prometheus-operator",namespace="{{ .Values.namespace }}"}[10m])) / sum by (controller,namespace) (rate(prometheus_operator_list_operations_total{app="prometheus-operator",namespace="{{ .Values.namespace }}"}[10m]))) > 0.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we have this namespace filter in the first place?
My best guess is to prevent monitoring customer operators. If that's the case, what prevents us from that now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 things, first we own this app so we should technically have been notified anyway. I think we added that back in the day from mixins but we can adjust them if they page too much

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think I remember adding some similar exclusions because customer-managed operators were failing too much.
But all right, fine with me, let's try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh maybe you did but still the namespace is not monitoring anymore :D

@QuentinBisson QuentinBisson merged commit e6b011c into master Mar 5, 2024
5 checks passed
@QuentinBisson QuentinBisson deleted the fix-monitoring-alerts branch March 5, 2024 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants