Fix honeybadger alerts for mimir (#1174)
QuentinBisson authored May 14, 2024
1 parent 3b56325 commit 4081fb2
Showing 3 changed files with 8 additions and 7 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -23,6 +23,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

+ - Fix honeybadger alerts for mimir.
- Remove cilium entry from KAAS SLOs.
- Fix cert-manager rules for mimir.
- Fix operatorkit related alerts for mimir.
12 changes: 6 additions & 6 deletions helm/prometheus-rules/templates/alerting-rules/flux.rules.yml
@@ -117,9 +117,9 @@ spec:
1-sum_over_time(
(
(
- sum(increase(gotk_reconcile_duration_seconds_sum{namespace="flux-giantswarm", exported_namespace="flux-giantswarm",kind=~"Kustomization|HelmRelease"}[10m])) by (kind,name,cluster_id,installation)
+ sum(increase(gotk_reconcile_duration_seconds_sum{namespace="flux-giantswarm", exported_namespace="flux-giantswarm",kind=~"Kustomization|HelmRelease"}[10m])) by (kind, name, cluster_id, installation, pipeline, provider)
/
- sum(increase(gotk_reconcile_duration_seconds_count{namespace="flux-giantswarm", exported_namespace="flux-giantswarm",kind=~"Kustomization|HelmRelease"}[10m])) by (kind,name,cluster_id,installation)
+ sum(increase(gotk_reconcile_duration_seconds_count{namespace="flux-giantswarm", exported_namespace="flux-giantswarm",kind=~"Kustomization|HelmRelease"}[10m])) by (kind, name, cluster_id, installation, pipeline, provider)
)
>bool 360)[7d:10m])
/ (7*24*6) < 0.97
@@ -149,7 +149,7 @@ spec:
{{`Flux Image Automation Controller on {{ $labels.installation }} seems stuck.`}}
opsrecipe: flux-image-automation-stuck/
expr: |
- sum(irate(workqueue_unfinished_work_seconds{name="imageupdateautomation",cluster_type="management_cluster",namespace=~"flux-giantswarm|flux-system"}[15m])) > 0
+ sum(irate(workqueue_unfinished_work_seconds{name="imageupdateautomation",cluster_type="management_cluster",namespace=~"flux-giantswarm|flux-system"}[15m])) by (cluster_id, installation, pipeline, provider) > 0
for: 30m
labels:
area: empowerment
@@ -244,8 +244,8 @@ spec:
{{`Flux controller {{ $labels.controller }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is reconciling very slowly.`}}
opsrecipe: fluxcd-slow-reconciliation/
expr: |
- (sum(rate(controller_runtime_reconcile_time_seconds_sum{app=~".*flux.*", namespace!~".*giantswarm.*"}[5m])) by (installation, cluster_id, controller) /
- sum(rate(controller_runtime_reconcile_time_seconds_count{app=~".*flux.*", namespace!~".*giantswarm.*"}[5m])) by (installation, cluster_id, controller)) > 60
+ (sum(rate(controller_runtime_reconcile_time_seconds_sum{app=~".*flux.*", namespace!~".*giantswarm.*"}[5m])) by (controller, cluster_id, installation, pipeline, provider) /
+ sum(rate(controller_runtime_reconcile_time_seconds_count{app=~".*flux.*", namespace!~".*giantswarm.*"}[5m])) by (controller, cluster_id, installation, pipeline, provider)) > 60
for: 10m
labels:
area: empowerment
@@ -259,7 +259,7 @@ spec:
{{`Flux artifacts are stuck in work queue for over 1 hour and are not being reconciled.`}}
opsrecipe: fluxcd-workqueue-too-long/
expr: |
- sum by (name, namespace) (workqueue_unfinished_work_seconds{namespace=~"flux-giantswarm|flux-system"}) > 3600.0
+ sum by (cluster_id, installation, name, namespace, pipeline, provider) (workqueue_unfinished_work_seconds{namespace=~"flux-giantswarm|flux-system"}) > 3600.0
for: 10m
labels:
area: empowerment
@@ -14,7 +14,7 @@ spec:
annotations:
description: '{{`Helm release Secret count too high.`}}'
opsrecipe: clean-up-secrets/
- expr: sum(kube_secret_info{namespace=~"giantswarm|kube-system|monitoring",secret=~"sh.helm.+"}) by (cluster_id) > 1000
+ expr: sum(kube_secret_info{namespace=~"giantswarm|kube-system|monitoring",secret=~"sh.helm.+"}) by (cluster_id, installation, pipeline, provider) > 1000
for: 15m
labels:
area: managedservices
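
Note on the pattern in this diff: every changed expression widens its `by (...)` clause to include `cluster_id`, `installation`, `pipeline`, and `provider`. The commit does not state why, but a plausible reading is that `sum by` discards every label not listed, so alerts aggregated only by `cluster_id` would reach the alert pipeline without the other labels. The sketch below is a small Python model of that grouping behaviour, not Prometheus code; the `sum_by` helper and the sample label values are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Model PromQL `sum by (keep...)`: group on the listed labels only.

    Every label not named in `keep` is dropped from the result, and labels
    that are missing on a sample are treated as empty and omitted.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return [
        ({name: val for name, val in key if val}, total)
        for key, total in totals.items()
    ]


# Hypothetical kube_secret_info-style samples (label values are made up).
samples = [
    ({"cluster_id": "c1", "installation": "inst-a", "pipeline": "stable",
      "provider": "aws", "secret": "sh.helm.release.v1.a"}, 1.0),
    ({"cluster_id": "c1", "installation": "inst-a", "pipeline": "stable",
      "provider": "aws", "secret": "sh.helm.release.v1.b"}, 1.0),
]

# Old rule shape: only cluster_id survives the aggregation.
before = sum_by(samples, ["cluster_id"])

# New rule shape: the extra labels are kept on the resulting series.
after = sum_by(samples, ["cluster_id", "installation", "pipeline", "provider"])

print(before)
print(after)
```

In this model the summed value is the same in both cases; only the label set attached to the result differs, which is what matters for alert annotations such as `{{ $labels.installation }}` and for any label-based routing.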
