Add ops recipe for flux suspended for too long (#1133)
uvegla authored Apr 18, 2024
1 parent d3806b6 commit 667f14a
Showing 5 changed files with 50 additions and 18 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

+### Added
+
+- Add ops recipe for flux being suspended for too long alert.
+
## [3.11.1] - 2024-04-17

### Added
2 changes: 1 addition & 1 deletion README.md
@@ -185,7 +185,7 @@ tests:
```
Let's breakdown the above example:
-* For the first input series, the prometheus timesies returns an `empty query result` for 20 minutes (20*interval), then it is returning the value `1` for 20 minutes. Finally, it is returning the value `0` for 20 minutes.
+* For the first input series, the prometheus timeseries returns an `empty query result` for 20 minutes (20*interval), then it is returning the value `1` for 20 minutes. Finally, it is returning the value `0` for 20 minutes.
This is a good example of an input series for testing an `up` query.
* The second series introduce a timeseries which first returns a `0` value and which adds `600` every minutes (=interval) for 40 minutes. After 40 minutes it has reached a value of `24000` (600x40) and goes on by adding `400` every minutes for 40 more minutes.
This is a good example of an input series for testing a `range` query.
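
The series shorthand described in the bullets above can be sketched in Python. This is a simplified illustration of promtool's expanding notation, not promtool's own parser: the function name `expand_series` is hypothetical, only the `_`, `_xn`, `axn` and `a+bxn` / `a-bxn` forms are handled, and the example strings are assumptions chosen to match the prose rather than values copied from the README.

```python
from typing import List, Optional


def expand_series(notation: str) -> List[Optional[float]]:
    """Expand a promtool-style `values` string into one sample per interval.

    None stands for an empty sample (`_`). Hypothetical helper, simplified.
    """
    samples: List[Optional[float]] = []
    for token in notation.split():
        if "x" not in token:
            # A single literal sample, or one empty sample.
            samples.append(None if token == "_" else float(token))
            continue
        head, repeat = token.rsplit("x", 1)
        count = int(repeat)
        if head == "_":
            # '_xn' yields exactly n empty samples.
            samples.extend([None] * count)
            continue
        # Look for a step sign after the first character ('a+b' or 'a-b').
        sign_pos = max(head.rfind("+"), head.rfind("-"))
        if sign_pos > 0:
            start, step = float(head[:sign_pos]), float(head[sign_pos:])
        else:
            # 'axn' is shorthand for 'a+0xn'.
            start, step = float(head), 0.0
        # 'a+bxn' yields n+1 samples: a, a+b, ..., a+n*b.
        samples.extend(start + i * step for i in range(count + 1))
    return samples


# Assumed series matching the second bullet: start at 0, add 600 per interval.
counter = expand_series("0+600x40")
print(len(counter), counter[-1])
```

Note that `a+bxn` produces `n+1` samples while `_xn` produces `n`, which is why the counter reaches `24000` after 40 intervals.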
33 changes: 17 additions & 16 deletions helm/prometheus-rules/templates/alerting-rules/flux.rules.yml
@@ -33,7 +33,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", namespace="flux-giantswarm", exported_namespace=~".*giantswarm.*"} > 0
for: 10m
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: "true"
severity: page
team: honeybadger
@@ -48,7 +48,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="workload_cluster", organization="giantswarm"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
severity: page
cancel_if_outside_working_hours: "true"
team: honeybadger
@@ -63,7 +63,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="Kustomization", cluster_type="management_cluster", namespace="flux-giantswarm", exported_namespace=~".*giantswarm.*"} > 0
for: 20m
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: "true"
severity: page
team: honeybadger
@@ -76,7 +76,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="Kustomization", cluster_type="workload_cluster", organization="giantswarm"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
severity: page
cancel_if_outside_working_hours: "true"
team: honeybadger
@@ -89,7 +89,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind=~"GitRepository|HelmRepository|Bucket", cluster_type="management_cluster", namespace="flux-giantswarm", exported_namespace=~".*giantswarm.*"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: "true"
severity: page
team: honeybadger
@@ -102,7 +102,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind=~"GitRepository|HelmRepository|Bucket", cluster_type="workload_cluster", organization="giantswarm"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
severity: page
cancel_if_outside_working_hours: "true"
team: honeybadger
@@ -125,7 +125,7 @@ spec:
/ (7*24*6) < 0.97
for: 10m
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: "true"
severity: page
team: honeybadger
@@ -134,10 +134,11 @@ spec:
annotations:
description: |-
{{`Flux {{ $labels.kind }} {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }} has been suspended for 24h.`}}
+opsrecipe: fluxcd-suspended-for-too-long/
expr: gotk_suspend_status{namespace="flux-giantswarm", exported_namespace="flux-giantswarm"} > 0
for: 24h
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: "true"
severity: page
team: honeybadger
@@ -167,7 +168,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace!~".*giantswarm.*"} > 0
for: 10m
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: notify
team: honeybadger
@@ -180,7 +181,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="workload_cluster", organization!="giantswarm"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
severity: notify
cancel_if_outside_working_hours: "true"
team: honeybadger
@@ -193,7 +194,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="Kustomization", cluster_type="management_cluster", exported_namespace!~".*giantswarm.*"} > 0
for: 10m
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: notify
team: honeybadger
@@ -206,7 +207,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind="Kustomization", cluster_type="workload_cluster", organization!="giantswarm"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
severity: notify
cancel_if_outside_working_hours: "true"
team: honeybadger
@@ -219,7 +220,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind=~"GitRepository|HelmRepository|Bucket", cluster_type="management_cluster", exported_namespace!~".*giantswarm.*"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: notify
team: honeybadger
@@ -232,7 +233,7 @@ spec:
expr: gotk_reconcile_condition{type="Ready", status="False", kind=~"GitRepository|HelmRepository|Bucket", cluster_type="workload_cluster", organization!="giantswarm"} > 0
for: 2h
labels:
-area: kaas
+area: empowerment
severity: notify
cancel_if_outside_working_hours: "true"
team: honeybadger
@@ -247,7 +248,7 @@ spec:
sum(rate(controller_runtime_reconcile_time_seconds_count{app=~".*flux.*", namespace!~".*giantswarm.*"}[5m])) by (installation, cluster_id, controller)) > 60
for: 10m
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: "true"
severity: notify
team: honeybadger
@@ -261,7 +262,7 @@ spec:
sum by (name, namespace) (workqueue_unfinished_work_seconds{namespace=~"flux-giantswarm|flux-system"}) > 3600.0
for: 10m
labels:
-area: kaas
+area: empowerment
cancel_if_outside_working_hours: "true"
severity: page
team: honeybadger
1 change: 0 additions & 1 deletion test/conf/promtool_ignore
@@ -27,7 +27,6 @@ templates/alerting-rules/external-dns.rules.yml
templates/alerting-rules/fairness.rules.yml
templates/alerting-rules/falco.rules.yml
templates/alerting-rules/fluentbit.rules.yml
-templates/alerting-rules/flux.rules.yml
templates/alerting-rules/helm.rules.yml
templates/alerting-rules/ingress-controller.rules.yml
templates/alerting-rules/inhibit.all.rules.yml
28 changes: 28 additions & 0 deletions test/tests/providers/global/flux.rules.test.yml
@@ -0,0 +1,28 @@
---
rule_files:
- flux.rules.yml

tests:
- interval: 1m
input_series:
- series: 'gotk_suspend_status{installation="test", namespace="flux-giantswarm", exported_namespace="flux-giantswarm", kind="Kustomization", name="flux"}'
values: "1x60 0+1x60 1+0x1500"
alert_rule_test:
- alertname: FluxSuspendedForTooLong
eval_time: 1560m
exp_alerts:
- exp_labels:
alertname: "FluxSuspendedForTooLong"
area: "empowerment"
cancel_if_outside_working_hours: "true"
exported_namespace: "flux-giantswarm"
installation: "test"
kind: "Kustomization"
name: "flux"
namespace: "flux-giantswarm"
severity: "page"
team: "honeybadger"
topic: "releng"
exp_annotations:
description: "Flux Kustomization flux in ns flux-giantswarm on test has been suspended for 24h."
opsrecipe: "fluxcd-suspended-for-too-long/"
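
To see why the test above expects the alert at `eval_time: 1560m`, the following Python sketch replays the input series against the rule's `> 0` condition and `for: 24h` clause. It is a rough model under stated assumptions, not Prometheus's evaluation engine: one rule evaluation per minute (matching `interval: 1m`), no staleness or lookback handling, and the helper names `expand` and `first_firing_minute` are hypothetical.

```python
from typing import List, Optional


def expand(notation: str) -> List[float]:
    """Expand 'axn' and 'a+bxn' tokens into n+1 samples each (simplified)."""
    samples: List[float] = []
    for token in notation.split():
        head, repeat = token.rsplit("x", 1)
        if "+" in head:
            start_text, step_text = head.split("+", 1)
            start, step = float(start_text), float(step_text)
        else:
            # 'axn' is shorthand for 'a+0xn'.
            start, step = float(head), 0.0
        samples.extend(start + i * step for i in range(int(repeat) + 1))
    return samples


def first_firing_minute(samples: List[float], for_minutes: int) -> Optional[int]:
    """Return the first minute at which `value > 0` has held for `for_minutes`."""
    active_since: Optional[int] = None
    for minute, value in enumerate(samples):
        if value > 0:
            if active_since is None:
                active_since = minute  # alert becomes pending here
            if minute - active_since >= for_minutes:
                return minute  # pending long enough: alert fires
        else:
            active_since = None  # condition broke, pending state resets
    return None


samples = expand("1x60 0+1x60 1+0x1500")
fires_at = first_firing_minute(samples, for_minutes=24 * 60)
print(fires_at)
```

In this model the single `0` sample at minute 61 resets the pending state, so the 24-hour window restarts at minute 62 and the alert is already firing by minute 1560.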
