Merge branch 'main' into move-ownership
QuentinBisson committed Jun 10, 2024
2 parents d48f106 + 6a20ebf commit 5d27ac0
Showing 9 changed files with 53 additions and 47 deletions.
13 changes: 6 additions & 7 deletions CHANGELOG.md
@@ -7,15 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Fixed

- Fixed usage of yq, and jq in check-opsrecipes.sh
- Fetch jq with make install-tools

### Added

- Added a new alerting rule to `falco.rules.yml` to fire an alert for XZ-backdoor.
- Add `CiliumAPITooSlow`.
- Added `CiliumAPITooSlow`.
- Added `CODEOWNERS` files.

### Changed

@@ -25,12 +21,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Move the management cluster certificate alerts into the shared alerts because it is provider independent
- Review and fix phoenix alerts towards Mimir and multi-provider MCs.
- Moves cluster-autoscaler and vpa alerts to turtles.
- Reviewed turtles alerts labels.
- Use `ready` replicas for Kyverno webhooks alert.
- Moves ownership of alerts for shared components to turtles.

### Fixed

- Fix and improve the ops-recipe test script.
- Fixed usage of yq, and jq in check-opsrecipes.sh
- Fetch jq with make install-tools
- Fix and improve the check-opsrecipes.sh script so support <directory>/_index.md based ops-recipes.
- Fix cabbage alerts for multi-provider wcs.
- Fix shield alert area labels.
- Fix `cert-exporter` alerting.
9 changes: 8 additions & 1 deletion CODEOWNERS
@@ -1,2 +1,9 @@
# generated by giantswarm/github actions - changes will be overwritten
* @giantswarm/team-atlas
/helm/prometheus-rules/templates/kaas/bigmac/ @team-bigmac
/helm/prometheus-rules/templates/kaas/phoenix/ @team-phoenix
/helm/prometheus-rules/templates/kaas/rocket/ @team-rocket
/helm/prometheus-rules/templates/kaas/turtles/ @team-turtles
/helm/prometheus-rules/templates/platform/atlas/ @team-atlas
/helm/prometheus-rules/templates/platform/cabbage/ @team-cabbage
/helm/prometheus-rules/templates/platform/honeybadger/ @team-honeybadger
/helm/prometheus-rules/templates/platform/shield/ @team-shield
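The new CODEOWNERS entries rely on GitHub's documented rule that the last matching pattern in the file takes precedence, so the per-team directories override the `*` catch-all. A minimal Python sketch of that resolution follows. It is a simplification that only understands `*` and directory prefixes, not GitHub's full gitignore-style matching, and the helper name `owners_for` and the example file paths are hypothetical.

```python
# Rules copied from the CODEOWNERS hunk above, in file order.
RULES = [
    ("*", ["@giantswarm/team-atlas"]),
    ("/helm/prometheus-rules/templates/kaas/bigmac/", ["@team-bigmac"]),
    ("/helm/prometheus-rules/templates/kaas/phoenix/", ["@team-phoenix"]),
    ("/helm/prometheus-rules/templates/kaas/rocket/", ["@team-rocket"]),
    ("/helm/prometheus-rules/templates/kaas/turtles/", ["@team-turtles"]),
    ("/helm/prometheus-rules/templates/platform/atlas/", ["@team-atlas"]),
    ("/helm/prometheus-rules/templates/platform/cabbage/", ["@team-cabbage"]),
    ("/helm/prometheus-rules/templates/platform/honeybadger/", ["@team-honeybadger"]),
    ("/helm/prometheus-rules/templates/platform/shield/", ["@team-shield"]),
]


def owners_for(path, rules):
    """Return the owners of the LAST rule that matches `path`.

    Simplified matching (an assumption of this sketch): `*` matches
    everything, any other pattern is treated as a directory prefix.
    """
    matched = None
    for pattern, owners in rules:
        if pattern == "*" or path.startswith(pattern):
            matched = owners  # later rules override earlier ones
    return matched


# Hypothetical file under the turtles directory: the specific rule wins.
print(owners_for("/helm/prometheus-rules/templates/kaas/turtles/example.rules.yml", RULES))
# Anything outside the listed directories falls back to the catch-all.
print(owners_for("/CHANGELOG.md", RULES))
```

With these rules, a file outside the listed directories stays with the catch-all owner, while anything under a team directory is routed to that team.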

@@ -48,7 +48,7 @@ spec:
annotations:
description: '{{`Kubernetes API Server admission webhook {{ $labels.name }} is timing out.`}}'
opsrecipe: apiserver-admission-webhook-errors/
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="management_cluster"}[5m])) by (cluster_id, installation, pipeline, provider, name, app, le)) > 5
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="management_cluster"}[5m])) by (cluster_id, installation, pipeline, provider, name, job, le)) > 5
for: 15m
labels:
area: kaas

@@ -49,7 +49,7 @@ spec:
annotations:
description: '{{`Kubernetes API Server admission webhook {{ $labels.name }} is timing out.`}}'
opsrecipe: apiserver-admission-webhook-errors/
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name!~".*(prometheus|vpa.k8s.io|linkerd|validate.nginx.ingress.kubernetes.io|kong.konghq.com|cert-manager.io|kyverno|app-admission-controller).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, app, le)) > 5
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name!~".*(prometheus|vpa.k8s.io|linkerd|validate.nginx.ingress.kubernetes.io|kong.konghq.com|cert-manager.io|kyverno|app-admission-controller).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, job, le)) > 5
for: 15m
labels:
area: kaas
@@ -63,7 +63,7 @@ spec:
annotations:
description: '{{`Kubernetes API Server admission webhook {{ $labels.name }} is timing out.`}}'
opsrecipe: apiserver-admission-webhook-errors/
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(kyverno|app-admission-controller).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, app, le)) > 5
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(kyverno|app-admission-controller).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, job, le)) > 5
for: 15m
labels:
area: kaas
@@ -77,7 +77,7 @@ spec:
annotations:
description: '{{`Kubernetes API Server admission webhook {{ $labels.name }} is timing out.`}}'
opsrecipe: apiserver-admission-webhook-errors/
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(linkerd|validate.nginx.ingress.kubernetes.io|kong.konghq.com|cert-manager.io).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, app, le)) > 5
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(linkerd|validate.nginx.ingress.kubernetes.io|kong.konghq.com|cert-manager.io).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, job, le)) > 5
for: 15m
labels:
area: kaas
@@ -91,7 +91,7 @@ spec:
annotations:
description: '{{`Kubernetes API Server admission webhook {{ $labels.name }} is timing out.`}}'
opsrecipe: apiserver-admission-webhook-errors/
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(vpa.k8s.io).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, app, le)) > 5
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(vpa.k8s.io).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, job, le)) > 5
for: 15m
labels:
area: kaas
@@ -105,7 +105,7 @@ spec:
annotations:
description: '{{`Kubernetes API Server admission webhook {{ $labels.name }} is timing out.`}}'
opsrecipe: apiserver-admission-webhook-errors/
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(prometheus).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, app, le)) > 5
expr: histogram_quantile(0.95, sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster_type="workload_cluster", name=~".*(prometheus).*"}[5m])) by (cluster_id, installation, pipeline, provider, name, job, le)) > 5
for: 15m
labels:
area: kaas

@@ -17,7 +17,7 @@ spec:
annotations:
description: '{{`Docker memory usage on {{ $labels.instance }} is too high.`}}'
opsrecipe: docker-memory-usage-high/
expr: process_resident_memory_bytes{app="docker"} > (5 * 1024 * 1024 * 1024)
expr: process_resident_memory_bytes{job=~".*/docker-.*"} > (5 * 1024 * 1024 * 1024)
for: 15m
labels:
area: kaas

@@ -14,7 +14,7 @@ spec:
annotations:
description: '{{`Cadvisor ({{ $labels.instance }}) is down.`}}'
opsrecipe: kubelet-is-down/
expr: label_replace(up{app="cadvisor"}, "ip", "$1", "instance", "(.+):\\d+") == 0
expr: label_replace(up{job="kubelet", metrics_path="/metrics/cadvisor"}, "ip", "$1", "instance", "(.+):\\d+") == 0
for: 1h
labels:
area: kaas

@@ -26,9 +26,9 @@ spec:
description: '{{`Node {{ $labels.node }} status is flapping under load.`}}'
expr: |
(
sum(node_load15{cluster_type="management_cluster", app!="vault", role!="bastion"})
sum(node_load15{cluster_type="management_cluster", service="node-exporter"})
by (cluster_id, installation, node, pipeline, provider)
/ count(rate(node_cpu_seconds_total{cluster_type="management_cluster", app!="vault", role!="bastion", mode="idle"}[5m]))
/ count(rate(node_cpu_seconds_total{cluster_type="management_cluster", service="node-exporter", mode="idle"}[5m]))
by (cluster_id, installation, node, pipeline, provider)
) >= 2
unless on (cluster_id, installation, node, pipeline, provider) (
@@ -101,9 +101,9 @@ spec:
annotations:
description: '{{`Machine {{ $labels.node }} CPU load is too high.`}}'
expr: |
sum(node_load5{cluster_type="management_cluster", app!="vault", role!="bastion"})
sum(node_load5{cluster_type="management_cluster", service="node-exporter"})
by (node, cluster_id, installation, pipeline, provider) > 2
* count(rate(node_cpu_seconds_total{cluster_type="management_cluster", mode="idle", app!="vault", role!="bastion"}[5m]))
* count(rate(node_cpu_seconds_total{cluster_type="management_cluster", mode="idle", service="node-exporter"}[5m]))
by (node, cluster_id, installation, pipeline, provider)
for: 3m
labels:

@@ -12,9 +12,9 @@ spec:
rules:
- alert: OperatorkitErrorRateTooHighHoneybadger
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }} has reported errors. Please check logs.`}}'
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }} has reported errors. Please check logs.`}}'
opsrecipe: check-operator-error-rate-high/
expr: operatorkit_controller_error_total{app=~"app-operator.*|chart-operator.*"} > 5
expr: operatorkit_controller_error_total{pod=~"app-operator.*|chart-operator.*"} > 5
for: 1m
labels:
area: kaas
@@ -23,8 +23,8 @@ spec:
topic: qa
- alert: OperatorNotReconcilingHoneybadger
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }} not reconciling controller {{$labels.controller}}. Please check logs.`}}'
expr: (time() - operatorkit_controller_last_reconciled{app=~"app-operator.*|chart-operator.*"}) / 60 > 30
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }} not reconciling controller {{$labels.controller}}. Please check logs.`}}'
expr: (time() - operatorkit_controller_last_reconciled{pod=~"app-operator.*|chart-operator.*"}) / 60 > 30
for: 10m
labels:
area: managedservices
@@ -33,9 +33,9 @@ spec:
topic: releng
- alert: OperatorkitErrorRateTooHighPhoenix
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }}@{{ $labels.app_version }} has reported errors. Please check the logs.`}}'
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }}@{{ $labels.app_version }} has reported errors. Please check the logs.`}}'
opsrecipe: check-operator-error-rate-high/
expr: rate(operatorkit_controller_error_total{app=~"aws-.*"}[5m]) > 1
expr: rate(operatorkit_controller_error_total{pod=~"aws-.*"}[5m]) > 1
for: 10m
labels:
area: kaas
@@ -47,9 +47,9 @@ spec:
# be paged to be able to fix the issue immediately.
- alert: OperatorkitErrorRateTooHighAWS
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }}@{{ $labels.app_version }} has reported errors. Please check the logs.`}}'
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }}@{{ $labels.app_version }} has reported errors. Please check the logs.`}}'
opsrecipe: check-operator-error-rate-high/
expr: operatorkit_controller_error_total{app=~"aws-operator.+|cluster-operator.+"} > 5
expr: operatorkit_controller_error_total{pod=~"aws-operator.+|cluster-operator.+"} > 5
for: 1m
labels:
area: kaas
@@ -62,9 +62,9 @@ spec:
# wrong to fix the root cause eventually.
- alert: OperatorkitCRNotDeletedAWS
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }}@{{ $labels.app_version }} has not deleted object {{ $labels.namespace }}/{{ $labels.name }} of type {{ $labels.kind }} for too long.`}}'
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }}@{{ $labels.app_version }} has not deleted object {{ $labels.namespace }}/{{ $labels.name }} of type {{ $labels.kind }} for too long.`}}'
opsrecipe: check-not-deleted-object/
expr: (time() - operatorkit_controller_deletion_timestamp{app=~"aws-operator.+|cluster-operator.+", provider="aws"}) > 18000
expr: (time() - operatorkit_controller_deletion_timestamp{pod=~"aws-operator.+|cluster-operator.+", provider="aws"}) > 18000
for: 5m
labels:
area: kaas
@@ -75,9 +75,9 @@ spec:
# be paged to be able to fix the issue immediately.
- alert: OperatorNotReconcilingAWS
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }}@{{ $labels.app_version }} has stopped the reconciliation. Please check logs.`}}'
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }}@{{ $labels.app_version }} has stopped the reconciliation. Please check logs.`}}'
opsrecipe: operator-not-reconciling/
expr: (sum by (cluster_id, installation, pipeline, provider, instance, app, app_version, namespace)(increase(operatorkit_controller_event_count{app=~"aws-operator.+|cluster-operator.+"}[10m])) == 0 and on (cluster_id, instance) (operatorkit_controller_deletion_timestamp or operatorkit_controller_creation_timestamp))
expr: (sum by (cluster_id, installation, pipeline, provider, instance, pod, app_version, namespace)(increase(operatorkit_controller_event_count{pod=~"aws-operator.+|cluster-operator.+"}[10m])) == 0 and on (cluster_id, instance) (operatorkit_controller_deletion_timestamp or operatorkit_controller_creation_timestamp))
for: 20m
labels:
area: kaas
@@ -90,9 +90,9 @@ spec:
# be paged to be able to fix the issue immediately.
- alert: OperatorkitErrorRateTooHighKaas
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }}@{{ $labels.app_version }} has reported errors. Please check the logs.`}}'
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }}@{{ $labels.app_version }} has reported errors. Please check the logs.`}}'
opsrecipe: check-operator-error-rate-high/
expr: operatorkit_controller_error_total{app=~"ignition-operator|cert-operator|node-operator"} > 5
expr: operatorkit_controller_error_total{pod=~"ignition-operator.*|cert-operator.*|node-operator.*"} > 5
for: 1m
labels:
area: kaas
@@ -103,9 +103,9 @@ spec:
# be paged to be able to fix the issue immediately.
- alert: OperatorNotReconcilingProviderTeam
annotations:
description: '{{`{{ $labels.namespace }}/{{ $labels.app }}@{{ $labels.app_version }} has stopped the reconciliation. Please check logs.`}}'
description: '{{`{{ $labels.namespace }}/{{ $labels.pod }}@{{ $labels.app_version }} has stopped the reconciliation. Please check logs.`}}'
opsrecipe: operator-not-reconciling/
expr: (sum by (cluster_id, installation, pipeline, provider, instance, app, app_version, namespace)(increase(operatorkit_controller_event_count{app="node-operator"}[10m])) == 0 and on (cluster_id, instance) (operatorkit_controller_deletion_timestamp or operatorkit_controller_creation_timestamp))
expr: (sum by (cluster_id, installation, pipeline, provider, instance, pod, app_version, namespace)(increase(operatorkit_controller_event_count{pod=~"node-operator.*"}[10m])) == 0 and on (cluster_id, instance) (operatorkit_controller_deletion_timestamp or operatorkit_controller_creation_timestamp))
for: 20m
labels:
area: kaas
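
Several operatorkit hunks above gain a trailing `.*` when the selector moves from `app` to `pod` (for example `node-operator` becomes `node-operator.*`). PromQL regex matchers are fully anchored, so a pattern without a wildcard only matches the exact label value, and pod names normally carry generated suffixes. The Python sketch below imitates that anchoring with `re.fullmatch`. Python's `re` is not the RE2 engine Prometheus uses, but the two agree for these simple patterns. The pod name is hypothetical.

```python
import re


def promql_regex_match(pattern: str, value: str) -> bool:
    """Imitate PromQL `=~`: the pattern must match the whole label value."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical pod name; real suffixes come from the ReplicaSet and pod hashes.
pod = "node-operator-7d9f8b6c5-x2x4q"

# Pattern as it was written for the `app` label.
without_wildcard = "ignition-operator|cert-operator|node-operator"
# Pattern as rewritten for the `pod` label in this commit.
with_wildcard = "ignition-operator.*|cert-operator.*|node-operator.*"

print(promql_regex_match(without_wildcard, pod))  # False: suffix is not covered
print(promql_regex_match(with_wildcard, pod))     # True: `.*` absorbs the suffix
```

The same reasoning explains why `app="node-operator"` becomes `pod=~"node-operator.*"` rather than `pod="node-operator"`: an equality matcher on `pod` would never match a suffixed pod name.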