Merge branch 'main' into fix-dex-absent-query-for-mimir
QuentinBisson authored Jun 10, 2024
2 parents e071c9c + 00856f0 commit 7b967ac
Showing 20 changed files with 21 additions and 20 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -25,13 +25,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Use `ready` replicas for Kyverno webhooks alert.
- Moves ownership of alerts for shared components to turtles.


### Fixed

- Fixed usage of yq, and jq in check-opsrecipes.sh
- Fetch jq with make install-tools
- Fix and improve the check-opsrecipes.sh script to support <directory>/_index.md based ops-recipes.
- Fix cabbage alerts for multi-provider wcs.
- Fix shield alert area labels.
- Fix a few area labels.
- Fix `cert-exporter` alerting.

### Removed
4 changes: 0 additions & 4 deletions helm/prometheus-rules/templates/_helpers.tpl
@@ -38,10 +38,6 @@ giantswarm.io/service-type: {{ .Values.serviceType }}
{{- end -}}
{{- end -}}

{{- define "isBastionBeingMonitored" -}}
{{ not (eq .Values.managementCluster.provider.flavor "capi") }}
{{- end -}}

{{- define "namespaceNotGiantswarm" -}}
"(([^g]|g[^i]|gi[^a]|gia[^n]|gian[^t]|giant[^s]|giants[^w]|giantsw[^a]|giantswa[^r]|giantswar[^m])*)"
{{- end -}}

@@ -20,7 +20,7 @@ spec:
expr: sum(increase(http_requests_total{app="dex", handler!="/token", code=~"^[4]..$|[5]..$", cluster_type="management_cluster"}[5m])) by (cluster_id, installation, pipeline, provider) > 10
for: 30m
labels:
area: managedapps
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: bigmac

@@ -21,7 +21,7 @@ spec:
expr: sum(increase(aws_api_calls_total{error_code != ""}[20m])) by (error_code,namespace,pod,cluster_id) > 0
for: 40m
labels:
area: managedservices
area: kaas
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -36,7 +36,7 @@ spec:
expr: sum(increase(controller_runtime_reconcile_total{service="aws-load-balancer-controller", result = "error"}[20m])) by (controller,namespace,pod,cluster_id) > 0
for: 40m
labels:
area: managedservices
area: kaas
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"

@@ -116,7 +116,7 @@ spec:
labels:
area: kaas
severity: notify
team: {{ include "providerTeam" . }}
team: phoenix
topic: aws
- alert: ServiceUsageApproachingLimit
annotations:
@@ -127,7 +127,7 @@ spec:
labels:
area: kaas
severity: notify
team: {{ include "providerTeam" . }}
team: phoenix
topic: aws
- alert: ManagementClusterContainerIsRestartingTooFrequentlyAWS
annotations:

@@ -22,7 +22,7 @@ spec:
expr: cluster_service_key_pair_total > 2400
for: 10m
labels:
area: storage
area: kaas
severity: page
team: phoenix
topic: managementcluster

@@ -20,7 +20,7 @@ spec:
expr: increase(kiam_metadata_find_role_errors_total[10m]) > 0
for: 15m
labels:
area: managedservices
area: kaas
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"

@@ -44,6 +44,7 @@ spec:
team: {{ include "providerTeam" . }}
topic: kubernetes

## TODO Split this alert into multiple alerts for each webhook.
# Webhooks that are not explicitly owned by any team (customer owned ones).
- alert: WorkloadClusterWebhookDurationExceedsTimeoutSolutionEngineers
annotations:

@@ -1,4 +1,5 @@
{{- if eq (include "isBastionBeingMonitored" .) "true" }}
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
## TODO Remove when all vintage installations are gone
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:

@@ -20,7 +20,7 @@ spec:
expr: kubelet_volume_stats_available_bytes{cluster_type="management_cluster", persistentvolumeclaim=~".*(alertmanager|loki|mimir|prometheus|pyroscope|tempo).*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*(alertmanager|loki|mimir|prometheus|pyroscope|tempo).*"} < 0.10
for: 1h
labels:
area: empowerment
area: platform
cancel_if_outside_working_hours: "true"
severity: page
team: atlas

@@ -4,11 +4,11 @@ metadata:
creationTimestamp: null
labels:
{{- include "labels.common" . | nindent 4 }}
name: network.all.rules
name: network.rules
namespace: {{ .Values.namespace }}
spec:
groups:
- name: network.all
- name: network
rules:
- alert: DNSErrorRateTooHigh
annotations:
@@ -44,6 +44,7 @@ spec:
severity: page
team: cabbage
topic: network
## TODO Sort those alerts ownership
- alert: NetworkErrorRateTooHigh
annotations:
description: '{{`Network error rate is too high for {{ or $labels.pod_name $labels.instance }} to {{ $labels.host }}.`}}'

@@ -1,3 +1,4 @@
## Cabbage is the only user of those recording rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:

@@ -4,7 +4,7 @@ metadata:
creationTimestamp: null
labels:
{{- include "labels.common" . | nindent 4 }}
name: kyverno.all.rules
name: kyverno.rules
namespace: {{ .Values.namespace }}
spec:
groups:

@@ -17,7 +17,7 @@ spec:
expr: label_replace(up{app=~"chart-operator.*"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0
for: 15m
labels:
area: managedservices
area: platform
cancel_if_cluster_control_plane_unhealthy: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
2 changes: 1 addition & 1 deletion loki/update.sh
@@ -11,7 +11,7 @@ set -e

BRANCH="main"
MIXIN_URL=https://github.com/grafana/loki/production/loki-mixin@$BRANCH
OUTPUT_FILE="$(pwd)"/helm/prometheus-rules/templates/shared/recording-rules/loki-mixins.rules.yml
OUTPUT_FILE="$(pwd)"/helm/prometheus-rules/templates/platform/atlas/recording-rules/loki-mixins.rules.yml

cd loki
rm -rf vendor jsonnetfile.* "$OUTPUT_FILE"
2 changes: 1 addition & 1 deletion mimir/update.sh
@@ -11,7 +11,7 @@ set -e

BRANCH="main"
MIXIN_URL=https://github.com/grafana/mimir/operations/mimir-mixin@$BRANCH
OUTPUT_FILE="$(pwd)"/helm/prometheus-rules/templates/shared/recording-rules/mimir-mixins.rules.yml
OUTPUT_FILE="$(pwd)"/helm/prometheus-rules/templates/platform/atlas/recording-rules/mimir-mixins.rules.yml

cd mimir
rm -rf vendor jsonnetfile.* "$OUTPUT_FILE"