feat(kube-prom-stack): added some extra rules #749

Merged (1 commit) on May 13, 2024

thiagoalmeidasa
Owner

No description provided.

github-actions bot added the area/kubernetes label (Changes made in the kubernetes directory) on May 13, 2024

github-actions bot commented May 13, 2024

--- kubernetes/apps/monitoring/kube-prometheus-stack/app Kustomization: flux-system/cluster-apps-kube-prometheus-stack HelmRelease: monitoring/kube-prometheus-stack

+++ kubernetes/apps/monitoring/kube-prometheus-stack/app Kustomization: flux-system/cluster-apps-kube-prometheus-stack HelmRelease: monitoring/kube-prometheus-stack

@@ -35,20 +35,12 @@

                 requests:
                   storage: 1Gi
               storageClassName: longhorn
       config:
         global:
           resolve_timeout: 5m
-        inhibit_rules:
-        - equal:
-          - alertname
-          - namespace
-          source_match:
-            severity: critical
-          target_match:
-            severity: warning
         receivers:
         - name: 'null'
         - name: pagerduty
           pagerduty_configs:
           - service_key: ${PAGERDUTY_KEY}
         - name: dead-mans-switch
@@ -69,13 +61,12 @@

               alertname: Watchdog
             receiver: dead-mans-switch
             repeat_interval: 6m
           - continue: true
             matchers:
             - severity=~"critical|page"
-            - alertname!="KubeAPIErrorBudgetBurn"
             receiver: pagerduty
       fullnameOverride: alertmanager
       ingress:
         annotations:
           hajimari.io/appName: Alert Manager
           hajimari.io/enable: 'true'
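
The hunk above drops the `alertname!="KubeAPIErrorBudgetBurn"` matcher from the pagerduty route, so a critical `KubeAPIErrorBudgetBurn` alert would now match that route. The following is a minimal Python sketch of that matcher logic, not Alertmanager's own code. It assumes that all matchers on a route must hold, that regex matchers are fully anchored, and that a missing label is treated as an empty string; the helper names `matches` and `route_matches` are hypothetical.

```python
import re


def matches(matcher, labels):
    """Evaluate one (label, operator, value) matcher against an alert's labels."""
    name, op, value = matcher
    actual = labels.get(name, "")  # assumption: a missing label behaves like ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        # Regex matchers are treated as anchored, so the whole value must match.
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


def route_matches(matchers, labels):
    """A route matches only if every one of its matchers holds."""
    return all(matches(m, labels) for m in matchers)


# Matchers of the pagerduty route before and after this change.
OLD_MATCHERS = [
    ("severity", "=~", "critical|page"),
    ("alertname", "!=", "KubeAPIErrorBudgetBurn"),
]
NEW_MATCHERS = [("severity", "=~", "critical|page")]

budget_burn = {"alertname": "KubeAPIErrorBudgetBurn", "severity": "critical"}
warning = {"alertname": "KubeAPIErrorBudgetBurn", "severity": "warning"}

print(route_matches(OLD_MATCHERS, budget_burn))  # excluded by the old route
print(route_matches(NEW_MATCHERS, budget_burn))  # matched by the new route
print(route_matches(NEW_MATCHERS, warning))      # warnings still do not match
```

Python's `re` and Alertmanager's regex engine differ in some syntax, but for a plain alternation such as `critical|page` they behave the same.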
--- kubernetes/apps/monitoring/kube-prometheus-stack/app Kustomization: flux-system/cluster-apps-kube-prometheus-stack PrometheusRule: monitoring/miscellaneous-rules

+++ kubernetes/apps/monitoring/kube-prometheus-stack/app Kustomization: flux-system/cluster-apps-kube-prometheus-stack PrometheusRule: monitoring/miscellaneous-rules

@@ -0,0 +1,86 @@

+---
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    kustomize.toolkit.fluxcd.io/name: cluster-apps-kube-prometheus-stack
+    kustomize.toolkit.fluxcd.io/namespace: flux-system
+  name: miscellaneous-rules
+  namespace: monitoring
+spec:
+  groups:
+  - name: k8s.rules
+    rules:
+    - alert: KubernetesContainerOomKiller
+      annotations:
+        description: |-
+          Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.
+            VALUE = {{ $value }}
+            LABELS = {{ $labels }}
+        summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
+      expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total
+        offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m])
+        == 1
+      for: 0m
+      labels:
+        severity: critical
+    - alert: KubernetesPodCrashLooping
+      annotations:
+        description: |-
+          Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping
+            VALUE = {{ $value }}
+            LABELS = {{ $labels }}
+        summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
+      expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
+      for: 2m
+      labels:
+        severity: critical
+    - alert: KubernetesNodeMemoryPressure
+      annotations:
+        description: |-
+          Node {{ $labels.node }} has MemoryPressure condition
+            VALUE = {{ $value }}
+            LABELS = {{ $labels }}
+        summary: Kubernetes Node memory pressure (instance {{ $labels.instance }})
+      expr: kube_node_status_condition{condition="MemoryPressure",status="true"} ==
+        1
+      for: 2m
+      labels:
+        severity: critical
+    - alert: KubernetesNodeDiskPressure
+      annotations:
+        description: |-
+          Node {{ $labels.node }} has DiskPressure condition
+            VALUE = {{ $value }}
+            LABELS = {{ $labels }}
+        summary: Kubernetes Node disk pressure (instance {{ $labels.instance }})
+      expr: kube_node_status_condition{condition="DiskPressure",status="true"} ==
+        1
+      for: 2m
+      labels:
+        severity: critical
+    - alert: KubernetesNodeNotReady
+      annotations:
+        description: |-
+          Node {{ $labels.node }} has been unready for a long time
+            VALUE = {{ $value }}
+            LABELS = {{ $labels }}
+        summary: Kubernetes Node not ready (instance {{ $labels.instance }})
+      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
+      for: 10m
+      labels:
+        severity: critical
+    - alert: KubernetesVolumeFullInFourDays
+      annotations:
+        description: |-
+          Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.
+            VALUE = {{ $value }}
+            LABELS = {{ $labels }}
+        summary: Kubernetes Volume full in four days (instance {{ $labels.instance
+          }})
+      expr: predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600)
+        < 0
+      for: 0m
+      labels:
+        severity: critical
+
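
The `KubernetesVolumeFullInFourDays` rule above relies on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it. The sketch below illustrates that idea in Python under simplifying assumptions: an ordinary least-squares fit over `(timestamp, value)` pairs, extrapolated `horizon` seconds past the evaluation time. The function and the sample data are hypothetical and are not Prometheus's implementation.

```python
def predict_linear(samples, eval_time, horizon):
    """Least-squares line through (timestamp, value) samples,
    evaluated `horizon` seconds after `eval_time`."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    # Shift timestamps so that x = 0 is the evaluation time.
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon


GIB = 2**30
EVAL_TIME = 1_715_600_000      # arbitrary evaluation timestamp (seconds)
HORIZON = 4 * 24 * 3600        # four days, as in the rule expression
STEP = 300                     # 5m resolution, as in the [6h:5m] subquery
TIMESTAMPS = [EVAL_TIME - 6 * 3600 + i * STEP for i in range(73)]

# Hypothetical volume losing 1 GiB of free space per hour, 10 GiB left now.
filling = [(t, 10 * GIB + (EVAL_TIME - t) / 3600 * GIB) for t in TIMESTAMPS]
# Hypothetical volume with a constant 10 GiB available.
steady = [(t, 10 * GIB) for t in TIMESTAMPS]

predicted_filling = predict_linear(filling, EVAL_TIME, HORIZON)
predicted_steady = predict_linear(steady, EVAL_TIME, HORIZON)

print(predicted_filling < 0)   # the rule condition holds: alert would fire
print(predicted_steady < 0)    # no downward trend: alert would not fire
```

In this sketch the result is an extrapolated byte count (negative once the line crosses zero), so the `%` in the rule's description annotation would not be showing a percentage.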

@thiagoalmeidasa thiagoalmeidasa merged commit 7e110ac into main May 13, 2024
4 checks passed
@thiagoalmeidasa thiagoalmeidasa deleted the updated-prom-rules branch May 13, 2024 18:05