Commit

feat: [PAYMCLOUD-168] aks module migrate metrics alert to log alerts (#375)

* Add custom log alerts for Kubernetes monitoring

Introduced a new variable for custom log alerts and integrated azurerm_monitor_scheduled_query_rules_alert resource. Updated README and variables files to include the new configurations and descriptions for log alert criteria.

* Update AKS alert action settings

Modified the alert action settings to use default values for optional parameters. The email subject and custom webhook payload now have fallback values to ensure proper alert content even when not explicitly set.

* Fix typo in variable name for custom logs alerts

Renamed variable "custom_log_alerts" to "custom_logs_alerts" to ensure consistency with the rest of the codebase. Updated variable usage to reflect the new name.

* Adjust action_group to use toset() for better compatibility

Converted the action_group field to use toset() so it handles lists properly and maintains type consistency. This improves compatibility across Terraform configurations and prevents potential type errors.
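A minimal sketch of the change (the attribute path is assumed, since the diff does not show this intermediate state):

```hcl
# toset() converts a list to a set of strings, deduplicating values and
# making the element type explicit, so differing caller-side list/tuple
# types unify into one consistent type
action_group = toset(each.value.action_group)
```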

* Disable custom email subject and webhook payload

Commented out the `email_subject` and `custom_webhook_payload` parameters in the AKS monitoring alert configuration files. This temporarily disables custom email subjects and webhook payloads for alerts, possibly to standardize notifications or troubleshoot issues.

* Refactor alert action configuration and enable email subject.

Revise the `action` block to concatenate action group IDs and reinstate optional fields for email subject and custom webhook payload. This improves the flexibility of alert configurations and the manageability of action settings.

* Add local log alerts for node readiness and disk usage

Replaced previous monitoring alert configurations for node readiness and disk usage with local log alerts to leverage dynamic query construction. This allows more precise alerting based on the Kubernetes cluster ID and other metrics.

* Refactor monitoring alert queries to multiline format

Convert long queries into multiline format using KQL blocks for better readability and maintainability. Adjusted the `node_not_ready` and `node_disk_usage` queries without changing their logic or functionality.

* Fix escape sequences in monitoring alert queries

Corrected the escape sequences in the KQL queries for `node_not_ready` and `node_disk_usage` monitoring alerts. This ensures proper evaluation of Kubernetes cluster IDs within the queries.
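The corrected pattern, as it appears in the final diff below, uses a heredoc so the double quotes are literal KQL syntax and only the Terraform interpolation is substituted (the earlier, broken escaping is not shown in the diff):

```hcl
# Heredoc form: the quotes around the interpolated cluster ID are plain
# KQL; Terraform substitutes only the ${...} expression before the query
# is sent to Azure
query = <<-KQL
  KubeNodeInventory
  | where ClusterId == "${azurerm_kubernetes_cluster.this.id}"
  | where Status == "NotReady"
KQL
```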

* Fix KQL queries by correcting string escaping

Corrected the string escaping in two KQL queries for monitoring alerts in the Terraform script. This change ensures that the KQL queries properly match the intended conditions and improve the reliability of monitoring alerts.

* Fix syntax errors in KQL queries for monitoring alerts

Removed extraneous quotation marks in KQL queries used for monitoring alerts to ensure proper execution and accurate alerting. This change resolves issues where KQL queries were not interpreted correctly due to syntax errors.

* Upgrade `azurerm_monitor_scheduled_query_rules_alert` to v2

Refactor the monitoring alerts to use `azurerm_monitor_scheduled_query_rules_alert_v2`. This includes additional attributes for better alert configuration and updated alert definitions in both the resource and variables files.

* Remove `scopes` field from monitoring alerts variable

The `scopes` field has been removed from the `99_variables_monitoring_alerts.tf` file as it was required and would force a new resource creation. This update simplifies the configuration and removes the restriction of having exactly one resource ID in the scopes list.

* Change variable types in monitoring alerts

Modified `window_duration` and `evaluation_frequency` from number to string in the monitoring alerts variables. This ensures compatibility with ISO 8601 duration format and aligns with documented possible values.
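Grounded in the variable definitions later in this diff, the types now read:

```hcl
# ISO 8601 durations are strings such as "PT30M" (30 minutes);
# a numeric type could not represent documented values like "P1D"
window_duration      = optional(string) # e.g. "PT30M"
evaluation_frequency = string           # e.g. "PT10M"
```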

* Fix incorrect lookup key in AKS monitoring configuration

Corrected the key used in the lookup function for workspace_alerts_storage_enabled. This change ensures the configuration correctly retrieves the value from the input map, preventing potential runtime errors.

* Reduce alert evaluation periods to improve responsiveness

Changed the number of evaluation periods from 3 to 1 for monitoring alerts. This adjustment aims to decrease the time required to trigger alerts, enhancing responsiveness to potential issues.

* Reduce minimum failing periods for alert triggering

Changed the `minimum_failing_periods_to_trigger_alert` from 3 to 1 in the alert configuration files. This adjustment will allow alerts to trigger more quickly, improving the system's responsiveness to potential issues.

* Update evaluation frequency to 5 minutes

Set the evaluation frequency of monitoring alerts to 5 minutes for better consistency. This change applies to both the status and avgDiskUsage monitoring alert configurations.

* Update metric column names and add KQL aggregation

Changed `metric_measure_column` values to include prefixes for clarity. Added aggregation method in KQL query to summarize average disk usage results.

* Update metric measure column name in monitoring alerts

Changed the metric_measure_column field from "count_Status" to "count_". This adjustment ensures the column name aligns with the updated schema and prevents potential mismatches during data aggregation.

* Adjust monitoring alert settings for longer evaluation periods

Updated the monitoring alerts KQL queries to extend the time window and evaluation frequency. This helps in reducing the noise from frequent but short-lived issues, providing a more accurate set of alerts for the system's actual status.

* Update alert display names for AKS nodes

Revised the display names for node readiness and disk usage alerts in the Kubernetes cluster monitoring configuration. Now, the display names include the AKS cluster name for better identification and clarity in alert notifications.

* Make alert configuration more resilient.

Added lookup functions to provide default values for alert configurations, ensuring they are more robust against missing or undefined values. This improves stability and reduces the probability of runtime errors due to missing configuration fields.
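This matches the pattern visible in the resource below: `lookup(map, key, default)` returns the default when the key is absent, so partially specified alert objects no longer cause errors.

```hcl
# Fall back to sensible defaults when the alert object omits these keys
auto_mitigation_enabled = lookup(each.value, "auto_mitigation_enabled", true)
skip_query_validation   = lookup(each.value, "skip_query_validation", true)
```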

* Make alert parameters optional

Updated several alert parameters to be optional in `99_variables_monitoring_alerts.tf`. This change allows for more flexible configurations and defaults, improving usability and customization of alerts.

* Enable skip_query_validation in AKS monitoring

Changed the `skip_query_validation` default to true in the AKS monitoring script. This bypasses query validation checks, streamlining alert configuration and potentially reducing deployment issues.

* Fix Markdown formatting in README.md

Corrected the Markdown formatting for several code blocks in the README.md file. This change improves the readability and consistency of the documentation.

* Format Markdown tables to improve readability

This commit adjusts the formatting of Markdown tables in the README.md file to improve their readability. It removes unnecessary slashes in some table rows, converting them into a more uniform and cleaner format.

* Update README.md for consistent Markdown formatting

Updated the README.md files for `kubernetes_cluster` and `kubernetes_cluster_udr` to ensure consistent Markdown formatting across all sections. This mainly involved changing line breaks in code blocks to improve readability and maintain a uniform style.

* Update alert configurations for disk usage monitoring

Renamed dimensions and adjusted alert parameters for disk usage. Changed threshold, window duration, evaluation frequency, and refined the metric to focus on 'Computer' instead of 'AvgDiskUsage' directly.

* Fix metric measure column name in monitoring alerts variable

Corrected the metric measure column from "any_AvgDiskUsage" to "AvgDiskUsage" in the monitoring alerts configuration. This change ensures the metric measure is correctly referenced, preventing potential errors in alert triggers.

* Change alert severity from 1 to 2

Adjusted the severity of the monitoring alert for disk usage. This change re-prioritizes the alert level, likely based on a revised risk assessment or operational need.

* Remove deprecated log_alerts_application_insight_id variable

Deleted the log_alerts_application_insight_id variable from 99_variables.tf and updated the README accordingly to reflect this change. This variable is no longer needed in our configuration setup.

* Add notes on new custom alerts and decommissioned metrics

Introduce new variables for custom log alerts and detail mandatory changes due to Azure's decommission of certain metric alerts from May 2024. Specify which metrics will be phased out in version v8.57.0 of the module.

* Refactor metric alerts and add OOMKilled log alert.

Revised the configuration for metric alerts in the README, consolidating them into `default_metric_alerts` and `custom_metric_alerts` variables. Added a new custom log alert for detecting OOMKilled pods with specific parameters for alerting.
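A hypothetical entry for `custom_logs_alerts` illustrating such an OOMKilled alert; the KQL and all field values here are illustrative assumptions, not taken from the module's README:

```hcl
oom_killed = {
  display_name = "AKS-PODS-OOMKILLED"
  description  = "Detects pods terminated with reason OOMKilled"
  query        = <<-KQL
    KubePodInventory
    | where ContainerStatusReason == "OOMKilled"
    | summarize count() by Name, Namespace
  KQL
  severity                                 = 1
  window_duration                          = "PT30M"
  evaluation_frequency                     = "PT10M"
  operator                                 = "GreaterThan"
  threshold                                = 0
  time_aggregation_method                  = "Count"
  dimension                                = []
  minimum_failing_periods_to_trigger_alert = 1
  number_of_evaluation_periods             = 1
}
```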

* Adjust monitoring alert timings for more frequent evaluations

Reduce the window duration from 1 hour to 30 minutes and the evaluation frequency from 15 minutes to 10 minutes. This change enables quicker detection of issues in the Kubernetes cluster.

* Add severity to monitoring alerts

Added a severity field to the monitoring alerts configuration, allowing more granular control over alert prioritization. Default severity is set to 3, with the ability to customize this value per alert.
ffppa authored Nov 20, 2024
1 parent dfe8273 commit 63f6181
Showing 3 changed files with 276 additions and 224 deletions.
59 changes: 59 additions & 0 deletions kubernetes_cluster/02_monitor_aks.tf
@@ -10,6 +10,7 @@ resource "azurerm_monitor_metric_alert" "this" {
frequency = each.value.frequency
window_size = each.value.window_size
enabled = var.alerts_enabled
severity = lookup(each.value, "severity", 3)

dynamic "action" {
for_each = var.action
@@ -44,6 +45,64 @@
]
}

resource "azurerm_monitor_scheduled_query_rules_alert_v2" "this" {
for_each = local.log_alerts

name = "${azurerm_kubernetes_cluster.this.name}-${upper(each.key)}"
description = each.value.description
display_name = each.value.display_name
enabled = var.alerts_enabled

resource_group_name = var.resource_group_name
scopes = [azurerm_kubernetes_cluster.this.id]
location = var.location
evaluation_frequency = each.value.evaluation_frequency
window_duration = each.value.window_duration

# Severity of the alert; 0 is the most severe, 4 the least.
severity = each.value.severity

criteria {
query = each.value.query
operator = each.value.operator
threshold = each.value.threshold
time_aggregation_method = lookup(each.value, "time_aggregation_method", "Average")

resource_id_column = each.value.resource_id_column
metric_measure_column = lookup(each.value, "metric_measure_column", null)

dynamic "dimension" {
for_each = each.value.dimension
content {
name = dimension.value.name
operator = dimension.value.operator
values = dimension.value.values
}
}

failing_periods {
minimum_failing_periods_to_trigger_alert = lookup(each.value, "minimum_failing_periods_to_trigger_alert", 1)
number_of_evaluation_periods = lookup(each.value, "number_of_evaluation_periods", 1)
}
}

auto_mitigation_enabled = lookup(each.value, "auto_mitigation_enabled", true)
workspace_alerts_storage_enabled = lookup(each.value, "workspace_alerts_storage_enabled", false)
skip_query_validation = lookup(each.value, "skip_query_validation", true)

action {
// Concatenate all action group IDs into a single list of strings
action_groups = [for g in var.action : g.action_group_id]
custom_properties = {}
}

tags = var.tags

depends_on = [
azurerm_kubernetes_cluster.this
]
}

resource "azurerm_monitor_diagnostic_setting" "aks" {
count = var.sec_log_analytics_workspace_id != null ? 1 : 0
name = "LogSecurity"
200 changes: 163 additions & 37 deletions kubernetes_cluster/99_variables_monitoring_alerts.tf
@@ -15,6 +15,8 @@ variable "default_metric_alerts" {
# criteria.0.operator to be one of [Equals NotEquals GreaterThan GreaterThanOrEqual LessThan LessThanOrEqual]
operator = string
threshold = number
# Possible values are 0, 1, 2, 3 and 4. Defaults to 3.
severity = optional(number)
# Possible values are PT1M, PT5M, PT15M, PT30M and PT1H
frequency = string
# Possible values are PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H and P1D.
@@ -39,6 +41,7 @@
metric_name = "node_cpu_usage_percentage"
operator = "GreaterThan"
threshold = 80
severity = 2
frequency = "PT15M"
window_size = "PT1H"
dimension = [
@@ -56,6 +59,7 @@
metric_name = "node_memory_working_set_percentage"
operator = "GreaterThan"
threshold = 80
severity = 2
frequency = "PT15M"
window_size = "PT1H"
dimension = [
@@ -66,49 +70,13 @@
}
],
}
node_disk = {
aggregation = "Average"
metric_namespace = "Microsoft.ContainerService/managedClusters"
metric_name = "node_disk_usage_percentage"
operator = "GreaterThan"
threshold = 80
frequency = "PT15M"
window_size = "PT1H"
dimension = [
{
name = "node"
operator = "Include"
values = ["*"]
},
{
name = "device"
operator = "Include"
values = ["*"]
}
]
}
node_not_ready = {
aggregation = "Average"
metric_namespace = "Microsoft.ContainerService/managedClusters"
metric_name = "kube_node_status_condition"
operator = "GreaterThan"
threshold = 0
frequency = "PT15M"
window_size = "PT1H"
dimension = [
{
name = "status2"
operator = "Include"
values = ["NotReady"]
}
],
}
pods_failed = {
aggregation = "Average"
metric_namespace = "Microsoft.ContainerService/managedClusters"
metric_name = "kube_pod_status_phase"
operator = "GreaterThan"
threshold = 0
severity = 1
frequency = "PT15M"
window_size = "PT1H"
dimension = [
@@ -160,6 +128,160 @@
}))
}

# Log alerts are defined as locals because interpolation is needed to build the queries correctly
locals {
default_logs_alerts = {
### NODE NOT READY ALERT
node_not_ready = {
display_name = "${azurerm_kubernetes_cluster.this.name}-NODE-NOT-READY"
      description  = "Detects nodes that are not ready on the AKS cluster"
query = <<-KQL
KubeNodeInventory
| where ClusterId == "${azurerm_kubernetes_cluster.this.id}"
| where TimeGenerated > ago(15m)
| where Status == "NotReady"
| summarize count() by Computer, Status
KQL
severity = 1
window_duration = "PT30M"
evaluation_frequency = "PT10M"
operator = "GreaterThan"
threshold = 1
time_aggregation_method = "Average"
resource_id_column = "Status"
metric_measure_column = "count_"
dimension = [
{
name = "Computer"
operator = "Include"
values = ["*"]
}
]
minimum_failing_periods_to_trigger_alert = 1
number_of_evaluation_periods = 1
auto_mitigation_enabled = true
workspace_alerts_storage_enabled = false
skip_query_validation = true
}
### NODE DISK ALERT
node_disk_usage = {
display_name = "${azurerm_kubernetes_cluster.this.name}-NODE-DISK-USAGE"
      description  = "Detects node disks that are about to run out of space"
query = <<-KQL
InsightsMetrics
| where _ResourceId == "${lower(azurerm_kubernetes_cluster.this.id)}"
| where TimeGenerated > ago(15m)
| where Namespace == "container.azm.ms/disk"
| where Name == "used_percent"
| project TimeGenerated, Computer, Val, Origin
| summarize AvgDiskUsage = avg(Val) by Computer
KQL
severity = 2
window_duration = "PT30M"
evaluation_frequency = "PT10M"
operator = "GreaterThan"
threshold = 90
time_aggregation_method = "Average"
resource_id_column = "AvgDiskUsage"
metric_measure_column = "AvgDiskUsage"
dimension = [
{
name = "Computer"
operator = "Include"
values = ["*"]
}
]
minimum_failing_periods_to_trigger_alert = 1
number_of_evaluation_periods = 1
auto_mitigation_enabled = true
workspace_alerts_storage_enabled = false
skip_query_validation = true
}
}
}

variable "custom_logs_alerts" {
description = <<EOD
Map of name = criteria objects
EOD

default = {}

type = map(object({
# (Optional) Specifies the display name of the alert rule.
display_name = string
# (Optional) Specifies the description of the scheduled query rule.
description = string
    # (Required) The Kusto Query Language (KQL) query to evaluate.
query = string
# (Required) Severity of the alert. Should be an integer between 0 and 4.
    # Value of 0 is the most severe.
severity = number
# (Required) Specifies the period of time in ISO 8601 duration format on
# which the Scheduled Query Rule will be executed (bin size).
# If evaluation_frequency is PT1M, possible values are PT1M, PT5M, PT10M,
# PT15M, PT30M, PT45M, PT1H, PT2H, PT3H, PT4H, PT5H, and PT6H. Otherwise,
# possible values are PT5M, PT10M, PT15M, PT30M, PT45M, PT1H, PT2H, PT3H,
# PT4H, PT5H, PT6H, P1D, and P2D.
window_duration = optional(string)
# (Optional) How often the scheduled query rule is evaluated, represented
# in ISO 8601 duration format. Possible values are PT1M, PT5M, PT10M, PT15M,
# PT30M, PT45M, PT1H, PT2H, PT3H, PT4H, PT5H, PT6H, P1D.
evaluation_frequency = string
# Evaluation operation for rule - 'GreaterThan', GreaterThanOrEqual',
# 'LessThan', or 'LessThanOrEqual'.
operator = string
# Result or count threshold based on which rule should be triggered.
# Values must be between 0 and 10000 inclusive.
threshold = number
# (Required) The type of aggregation to apply to the data points in
# aggregation granularity. Possible values are Average, Count, Maximum,
    # Minimum, and Total.
time_aggregation_method = string
# (Optional) Specifies the column containing the resource ID. The content
# of the column must be an uri formatted as resource ID.
resource_id_column = optional(string)

# (Optional) Specifies the column containing the metric measure number.
metric_measure_column = optional(string)

dimension = list(object(
{
# (Required) Name of the dimension.
name = string
# (Required) Operator for dimension values. Possible values are
      # Exclude, and Include.
operator = string
# (Required) List of dimension values. Use a wildcard * to collect all.
values = list(string)
}
))

# (Required) Specifies the number of violations to trigger an alert.
# Should be smaller or equal to number_of_evaluation_periods.
# Possible value is integer between 1 and 6.
minimum_failing_periods_to_trigger_alert = number
# (Required) Specifies the number of aggregated look-back points.
# The look-back time window is calculated based on the aggregation
# granularity window_duration and the selected number of aggregated points.
# Possible value is integer between 1 and 6.
number_of_evaluation_periods = number

# (Optional) Specifies the flag that indicates whether the alert should
# be automatically resolved or not. Value should be true or false.
# The default is false.
auto_mitigation_enabled = optional(bool)
# (Optional) Specifies the flag which indicates whether this scheduled
# query rule check if storage is configured. Value should be true or false.
# The default is false.
workspace_alerts_storage_enabled = optional(bool)
# (Optional) Specifies the flag which indicates whether the provided
# query should be validated or not. The default is false.
skip_query_validation = optional(bool)
}))
}


variable "action" {
description = "The ID of the Action Group and optional map of custom string properties to include with the post webhook operation."
type = set(object(
Expand All @@ -180,3 +302,7 @@ variable "alerts_enabled" {
locals {
metric_alerts = merge(var.default_metric_alerts, var.custom_metric_alerts)
}

locals {
log_alerts = merge(var.custom_logs_alerts, local.default_logs_alerts)
}
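One consequence of the merge order, noted here as an observation rather than a change in this commit: `merge()` gives precedence to its later arguments, so in `log_alerts` the built-in defaults win over same-named custom entries, whereas `metric_alerts` above merges customs last so they override the defaults. A tiny illustration:

```hcl
# merge() resolves key collisions in favor of later arguments
locals {
  merge_example = merge(
    { a = 1, b = 2 }, # earlier map: loses on key collisions
    { b = 3 },        # later map: wins; result is { a = 1, b = 3 }
  )
}
```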