diff --git a/helm/alloy/Chart.yaml b/helm/alloy/Chart.yaml
new file mode 100644
index 00000000..3ebd99ee
--- /dev/null
+++ b/helm/alloy/Chart.yaml
@@ -0,0 +1,30 @@
+apiVersion: v2
+name: alloy
+description: A Helm chart for deploying Grafana Alloy
+
+# A chart can be either an 'application' or a 'library' chart.
+#
+# Application charts are a collection of templates that can be packaged into versioned archives
+# to be deployed.
+#
+# Library charts provide useful utilities or functions for the chart developer. They're included as
+# a dependency of application charts to inject those utilities and functions into the rendering
+# pipeline. Library charts do not define any templates and therefore cannot be deployed.
+type: application
+
+# This is the chart version. This version number should be incremented each time you make changes
+# to the chart and its templates, including the app version.
+# Versions are expected to follow Semantic Versioning (https://semver.org/)
+version: 0.1.0
+
+# This is the version number of the application being deployed. This version number should be
+# incremented each time you make changes to the application. Versions are not expected to
+# follow Semantic Versioning. They should reflect the version the application is using.
+# It is recommended to use it with quotes.
+appVersion: "master"
+
+# Dependencies
+dependencies:
+  - name: alloy
+    version: "0.9.1"
+    repository: "https://grafana.github.io/helm-charts"
diff --git a/helm/alloy/SETUP.md b/helm/alloy/SETUP.md
new file mode 100644
index 00000000..0b3e1c74
--- /dev/null
+++ b/helm/alloy/SETUP.md
@@ -0,0 +1,55 @@
+# Grafana Alloy
+
+## Overview
+
+This guide describes how to deploy Grafana Alloy to your Kubernetes cluster using Helm. Grafana Alloy is an observability collector that gathers logs and metrics from your services and ships them to Grafana Loki and Mimir for storage and analysis. Deploying Alloy gives you insight into your system's performance, lets you track key metrics, and helps you troubleshoot issues efficiently.
+
+In this deployment, the Alloy ConfigMap controls which logs are collected for Loki and which metrics are gathered for Mimir. It also specifies the Loki and Mimir endpoints the data is sent to.
+
+Before deploying Alloy, first deploy the "observability" Helm chart, as it provides the components and configuration that Alloy depends on. Please refer to the observability chart's SETUP.md for instructions on how to set it up before proceeding with the Alloy deployment.
+
+## Configuring Alloy
+
+### Helm Chart Configuration
+
+The Alloy configuration is where you customize which logs are collected for Loki and which metrics are collected for Mimir. It also defines the endpoints that logs and metrics are sent to, ensuring that data is routed correctly for observability and analysis.
+
+In this configuration, replace the placeholder hostnames (`*.example.com`) with the actual Loki and Mimir hostnames configured in the "observability" Helm chart. This ensures that logs are sent to the correct Loki endpoint and metrics are forwarded to the appropriate Mimir endpoint. You can also fine-tune the alloyConfigmapData to suit your environment's needs, as sketched below.
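+
+As a quick orientation, here is a minimal sketch of where these settings live when installing the chart. The release name, namespace, file name, and hostnames in this sketch are illustrative assumptions, not values shipped with the chart:
+
+```yaml
+# override.yaml -- hypothetical override file for this chart.
+# Installed with something like:
+#   helm upgrade --install alloy ./helm/alloy -n monitoring -f override.yaml
+alloy:
+  # alloyConfigmapData holds the entire Alloy pipeline as a single multiline string,
+  # so copy the full block from this chart's values.yaml and edit it in place rather
+  # than overriding only a fragment of it.
+  alloyConfigmapData: |
+    // ... pipeline from values.yaml, unchanged except for the two write endpoints:
+    //   prometheus.remote_write "default" -> https://mimir.example.com/api/v1/push
+    //   loki.write "endpoint"             -> https://loki.example.com/loki/api/v1/push
+```
+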
Please click [here](https://grafana.com/docs/alloy/latest/reference/components/#components) to see in-depth documentation on how to do so. + +```yaml + // Write Endpoints + // prometheus write endpoint + prometheus.remote_write "default" { + external_labels = { + cluster = "{{ .Values.cluster }}", + project = "{{ .Values.project }}", + } + endpoint { + url = "https://mimir.example.com/api/v1/push" + + headers = { + "X-Scope-OrgID" = "anonymous", + } + + } + } + + // loki write endpoint + loki.write "endpoint" { + external_labels = { + cluster = "{{ .Values.cluster }}", + project = "{{ .Values.project }}", + } + endpoint { + url = "https://loki.example.com/loki/api/v1/push" + } + } +``` +### Helm Chart Links +The link below will take you to the Grafana Alloy chart, providing a comprehensive list of configurable options to help you further customize your setup. + +[Alloy Helm Chart](https://github.com/grafana/alloy/blob/main/operations/helm/charts/alloy/values.yaml) + +--- + +By following this guide, you'll successfully configure Alloy to send logs and metrics to Grafana Loki and Mimir. The setup will ensure that Alloy collects the necessary observability data from your environment and forwards logs to Loki and metrics to Mimir for analysis and storage. This configuration will allow you to monitor your system's logs and metrics efficiently through Grafana. \ No newline at end of file diff --git a/helm/alloy/templates/alloy-config.yaml b/helm/alloy/templates/alloy-config.yaml new file mode 100644 index 00000000..0bf02875 --- /dev/null +++ b/helm/alloy/templates/alloy-config.yaml @@ -0,0 +1,9 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: alloy-gen3 +data: + config: | + {{- with .Values.alloy.alloyConfigmapData }} + {{- toYaml . | nindent 4 }} + {{ end }} \ No newline at end of file diff --git a/helm/alloy/values.yaml b/helm/alloy/values.yaml new file mode 100644 index 00000000..146cb8ea --- /dev/null +++ b/helm/alloy/values.yaml @@ -0,0 +1,445 @@ +alloy: + controller: + type: "deployment" + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/zone + operator: In + values: + - us-east-1a + + alloy: + stabilityLevel: "public-preview" + uiPathPrefix: /alloy + # -- Extra ports to expose on the Alloy container. + extraPorts: + - name: "otel-grpc" + port: 4317 + targetPort: 4317 + protocol: "TCP" + - name: "otel-http" + port: 4318 + targetPort: 4318 + protocol: "TCP" + clustering: + enabled: true + configMap: + name: alloy-gen3 + key: config + resources: + requests: + cpu: 1000m + memory: 1Gi + + alloyConfigmapData: | + logging { + level = "info" + format = "json" + write_to = [loki.write.endpoint.receiver] + } + + /////////////////////// OTLP START /////////////////////// + + otelcol.receiver.otlp "default" { + grpc {} + http {} + + output { + metrics = [otelcol.processor.batch.default.input] + traces = [otelcol.processor.batch.default.input] + } + } + + otelcol.processor.batch "default" { + output { + metrics = [otelcol.exporter.prometheus.default.input] + traces = [otelcol.exporter.otlp.tempo.input] + } + } + + otelcol.exporter.prometheus "default" { + forward_to = [prometheus.remote_write.default.receiver] + } + + otelcol.exporter.otlp "tempo" { + client { + endpoint = "http://monitoring-tempo-distributor.monitoring:4317" + // Configure TLS settings for communicating with the endpoint. + tls { + // The connection is insecure. 
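+        // (These insecure TLS settings are meant for the in-cluster hop to the Tempo
+        // distributor referenced above, assumed here to be the observability chart's
+        // Tempo running in the "monitoring" namespace; adjust the endpoint and these
+        // options if your distributor lives elsewhere or terminates TLS.)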
+ insecure = true + // Do not verify TLS certificates when connecting. + insecure_skip_verify = true + } + } + } + + + /////////////////////// OTLP END /////////////////////// + + // discover all pods, to be used later in this config + discovery.kubernetes "pods" { + role = "pod" + } + + // discover all services, to be used later in this config + discovery.kubernetes "services" { + role = "service" + } + + // discover all nodes, to be used later in this config + discovery.kubernetes "nodes" { + role = "node" + } + + // Generic scrape of any pod with Annotation "prometheus.io/scrape: true" + discovery.relabel "annotation_autodiscovery_pods" { + targets = discovery.kubernetes.pods.targets + rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"] + regex = "true" + action = "keep" + } + rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_job"] + action = "replace" + target_label = "job" + } + rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_instance"] + action = "replace" + target_label = "instance" + } + rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_path"] + action = "replace" + target_label = "__metrics_path__" + } + + // Choose the pod port + // The discovery generates a target for each declared container port of the pod. + // If the metricsPortName annotation has value, keep only the target where the port name matches the one of the annotation. + rule { + source_labels = ["__meta_kubernetes_pod_container_port_name"] + target_label = "__tmp_port" + } + rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_portName"] + regex = "(.+)" + target_label = "__tmp_port" + } + rule { + source_labels = ["__meta_kubernetes_pod_container_port_name"] + action = "keepequal" + target_label = "__tmp_port" + } + + // If the metrics port number annotation has a value, override the target address to use it, regardless whether it is + // one of the declared ports on that Pod. 
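+      // For example (hypothetical values), a pod opts in to this autodiscovery with
+      // annotations such as:
+      //   prometheus.io/scrape: "true"
+      //   prometheus.io/path: "/metrics"
+      //   prometheus.io/port: "8080"
+      // The two rules below then rewrite __address__ from that port annotation.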
+ rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_port", "__meta_kubernetes_pod_ip"] + regex = "(\\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})" + replacement = "[$2]:$1" // IPv6 + target_label = "__address__" + } + rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_port", "__meta_kubernetes_pod_ip"] + regex = "(\\d+);((([0-9]+?)(\\.|$)){4})" // IPv4, takes priority over IPv6 when both exists + replacement = "$2:$1" + target_label = "__address__" + } + + rule { + source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scheme"] + action = "replace" + target_label = "__scheme__" + } + + + // add labels + rule { + source_labels = ["__meta_kubernetes_pod_name"] + target_label = "pod" + } + rule { + source_labels = ["__meta_kubernetes_pod_container_name"] + target_label = "container" + } + rule { + source_labels = ["__meta_kubernetes_pod_controller_name"] + target_label = "controller" + } + + rule { + source_labels = ["__meta_kubernetes_namespace"] + target_label = "namespace" + } + + + rule { + source_labels = ["__meta_kubernetes_pod_label_app"] + target_label = "app" + } + + // map all labels + rule { + action = "labelmap" + regex = "__meta_kubernetes_pod_label_(.+)" + } + } + + // Generic scrape of any service with + // Annotation Autodiscovery + discovery.relabel "annotation_autodiscovery_services" { + targets = discovery.kubernetes.services.targets + rule { + source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_scrape"] + regex = "true" + action = "keep" + } + rule { + source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_job"] + action = "replace" + target_label = "job" + } + rule { + source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_instance"] + action = "replace" + target_label = "instance" + } + rule { + source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_path"] + action = "replace" + target_label = "__metrics_path__" + } + + // Choose the service port + rule { + source_labels = ["__meta_kubernetes_service_port_name"] + target_label = "__tmp_port" + } + rule { + source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_portName"] + regex = "(.+)" + target_label = "__tmp_port" + } + rule { + source_labels = ["__meta_kubernetes_service_port_name"] + action = "keepequal" + target_label = "__tmp_port" + } + + rule { + source_labels = ["__meta_kubernetes_service_port_number"] + target_label = "__tmp_port" + } + rule { + source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_port"] + regex = "(.+)" + target_label = "__tmp_port" + } + rule { + source_labels = ["__meta_kubernetes_service_port_number"] + action = "keepequal" + target_label = "__tmp_port" + } + + rule { + source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_scheme"] + action = "replace" + target_label = "__scheme__" + } + } + + prometheus.scrape "metrics" { + job_name = "integrations/autodiscovery_metrics" + targets = concat(discovery.relabel.annotation_autodiscovery_pods.output, discovery.relabel.annotation_autodiscovery_services.output) + honor_labels = true + clustering { + enabled = true + } + forward_to = [prometheus.relabel.metrics_service.receiver] + } + + + // Node Exporter + // TODO: replace with https://grafana.com/docs/alloy/latest/reference/components/prometheus.exporter.unix/ + discovery.relabel "node_exporter" { + targets = discovery.kubernetes.pods.targets + rule { + source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_instance"] + 
regex = "monitoring-extras" + action = "keep" + } + rule { + source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"] + regex = "node-exporter" + action = "keep" + } + rule { + source_labels = ["__meta_kubernetes_pod_node_name"] + action = "replace" + target_label = "instance" + } + } + + prometheus.scrape "node_exporter" { + job_name = "integrations/node_exporter" + targets = discovery.relabel.node_exporter.output + scrape_interval = "60s" + clustering { + enabled = true + } + forward_to = [prometheus.relabel.node_exporter.receiver] + } + + prometheus.relabel "node_exporter" { + rule { + source_labels = ["__name__"] + regex = "up|node_cpu.*|node_network.*|node_exporter_build_info|node_filesystem.*|node_memory.*|process_cpu_seconds_total|process_resident_memory_bytes" + action = "keep" + } + forward_to = [prometheus.relabel.metrics_service.receiver] + } + + // Logs from all pods + discovery.relabel "all_pods" { + targets = discovery.kubernetes.pods.targets + rule { + source_labels = ["__meta_kubernetes_namespace"] + target_label = "namespace" + } + rule { + source_labels = ["__meta_kubernetes_pod_name"] + target_label = "pod" + } + rule { + source_labels = ["__meta_kubernetes_pod_container_name"] + target_label = "container" + } + rule { + source_labels = ["__meta_kubernetes_pod_controller_name"] + target_label = "controller" + } + + rule { + source_labels = ["__meta_kubernetes_pod_label_app"] + target_label = "app" + } + + // map all labels + rule { + action = "labelmap" + regex = "__meta_kubernetes_pod_label_(.+)" + } + + } + + loki.source.kubernetes "pods" { + targets = discovery.relabel.all_pods.output + forward_to = [loki.write.endpoint.receiver] + } + + // kube-state-metrics + discovery.relabel "relabel_kube_state_metrics" { + targets = discovery.kubernetes.services.targets + rule { + source_labels = ["__meta_kubernetes_namespace"] + regex = "monitoring" + action = "keep" + } + rule { + source_labels = ["__meta_kubernetes_service_name"] + regex = "monitoring-extras-kube-state-metrics" + action = "keep" + } + } + + prometheus.scrape "kube_state_metrics" { + targets = discovery.relabel.relabel_kube_state_metrics.output + job_name = "kube-state-metrics" + metrics_path = "/metrics" + forward_to = [prometheus.remote_write.default.receiver] + } + + // Kubelet + discovery.relabel "kubelet" { + targets = discovery.kubernetes.nodes.targets + rule { + target_label = "__address__" + replacement = "kubernetes.default.svc.cluster.local:443" + } + rule { + source_labels = ["__meta_kubernetes_node_name"] + regex = "(.+)" + replacement = "/api/v1/nodes/${1}/proxy/metrics" + target_label = "__metrics_path__" + } + } + + prometheus.scrape "kubelet" { + job_name = "integrations/kubernetes/kubelet" + targets = discovery.relabel.kubelet.output + scheme = "https" + scrape_interval = "60s" + bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token" + tls_config { + insecure_skip_verify = true + } + clustering { + enabled = true + } + forward_to = [prometheus.relabel.kubelet.receiver] + } + + prometheus.relabel "kubelet" { + rule { + source_labels = ["__name__"] + regex = 
"up|container_cpu_usage_seconds_total|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_certificate_manager_client_ttl_seconds|kubelet_certificate_manager_server_ttl_seconds|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubernetes_build_info|namespace_workload_pod|rest_client_requests_total|storage_operation_duration_seconds_count|storage_operation_errors_total|volume_manager_total_volumes" + action = "keep" + } + forward_to = [prometheus.relabel.metrics_service.receiver] + } + + // Cluster Events + loki.source.kubernetes_events "cluster_events" { + job_name = "integrations/kubernetes/eventhandler" + log_format = "logfmt" + forward_to = [loki.write.endpoint.receiver] + } + + prometheus.relabel "metrics_service" { + forward_to = [prometheus.remote_write.default.receiver] + } + + + // Write Endpoints + // prometheus write endpoint + prometheus.remote_write "default" { + external_labels = { + cluster = "{{ .Values.cluster }}", + project = "{{ .Values.project }}", + } + endpoint { + url = "https://mimir.example.com/api/v1/push" + + headers = { + "X-Scope-OrgID" = "anonymous", + } + + } + } + + // loki write endpoint + loki.write "endpoint" { + external_labels = { + cluster = "{{ .Values.cluster }}", + project = "{{ .Values.project }}", + } + endpoint { + url = "https://loki.example.com/loki/api/v1/push" + } + } \ No newline at end of file diff --git a/helm/ambassador/README.md b/helm/ambassador/README.md index 8f658431..2e684849 100644 --- a/helm/ambassador/README.md +++ b/helm/ambassador/README.md @@ -60,5 +60,3 @@ A Helm chart for deploying ambassador for gen3 | tolerations | list | `[]` | Tolerations to use for the deployment. | | userNamespace | string | `"jupyter-pods"` | Namespace to use for user resources. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/arborist/README.md b/helm/arborist/README.md index 5ccdc021..74cb57d6 100644 --- a/helm/arborist/README.md +++ b/helm/arborist/README.md @@ -105,5 +105,3 @@ A Helm chart for gen3 arborist | volumeMounts | list | `[]` | Volume mounts to attach to the container | | volumes | list | `[]` | Volumes to attach to the pod | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/argo-wrapper/README.md b/helm/argo-wrapper/README.md index b7b131ec..d6ce7750 100644 --- a/helm/argo-wrapper/README.md +++ b/helm/argo-wrapper/README.md @@ -64,5 +64,3 @@ A Helm chart for gen3 Argo Wrapper Service | volumeMounts | list | `[{"mountPath":"/argo.json","name":"argo-config","readOnly":true,"subPath":"argo.json"}]` | Volumes to mount to the pod. 
| | volumes | list | `[{"configMap":{"items":[{"key":"argo.json","path":"argo.json"}],"name":"manifest-argo"},"name":"argo-config"}]` | Volumes to attach to the pod. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/audit/README.md b/helm/audit/README.md index 15ee8891..8d4ffa2c 100644 --- a/helm/audit/README.md +++ b/helm/audit/README.md @@ -123,5 +123,3 @@ A Helm chart for Kubernetes | volumeMounts | list | `[]` | Volumes to mount to the container. | | volumes | list | `[]` | Volumes to attach to the container. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/aws-es-proxy/README.md b/helm/aws-es-proxy/README.md index 2d5cbba8..873a0e41 100644 --- a/helm/aws-es-proxy/README.md +++ b/helm/aws-es-proxy/README.md @@ -67,5 +67,3 @@ A Helm chart for AWS ES Proxy Service for gen3 | volumeMounts | list | `[{"mountPath":"/root/.aws","name":"credentials","readOnly":true}]` | Volumes to mount to the pod. | | volumes | list | `nil` | Volumes to attach to the pod | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/common/README.md b/helm/common/README.md index 61cc086b..75e6a5d7 100644 --- a/helm/common/README.md +++ b/helm/common/README.md @@ -29,5 +29,3 @@ A Helm chart for provisioning databases in gen3 | global.revproxyArn | string | `"arn:aws:acm:us-east-1:123456:certificate"` | ARN of the reverse proxy certificate. | | global.tierAccessLevel | string | `"libre"` | Access level for tiers. acceptable values for `tier_access_level` are: `libre`, `regular` and `private`. If omitted, by default common will be treated as `private` | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/dicom-server/README.md b/helm/dicom-server/README.md index 7a1fa3a5..f95924f0 100644 --- a/helm/dicom-server/README.md +++ b/helm/dicom-server/README.md @@ -53,5 +53,3 @@ A Helm chart for gen3 Dicom Server | volumeMounts | list | `[{"mountPath":"/etc/orthanc/orthanc_config_overwrites.json","name":"config-volume-g3auto","readOnly":true,"subPath":"orthanc_config_overwrites.json"}]` | Volumes to mount to the pod. | | volumes | list | `[{"name":"config-volume-g3auto","secret":{"secretName":"orthanc-g3auto"}}]` | Volumes to attach to the pod. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/dicom-viewer/README.md b/helm/dicom-viewer/README.md index e4e6ddb5..28eec517 100644 --- a/helm/dicom-viewer/README.md +++ b/helm/dicom-viewer/README.md @@ -40,5 +40,3 @@ A Helm chart for gen3 Dicom Viewer | service.port | int | `80` | The port number that the service exposes. | | service.type | string | `"ClusterIP"` | Type of service. Valid values are "ClusterIP", "NodePort", "LoadBalancer", "ExternalName". 
| ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/etl/README.md b/helm/etl/README.md index 9c1bac65..9d9640e3 100644 --- a/helm/etl/README.md +++ b/helm/etl/README.md @@ -103,5 +103,3 @@ A Helm chart for gen3 etl | resources.tube.requests.cpu | string | `0.3` | The amount of CPU requested | | resources.tube.requests.memory | string | `"128Mi"` | The amount of memory requested | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/faro-collector/Chart.yaml b/helm/faro-collector/Chart.yaml new file mode 100644 index 00000000..3ebd99ee --- /dev/null +++ b/helm/faro-collector/Chart.yaml @@ -0,0 +1,30 @@ +apiVersion: v2 +name: alloy +description: A Helm chart for deploying Grafana Alloy + +# A chart can be either an 'application' or a 'library' chart. +# +# Application charts are a collection of templates that can be packaged into versioned archives +# to be deployed. +# +# Library charts provide useful utilities or functions for the chart developer. They're included as +# a dependency of application charts to inject those utilities and functions into the rendering +# pipeline. Library charts do not define any templates and therefore cannot be deployed. +type: application + +# This is the chart version. This version number should be incremented each time you make changes +# to the chart and its templates, including the app version. +# Versions are expected to follow Semantic Versioning (https://semver.org/) +version: 0.1.0 + +# This is the version number of the application being deployed. This version number should be +# incremented each time you make changes to the application. Versions are not expected to +# follow Semantic Versioning. They should reflect the version the application is using. +# It is recommended to use it with quotes. +appVersion: "master" + +# Dependencies +dependencies: + - name: alloy + version: "0.9.1" + repository: "https://grafana.github.io/helm-charts" diff --git a/helm/faro-collector/SETUP.md b/helm/faro-collector/SETUP.md new file mode 100644 index 00000000..85f024eb --- /dev/null +++ b/helm/faro-collector/SETUP.md @@ -0,0 +1,154 @@ +# Grafana Alloy and Faro + +## Overview + +This guide provides a step-by-step approach to configuring an Alloy instance to collect Grafana Faro logs sent over the internet, similar to Real User Monitoring (RUM). The Portal service generates Faro logs, which Alloy collects and forwards to Loki for storage and analysis in Grafana. Additionally, this guide explains how to enable metrics in the Fence service and adjust the Faro URL in the Gen3 Portal configuration to route metrics to your Alloy instance. Future updates will enable more Gen3 services to offer metric collection. + +Before deploying Alloy, it is important to first deploy the "observability" Helm chart, as it provides the necessary components and configuration for Alloy to function properly. Please refer to the observability chart documentation for instructions on how to set it up before proceeding with the Alloy deployment. + +### Why Does Faro Require an Internet-Facing Ingress? + +Grafana Faro collects Real User Monitoring (RUM) data, such as performance metrics, errors, and user interactions, via the Fence and Portal services. This data is sent from user devices to the backend, which in this case is Alloy. 
To enable this communication, an internet-facing ingress is required to expose the Faro endpoint to the public, allowing users' browsers to send RUM data to the Alloy instance over the internet.
+
+## Configuring Alloy for Faro Logs
+
+### Helm Chart Configuration
+
+The ingress is configured with AWS ALB (Application Load Balancer) to expose the Alloy Faro port (12347) to the internet. The `alb.ingress.kubernetes.io/scheme` annotation ensures that the ALB is internet-facing, allowing users to send logs from their browsers to Alloy.
+
+When configuring the Faro collector, update the `hosts` section of the values.yaml file to match the hostname you plan to use for the Faro collector. For example, replace "faro.example.com" with your desired hostname.
+
+Additionally, it is highly recommended that you uncomment and adjust the AWS ALB (Application Load Balancer) annotations in values.yaml to fit your environment. These annotations ensure proper configuration of the load balancer, SSL certificates, and other key settings. For instance, replace the placeholder values such as "cert arn", "ssl policy", and "environment name" with your specific details. Note that all of these settings are nested under this chart's `alloy` key, as shown below.
+
+```yaml
+alloy:
+  alloy:
+    extraPorts:
+      - name: "faro"
+        port: 12347
+        targetPort: 12347
+        protocol: "TCP"
+    clustering:
+      enabled: true
+    configMap:
+      name: alloy-gen3
+      key: config
+
+  ingress:
+    enabled: true
+    ingressClassName: "alb"
+    annotations:
+      alb.ingress.kubernetes.io/certificate-arn:
+      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
+      alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600
+      alb.ingress.kubernetes.io/scheme: internet-facing
+      alb.ingress.kubernetes.io/ssl-policy:
+      alb.ingress.kubernetes.io/ssl-redirect: '443'
+      alb.ingress.kubernetes.io/tags: Environment=
+      alb.ingress.kubernetes.io/target-type: ip
+    labels: {}
+    path: /
+    faroPort: 12347
+    hosts:
+      - faro.example.com
+
+  alloyConfigmapData: |
+    logging {
+      level = "info"
+      format = "json"
+    }
+
+    otelcol.exporter.otlp "tempo" {
+      client {
+        endpoint = "http://grafana-tempo-distributor.monitoring:4317"
+        tls {
+          insecure = true
+          insecure_skip_verify = true
+        }
+      }
+    }
+
+    loki.write "endpoint" {
+      endpoint {
+        url = "http://grafana-loki-gateway.monitoring:80/loki/api/v1/push"
+      }
+    }
+
+    faro.receiver "default" {
+      server {
+        listen_address = "0.0.0.0"
+        listen_port = 12347
+        cors_allowed_origins = ["*"]
+      }
+
+      extra_log_labels = {
+        service = "frontend-app",
+        app_name = "",
+        app_environment = "",
+        app_namespace = "",
+        app_version = "",
+      }
+      output {
+        logs = [loki.write.endpoint.receiver]
+        traces = [otelcol.exporter.otlp.tempo.input]
+      }
+    }
+```
+
+### Helm Chart Links
+The link below will take you to the Grafana Alloy chart, providing a comprehensive list of configurable options to help you further customize your setup.
+
+[Alloy Helm Chart](https://github.com/grafana/alloy/blob/main/operations/helm/charts/alloy/values.yaml)
+
+---
+
+## Enabling Faro Metrics in Fence
+
+Fence now has built-in Faro metrics. To enable these metrics, you must update your Fence deployment.
+ +*** Note: you must be using Fence version 10.2.0 or later + +### Step 1: Enable Prometheus Metrics in the Fence Pod + +Update your Fence deployment with the following annotations to allow Prometheus to scrape the metrics: + +```yaml +fence: + podAnnotations: + prometheus.io/path: /metrics + prometheus.io/scrape: "true" +``` + +### Step 2: Enable Metrics in the Fence Configuration + +Modify the FENCE_CONFIG_PUBLIC section to enable Prometheus metrics: + +```yaml +fence: + FENCE_CONFIG_PUBLIC: + ENABLE_PROMETHEUS_METRICS: true + ENABLE_DB_MIGRATION: true +``` + +--- + +## Updating Faro URL in Gen3 Portal + +If you need to change the Faro URL that metrics are sent to, you will need to update the "grafanaFaroUrl" field by modifying the "gitops.json" value in your values.yaml. You can refer to [this link](https://github.com/uc-cdis/data-portal/blob/master/docs/portal_config.md) for more information. + +```yaml +portal: + # -- (map) GitOps configuration for portal + gitops: + # -- (string) multiline string - gitops.json + json: | + { + "grafanaFaroConfig": { + "grafanaFaroEnable": true, // optional; flag to turn on Grafana Faro RUM, default to false + "grafanaFaroNamespace": "DEV", // optional; the Grafana Faro RUM option specifying the application’s namespace, for example: prod, pre-prod, staging, etc. Can be determined automatically if omitted. But it is highly recommended to customize it to include project information, such as 'healprod' + "grafanaFaroUrl": "", // optional: the Grafana Faro collector url. Defaults to https://faro.example.com/collect + "grafanaFaroSampleRate": 1, // optional; numeric; the Grafana Faro option specifying the percentage of sessions to track: 1 for all, 0 for none. Default to 1 if omitted + }, +``` +--- + +By following this guide, you'll have successfully set up Alloy to receive Grafana Faro logs and metrics while exposing the service over the internet using Kubernetes ingress. You’ll also be able to monitor Faro metrics through Fence and make necessary configurations in Gen3 Portal for seamless Faro integration. \ No newline at end of file diff --git a/helm/faro-collector/templates/alloy-config.yaml b/helm/faro-collector/templates/alloy-config.yaml new file mode 100644 index 00000000..0bf02875 --- /dev/null +++ b/helm/faro-collector/templates/alloy-config.yaml @@ -0,0 +1,9 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: alloy-gen3 +data: + config: | + {{- with .Values.alloy.alloyConfigmapData }} + {{- toYaml . | nindent 4 }} + {{ end }} \ No newline at end of file diff --git a/helm/faro-collector/values.yaml b/helm/faro-collector/values.yaml new file mode 100644 index 00000000..90326bc9 --- /dev/null +++ b/helm/faro-collector/values.yaml @@ -0,0 +1,77 @@ +alloy: + alloy: + extraPorts: + - name: "faro" + port: 12347 + targetPort: 12347 + protocol: "TCP" + clustering: + enabled: true + configMap: + name: alloy-gen3 + key: config + + ingress: + # -- Enables ingress for Alloy (Faro port) + enabled: true + # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName + # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress + ingressClassName: "alb" + annotations: {} + ## Recommended annotations for AWS ALB (Application Load Balancer). 
+ # alb.ingress.kubernetes.io/certificate-arn: + # alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]' + # alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600 + # alb.ingress.kubernetes.io/scheme: internet-facing + # alb.ingress.kubernetes.io/ssl-policy: + # alb.ingress.kubernetes.io/ssl-redirect: '443' + # alb.ingress.kubernetes.io/tags: Environment= + # alb.ingress.kubernetes.io/target-type: ip + labels: {} + path: / + faroPort: 12347 + hosts: + - faro.example.com + + alloyConfigmapData: | + logging { + level = "info" + format = "json" + } + + otelcol.exporter.otlp "tempo" { + client { + endpoint = "http://grafana-tempo-distributor.monitoring:4317" + tls { + insecure = true + insecure_skip_verify = true + } + } + } + + // loki write endpoint + loki.write "endpoint" { + endpoint { + url = "http://grafana-loki-gateway.monitoring:80/loki/api/v1/push" + } + } + + faro.receiver "default" { + server { + listen_address = "0.0.0.0" + listen_port = 12347 + cors_allowed_origins = ["*"] + } + + extra_log_labels = { + service = "frontend-app", + app_name = "", + app_environment = "", + app_namespace = "", + app_version = "", + } + output { + logs = [loki.write.endpoint.receiver] + traces = [otelcol.exporter.otlp.tempo.input] + } + } \ No newline at end of file diff --git a/helm/fence/README.md b/helm/fence/README.md index 25586f37..ea03a462 100644 --- a/helm/fence/README.md +++ b/helm/fence/README.md @@ -199,5 +199,3 @@ A Helm chart for gen3 Fence | volumeMounts | list | `[{"mountPath":"/var/www/fence/local_settings.py","name":"old-config-volume","readOnly":true,"subPath":"local_settings.py"},{"mountPath":"/var/www/fence/fence_credentials.json","name":"json-secret-volume","readOnly":true,"subPath":"fence_credentials.json"},{"mountPath":"/var/www/fence/creds.json","name":"creds-volume","readOnly":true,"subPath":"creds.json"},{"mountPath":"/var/www/fence/config_helper.py","name":"config-helper","readOnly":true,"subPath":"config_helper.py"},{"mountPath":"/fence/fence/static/img/logo.svg","name":"logo-volume","readOnly":true,"subPath":"logo.svg"},{"mountPath":"/fence/fence/static/privacy_policy.md","name":"privacy-policy","readOnly":true,"subPath":"privacy_policy.md"},{"mountPath":"/var/www/fence/fence-config.yaml","name":"config-volume","readOnly":true,"subPath":"fence-config.yaml"},{"mountPath":"/var/www/fence/yaml_merge.py","name":"yaml-merge","readOnly":true,"subPath":"yaml_merge.py"},{"mountPath":"/var/www/fence/fence_google_app_creds_secret.json","name":"fence-google-app-creds-secret-volume","readOnly":true,"subPath":"fence_google_app_creds_secret.json"},{"mountPath":"/var/www/fence/fence_google_storage_creds_secret.json","name":"fence-google-storage-creds-secret-volume","readOnly":true,"subPath":"fence_google_storage_creds_secret.json"},{"mountPath":"/fence/keys/key/jwt_private_key.pem","name":"fence-jwt-keys","readOnly":true,"subPath":"jwt_private_key.pem"}]` | Volumes to mount to the container. 
| | volumes | list | `[{"name":"old-config-volume","secret":{"secretName":"fence-secret"}},{"name":"json-secret-volume","secret":{"optional":true,"secretName":"fence-json-secret"}},{"name":"creds-volume","secret":{"secretName":"fence-creds"}},{"configMap":{"name":"config-helper","optional":true},"name":"config-helper"},{"configMap":{"name":"logo-config"},"name":"logo-volume"},{"name":"config-volume","secret":{"secretName":"fence-config"}},{"name":"fence-google-app-creds-secret-volume","secret":{"secretName":"fence-google-app-creds-secret"}},{"name":"fence-google-storage-creds-secret-volume","secret":{"secretName":"fence-google-storage-creds-secret"}},{"name":"fence-jwt-keys","secret":{"secretName":"fence-jwt-keys"}},{"configMap":{"name":"privacy-policy"},"name":"privacy-policy"},{"configMap":{"name":"fence-yaml-merge","optional":true},"name":"yaml-merge"}]` | Volumes to attach to the container. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/frontend-framework/README.md b/helm/frontend-framework/README.md index 2f264174..8c515bb3 100644 --- a/helm/frontend-framework/README.md +++ b/helm/frontend-framework/README.md @@ -93,5 +93,3 @@ A Helm chart for the gen3 frontend framework | strategy.rollingUpdate.maxUnavailable | int | `"25%"` | Maximum amount of pods that can be unavailable during the update. | | tolerations | list | `[]` | Tolerations to apply to the pod | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/gen3/README.md b/helm/gen3/README.md index e94e4a43..7be42ca8 100644 --- a/helm/gen3/README.md +++ b/helm/gen3/README.md @@ -164,5 +164,3 @@ Helm chart to deploy Gen3 Data Commons | ssjdispatcher.enabled | bool | `false` | Whether to deploy the ssjdispatcher subchart. | | wts.enabled | bool | `true` | Whether to deploy the wts subchart. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/guppy/README.md b/helm/guppy/README.md index b872e6d4..7cf3ec1c 100644 --- a/helm/guppy/README.md +++ b/helm/guppy/README.md @@ -96,5 +96,3 @@ A Helm chart for gen3 Guppy Service | volumeMounts | list | `[{"mountPath":"/guppy/guppy_config.json","name":"guppy-config","readOnly":true,"subPath":"guppy_config.json"}]` | Volumes to mount to the container. | | volumes | list | `[{"configMap":{"items":[{"key":"guppy_config.json","path":"guppy_config.json"}],"name":"manifest-guppy"},"name":"guppy-config"}]` | Volumes to attach to the pod. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/hatchery/README.md b/helm/hatchery/README.md index 4bce3c89..3ebadfc2 100644 --- a/helm/hatchery/README.md +++ b/helm/hatchery/README.md @@ -86,5 +86,3 @@ A Helm chart for gen3 Hatchery | volumeMounts | list | `[{"mountPath":"/hatchery.json","name":"hatchery-config","readOnly":true,"subPath":"json"}]` | Volumes to mount to the container. | | volumes | list | `[{"configMap":{"name":"manifest-hatchery"},"name":"hatchery-config"}]` | Volumes to attach to the container. 
| ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/indexd/README.md b/helm/indexd/README.md index 2f81977e..8d7057cb 100644 --- a/helm/indexd/README.md +++ b/helm/indexd/README.md @@ -107,5 +107,3 @@ A Helm chart for gen3 indexd | volumeMounts | list | `[{"mountPath":"/var/www/indexd/local_settings.py","name":"config-volume","readOnly":true,"subPath":"local_settings.py"}]` | Volumes to mount to the container. | | volumes | list | `[{"configMap":{"name":"indexd-uwsgi"},"name":"uwsgi-config"},{"name":"config-volume","secret":{"secretName":"indexd-settings"}}]` | Volumes to attach to the pod | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/lgtm-distributed/Chart.yaml b/helm/lgtm-distributed/Chart.yaml deleted file mode 100644 index 1c6e5d9a..00000000 --- a/helm/lgtm-distributed/Chart.yaml +++ /dev/null @@ -1,62 +0,0 @@ ---- -apiVersion: v2 -name: lgtm-distributed -description: Umbrella chart for a distributed Loki, Grafana, Tempo and Mimir stack -type: application -version: 1.0.1 -appVersion: "6.59.4" - -home: https://grafana.com/oss/ -icon: https://artifacthub.io/image/b4fed1a7-6c8f-4945-b99d-096efa3e4116 - -sources: - - https://grafana.github.io/helm-charts - - https://github.com/grafana/grafana - - https://github.com/grafana/loki - - https://github.com/grafana/mimir - - https://github.com/grafana/tempo - -keywords: - - monitoring - - traces - - metrics - - logs - -annotations: - "artifacthub.io/license": Apache-2.0 - "artifacthub.io/links": | - - name: Chart Source - url: https://github.com/grafana/helm-charts - - name: Grafana - url: https://github.com/grafana/grafana - - name: Loki - url: https://github.com/grafana/loki - - name: Mimir - url: https://github.com/grafana/mimir - - name: Tempo - url: https://github.com/grafana/tempo - -maintainers: - - name: timberhill - -dependencies: - - name: grafana - alias: grafana - condition: grafana.enabled - repository: https://grafana.github.io/helm-charts - version: "^7.3.9" - - name: loki-distributed - alias: loki - condition: loki.enabled - repository: "https://grafana.github.io/helm-charts" - version: "^0.74.3" - - name: mimir-distributed - alias: mimir - condition: mimir.enabled - repository: "https://grafana.github.io/helm-charts" - version: "^5.3.0" - - name: tempo-distributed - alias: tempo - condition: tempo.enabled - repository: "https://grafana.github.io/helm-charts" - version: "^1.9.9" diff --git a/helm/lgtm-distributed/README.md b/helm/lgtm-distributed/README.md deleted file mode 100644 index 2c00f225..00000000 --- a/helm/lgtm-distributed/README.md +++ /dev/null @@ -1,233 +0,0 @@ -# lgtm-distributed - -![Version: 1.0.1](https://img.shields.io/badge/Version-1.0.1-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 6.59.4](https://img.shields.io/badge/AppVersion-6.59.4-informational?style=flat-square) - -Umbrella chart for a distributed Loki, Grafana, Tempo and Mimir stack - -**Homepage:** - -## Maintainers - -| Name | Email | Url | -| ---- | ------ | --- | -| timberhill | | | - -## Source Code - -* -* -* -* -* - -## Requirements - -| Repository | Name | Version | -|------------|------|---------| -| https://grafana.github.io/helm-charts | grafana(grafana) | ^7.3.9 | -| 
https://grafana.github.io/helm-charts | loki(loki-distributed) | ^0.74.3 | -| https://grafana.github.io/helm-charts | mimir(mimir-distributed) | ^5.3.0 | -| https://grafana.github.io/helm-charts | tempo(tempo-distributed) | ^1.9.9 | - -## Values - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| grafana.alerting."contactpoints.yaml".secret.apiVersion | int | `1` | | -| grafana.alerting."contactpoints.yaml".secret.contactPoints[0].name | string | `"slack"` | | -| grafana.alerting."contactpoints.yaml".secret.contactPoints[0].orgId | int | `1` | | -| grafana.alerting."contactpoints.yaml".secret.contactPoints[0].receivers[0].settings.group | string | `"slack"` | | -| grafana.alerting."contactpoints.yaml".secret.contactPoints[0].receivers[0].settings.summary | string | `"{{ `{{ include \"default.message\" . }}` }}\n"` | | -| grafana.alerting."contactpoints.yaml".secret.contactPoints[0].receivers[0].settings.url | string | `"https://hooks.slack.com/services/XXXXXXXXXX"` | | -| grafana.alerting."contactpoints.yaml".secret.contactPoints[0].receivers[0].type | string | `"Slack"` | | -| grafana.alerting."contactpoints.yaml".secret.contactPoints[0].receivers[0].uid | string | `"first_uid"` | | -| grafana.alerting."rules.yaml".apiVersion | int | `1` | | -| grafana.alerting."rules.yaml".groups[0].folder | string | `"Alerts"` | | -| grafana.alerting."rules.yaml".groups[0].interval | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].name | string | `"Alerts"` | | -| grafana.alerting."rules.yaml".groups[0].orgId | int | `1` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].annotations.summary | string | `"Alert: HTTP 500 errors detected in the environment: {{`{{ $labels.clusters }}`}}"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].condition | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].datasourceUid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.datasource.type | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.datasource.uid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.editorMode | string | `"code"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.expr | string | `"sum by (cluster) (count_over_time({cluster=~\".+\"} | json | http_status_code=\"500\" [1h])) > 0"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.hide | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.intervalMs | int | `1000` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.maxDataPoints | int | `43200` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].model.refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].relativeTimeRange.from | int | `600` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].data[0].relativeTimeRange.to | int | `0` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].execErrState | string | `"KeepLast"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].for | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].isPaused | bool | `false` | 
| -| grafana.alerting."rules.yaml".groups[0].rules[0].labels | object | `{}` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].noDataState | string | `"OK"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].notification_settings.receiver | string | `"Slack"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].title | string | `"HTTP 500 errors detected"` | | -| grafana.alerting."rules.yaml".groups[0].rules[0].uid | string | `"edwb8zgcvq96oc"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].annotations.description | string | `"Error in usersync job detected in cluster {{`{{ $labels.clusters }}`}}, namespace {{`{{ $labels.namespace }}`}}."` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].annotations.summary | string | `"Error Logs Detected in Usersync Job"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].condition | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].datasourceUid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.datasource.type | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.datasource.uid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.editorMode | string | `"code"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.expr | string | `"sum by (cluster, namespace) (count_over_time({ app=\"gen3job\", job_name=~\"usersync-.*\"} |= \"ERROR - could not revoke policies from user `N/A`\" [5m])) > 1"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.hide | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.intervalMs | int | `1000` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.maxDataPoints | int | `43200` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].model.refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].relativeTimeRange.from | int | `600` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].data[0].relativeTimeRange.to | int | `0` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].execErrState | string | `"KeepLast"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].for | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].isPaused | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].labels | object | `{}` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].noDataState | string | `"OK"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].notification_settings.receiver | string | `"Slack"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].title | string | `"Error Logs Detected in Usersync Job"` | | -| grafana.alerting."rules.yaml".groups[0].rules[1].uid | string | `"adwb9vhb7irr4b"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].annotations.description | string | `"Panic detected in app {{`{{ $labels.app }}`}} within cluster {{`{{ $labels.clusters }}`}}."` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].annotations.summary | string | `"Hatchery panic"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].condition | string | `"A"` | | -| 
grafana.alerting."rules.yaml".groups[0].rules[2].data[0].datasourceUid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.datasource.type | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.datasource.uid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.editorMode | string | `"code"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.expr | string | `"sum by (cluster) (count_over_time({app=\"hatchery\"} |= \"panic\" [5m])) > 1"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.hide | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.intervalMs | int | `1000` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.maxDataPoints | int | `43200` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].model.refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].relativeTimeRange.from | int | `600` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].data[0].relativeTimeRange.to | int | `0` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].execErrState | string | `"KeepLast"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].for | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].isPaused | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].labels | object | `{}` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].noDataState | string | `"OK"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].notification_settings.receiver | string | `"Slack"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].title | string | `"Hatchery panic in {{`{{ env.name }}`}}"` | | -| grafana.alerting."rules.yaml".groups[0].rules[2].uid | string | `"ddwbc12l6wc8wf"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].annotations.description | string | `"Detected 431 HTTP status codes in the logs within the last 5 minutes."` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].annotations.summary | string | `"Http status code 431"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].condition | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].datasourceUid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.datasource.type | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.datasource.uid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.editorMode | string | `"code"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.expr | string | `"sum(count_over_time({cluster=~\".+\"} | json | http_status_code=\"431\" [5m])) >= 2"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.hide | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.intervalMs | int | `1000` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.maxDataPoints | int | `43200` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.queryType | string | `"instant"` | | -| 
grafana.alerting."rules.yaml".groups[0].rules[3].data[0].model.refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].relativeTimeRange.from | int | `600` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].data[0].relativeTimeRange.to | int | `0` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].execErrState | string | `"KeepLast"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].for | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].isPaused | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].labels | object | `{}` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].noDataState | string | `"OK"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].notification_settings.receiver | string | `"Slack"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].title | string | `"Http status code 431"` | | -| grafana.alerting."rules.yaml".groups[0].rules[3].uid | string | `"cdwbcbphz1zb4a"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].annotations.description | string | `"High number of info status logs detected in the indexd service in cluster {{`{{ $labels.clusters }}`}}."` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].annotations.summary | string | `"Indexd is getting an excessive amount of traffic"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].condition | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].datasourceUid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.datasource.type | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.datasource.uid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.editorMode | string | `"code"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.expr | string | `"sum by (cluster) (count_over_time({cluster=~\".+\", app=\"indexd\", status=\"info\"} [5m])) > 50000"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.hide | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.intervalMs | int | `1000` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.maxDataPoints | int | `43200` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].model.refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].relativeTimeRange.from | int | `600` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].data[0].relativeTimeRange.to | int | `0` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].execErrState | string | `"KeepLast"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].for | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].isPaused | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].labels | object | `{}` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].noDataState | string | `"OK"` | | -| 
grafana.alerting."rules.yaml".groups[0].rules[4].notification_settings.receiver | string | `"Slack"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].title | string | `"Indexd is getting an excessive amount of traffic"` | | -| grafana.alerting."rules.yaml".groups[0].rules[4].uid | string | `"bdwbck1lgwdfka"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].annotations.description | string | `"More than 10 errors detected in the karpenter namespace in cluster {{`{{ $labels.clusters }}`}} related to providerRef not found."` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].annotations.summary | string | `"Karpenter Resource Mismatch"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].condition | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].datasourceUid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.datasource.type | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.datasource.uid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.editorMode | string | `"code"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.expr | string | `"sum by (cluster) (count_over_time({namespace=\"karpenter\", cluster=~\".+\"} |= \"ERROR\" |= \"not found\" |= \"getting providerRef\" [5m])) > 10\n"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.hide | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.intervalMs | int | `1000` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.maxDataPoints | int | `43200` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].model.refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].relativeTimeRange.from | int | `600` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].data[0].relativeTimeRange.to | int | `0` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].execErrState | string | `"KeepLast"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].for | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].isPaused | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].labels | object | `{}` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].noDataState | string | `"OK"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].notification_settings.receiver | string | `"Slack"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].title | string | `"Karpenter Resource Mismatch"` | | -| grafana.alerting."rules.yaml".groups[0].rules[5].uid | string | `"fdwbe5t439zpcd"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].annotations.description | string | `"More than 1000 \"limiting requests, excess\" errors detected in service {{`{{ $labels.app }}`}} (cluster: {{`{{ $labels.clusters }}`}}) within the last 5 minutes."` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].annotations.summary | string | `"Nginx is logging excessive \" limiting requests, excess:\""` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].condition | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].datasourceUid | 
string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.datasource.type | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.datasource.uid | string | `"loki"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.editorMode | string | `"code"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.expr | string | `"sum by (app, cluster) (count_over_time({app=~\".+\", cluster=~\".+\"} |= \"status:error\" |= \"limiting requests, excess:\" [5m])) > 1000"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.hide | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.intervalMs | int | `1000` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.maxDataPoints | int | `43200` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].model.refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].queryType | string | `"instant"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].refId | string | `"A"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].relativeTimeRange.from | int | `600` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].data[0].relativeTimeRange.to | int | `0` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].execErrState | string | `"KeepLast"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].for | string | `"5m"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].isPaused | bool | `false` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].labels | object | `{}` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].noDataState | string | `"OK"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].notification_settings.receiver | string | `"Slack"` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].title | string | `"Nginx is logging excessive \" limiting requests, excess:\""` | | -| grafana.alerting."rules.yaml".groups[0].rules[6].uid | string | `"fdwbeuftc7400c"` | | -| grafana.datasources | object | `{"datasources.yaml":{"apiVersion":1,"datasources":[{"isDefault":false,"name":"Loki","type":"loki","uid":"loki","url":"http://{{ .Release.Name }}-loki-gateway"},{"isDefault":true,"name":"Mimir","type":"prometheus","uid":"prom","url":"http://{{ .Release.Name }}-mimir-nginx/prometheus"},{"isDefault":false,"jsonData":{"lokiSearch":{"datasourceUid":"loki"},"serviceMap":{"datasourceUid":"prom"},"tracesToLogsV2":{"datasourceUid":"loki"},"tracesToMetrics":{"datasourceUid":"prom"}},"name":"Tempo","type":"tempo","uid":"tempo","url":"http://{{ .Release.Name }}-tempo-query-frontend:3100"}]}}` | Grafana data sources config. Connects to all three by default | -| grafana.datasources."datasources.yaml".datasources | list | `[{"isDefault":false,"name":"Loki","type":"loki","uid":"loki","url":"http://{{ .Release.Name }}-loki-gateway"},{"isDefault":true,"name":"Mimir","type":"prometheus","uid":"prom","url":"http://{{ .Release.Name }}-mimir-nginx/prometheus"},{"isDefault":false,"jsonData":{"lokiSearch":{"datasourceUid":"loki"},"serviceMap":{"datasourceUid":"prom"},"tracesToLogsV2":{"datasourceUid":"loki"},"tracesToMetrics":{"datasourceUid":"prom"}},"name":"Tempo","type":"tempo","uid":"tempo","url":"http://{{ .Release.Name }}-tempo-query-frontend:3100"}]` | Datasources linked to the Grafana instance. Override if you disable any components. 
| -| grafana.enabled | bool | `true` | Deploy Grafana if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/grafana#configuration) for full values reference. | -| loki.enabled | bool | `true` | Deploy Loki if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed#values) for full values reference. | -| mimir | object | `{"alertmanager":{"resources":{"requests":{"cpu":"20m"}}},"compactor":{"resources":{"requests":{"cpu":"20m"}}},"distributor":{"resources":{"requests":{"cpu":"20m"}}},"enabled":true,"ingester":{"replicas":2,"resources":{"requests":{"cpu":"20m"}},"zoneAwareReplication":{"enabled":false}},"minio":{"resources":{"requests":{"cpu":"20m"}}},"overrides_exporter":{"resources":{"requests":{"cpu":"20m"}}},"querier":{"replicas":1,"resources":{"requests":{"cpu":"20m"}}},"query_frontend":{"resources":{"requests":{"cpu":"20m"}}},"query_scheduler":{"replicas":1,"resources":{"requests":{"cpu":"20m"}}},"rollout_operator":{"resources":{"requests":{"cpu":"20m"}}},"ruler":{"resources":{"requests":{"cpu":"20m"}}},"store_gateway":{"resources":{"requests":{"cpu":"20m"}},"zoneAwareReplication":{"enabled":false}}}` | Mimir chart values. Resources are set to a minimum by default. | -| mimir.enabled | bool | `true` | Deploy Mimir if enabled. See [upstream values.yaml](https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml) for full values reference. | -| tempo.enabled | bool | `true` | Deploy Tempo if enabled. See [upstream readme](https://github.com/grafana/helm-charts/blob/main/charts/tempo-distributed/README.md#values) for full values reference. | -| tempo.ingester.replicas | int | `3` | | - ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/lgtm-distributed/templates/NOTES.txt b/helm/lgtm-distributed/templates/NOTES.txt deleted file mode 100644 index 482f35c8..00000000 --- a/helm/lgtm-distributed/templates/NOTES.txt +++ /dev/null @@ -1 +0,0 @@ -Release name should be limited to 25 characters to not exceed the resource name limits of 63 characters. diff --git a/helm/lgtm-distributed/templates/_helpers.tpl b/helm/lgtm-distributed/templates/_helpers.tpl deleted file mode 100644 index 4c1d430f..00000000 --- a/helm/lgtm-distributed/templates/_helpers.tpl +++ /dev/null @@ -1,18 +0,0 @@ - {{/* -Create a default fully qualified app name without trimming it at all. -If release name contains chart name it will be used as a full name. -This value is essentially the same as "mimir.fullname" in the upstream chart. -*/}} -{{- define "mimir.fullname" -}} -{{- if .Values.mimir.fullnameOverride -}} -{{- .Values.mimir.fullnameOverride | trunc 25 | trimSuffix "-" -}} -{{- else -}} -{{- $name := .Values.mimir.nameOverride | default ( include "mimir.infixName" . ) | trunc 25 | trimSuffix "-" -}} -{{- $releasename := .Release.Name | trunc 25 | trimSuffix "-" -}} -{{- if contains $name .Release.Name -}} -{{- $releasename -}} -{{- else -}} -{{- printf "%s-%s" $releasename $name -}} -{{- end -}} -{{- end -}} -{{- end -}} diff --git a/helm/lgtm-distributed/values.yaml b/helm/lgtm-distributed/values.yaml deleted file mode 100644 index 24a0a422..00000000 --- a/helm/lgtm-distributed/values.yaml +++ /dev/null @@ -1,352 +0,0 @@ ---- -grafana: - # -- Deploy Grafana if enabled. 
See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/grafana#configuration) for full values reference. - enabled: true - - # -- Grafana data sources config. Connects to all three by default - datasources: - datasources.yaml: - apiVersion: 1 - # -- Datasources linked to the Grafana instance. Override if you disable any components. - datasources: - # https://grafana.com/docs/grafana/latest/datasources/loki/#provision-the-loki-data-source - - name: Loki - uid: loki - type: loki - url: http://{{ .Release.Name }}-loki-gateway - isDefault: false - # https://grafana.com/docs/grafana/latest/datasources/prometheus/#provision-the-data-source - - name: Mimir - uid: prom - type: prometheus - url: http://{{ .Release.Name }}-mimir-nginx/prometheus - isDefault: true - # https://grafana.com/docs/grafana/latest/datasources/tempo/configure-tempo-data-source/#provision-the-data-source - - name: Tempo - uid: tempo - type: tempo - url: http://{{ .Release.Name }}-tempo-query-frontend:3100 - isDefault: false - jsonData: - tracesToLogsV2: - datasourceUid: loki - lokiSearch: - datasourceUid: loki - tracesToMetrics: - datasourceUid: prom - serviceMap: - datasourceUid: prom - - - alerting: - rules.yaml: - apiVersion: 1 - groups: - - orgId: 1 - name: Alerts - folder: Alerts - interval: 5m - rules: - - uid: edwb8zgcvq96oc - title: HTTP 500 errors detected - condition: A - data: - - refId: A - queryType: instant - relativeTimeRange: - from: 600 - to: 0 - datasourceUid: loki - model: - datasource: - type: loki - uid: loki - editorMode: code - expr: sum by (cluster) (count_over_time({cluster=~".+"} | json | http_status_code="500" [1h])) > 0 - hide: false - intervalMs: 1000 - maxDataPoints: 43200 - queryType: instant - refId: A - noDataState: OK - execErrState: KeepLast - for: 5m - annotations: - summary: 'Alert: HTTP 500 errors detected in the environment: {{`{{ $labels.clusters }}`}}' - labels: {} - isPaused: false - notification_settings: - receiver: Slack - - uid: adwb9vhb7irr4b - title: Error Logs Detected in Usersync Job - condition: A - data: - - refId: A - queryType: instant - relativeTimeRange: - from: 600 - to: 0 - datasourceUid: loki - model: - datasource: - type: loki - uid: loki - editorMode: code - expr: sum by (cluster, namespace) (count_over_time({ app="gen3job", job_name=~"usersync-.*"} |= "ERROR - could not revoke policies from user `N/A`" [5m])) > 1 - hide: false - intervalMs: 1000 - maxDataPoints: 43200 - queryType: instant - refId: A - noDataState: OK - execErrState: KeepLast - for: 5m - annotations: - description: Error in usersync job detected in cluster {{`{{ $labels.clusters }}`}}, namespace {{`{{ $labels.namespace }}`}}. - summary: Error Logs Detected in Usersync Job - labels: {} - isPaused: false - notification_settings: - receiver: Slack - - uid: ddwbc12l6wc8wf - title: Hatchery panic in {{`{{ env.name }}`}} - condition: A - data: - - refId: A - queryType: instant - relativeTimeRange: - from: 600 - to: 0 - datasourceUid: loki - model: - datasource: - type: loki - uid: loki - editorMode: code - expr: sum by (cluster) (count_over_time({app="hatchery"} |= "panic" [5m])) > 1 - hide: false - intervalMs: 1000 - maxDataPoints: 43200 - queryType: instant - refId: A - noDataState: OK - execErrState: KeepLast - for: 5m - annotations: - description: Panic detected in app {{`{{ $labels.app }}`}} within cluster {{`{{ $labels.clusters }}`}}. 
- summary: Hatchery panic - labels: {} - isPaused: false - notification_settings: - receiver: Slack - - uid: cdwbcbphz1zb4a - title: Http status code 431 - condition: A - data: - - refId: A - queryType: instant - relativeTimeRange: - from: 600 - to: 0 - datasourceUid: loki - model: - datasource: - type: loki - uid: loki - editorMode: code - expr: sum(count_over_time({cluster=~".+"} | json | http_status_code="431" [5m])) >= 2 - hide: false - intervalMs: 1000 - maxDataPoints: 43200 - queryType: instant - refId: A - noDataState: OK - execErrState: KeepLast - for: 5m - annotations: - description: Detected 431 HTTP status codes in the logs within the last 5 minutes. - summary: Http status code 431 - labels: {} - isPaused: false - notification_settings: - receiver: Slack - - uid: bdwbck1lgwdfka - title: Indexd is getting an excessive amount of traffic - condition: A - data: - - refId: A - queryType: instant - relativeTimeRange: - from: 600 - to: 0 - datasourceUid: loki - model: - datasource: - type: loki - uid: loki - editorMode: code - expr: sum by (cluster) (count_over_time({cluster=~".+", app="indexd", status="info"} [5m])) > 50000 - hide: false - intervalMs: 1000 - maxDataPoints: 43200 - queryType: instant - refId: A - noDataState: OK - execErrState: KeepLast - for: 5m - annotations: - description: High number of info status logs detected in the indexd service in cluster {{`{{ $labels.clusters }}`}}. - summary: Indexd is getting an excessive amount of traffic - labels: {} - isPaused: false - notification_settings: - receiver: Slack - - uid: fdwbe5t439zpcd - title: Karpenter Resource Mismatch - condition: A - data: - - refId: A - queryType: instant - relativeTimeRange: - from: 600 - to: 0 - datasourceUid: loki - model: - datasource: - type: loki - uid: loki - editorMode: code - expr: | - sum by (cluster) (count_over_time({namespace="karpenter", cluster=~".+"} |= "ERROR" |= "not found" |= "getting providerRef" [5m])) > 10 - hide: false - intervalMs: 1000 - maxDataPoints: 43200 - queryType: instant - refId: A - noDataState: OK - execErrState: KeepLast - for: 5m - annotations: - description: More than 10 errors detected in the karpenter namespace in cluster {{`{{ $labels.clusters }}`}} related to providerRef not found. - summary: Karpenter Resource Mismatch - labels: {} - isPaused: false - notification_settings: - receiver: Slack - - uid: fdwbeuftc7400c - title: Nginx is logging excessive " limiting requests, excess:" - condition: A - data: - - refId: A - queryType: instant - relativeTimeRange: - from: 600 - to: 0 - datasourceUid: loki - model: - datasource: - type: loki - uid: loki - editorMode: code - expr: sum by (app, cluster) (count_over_time({app=~".+", cluster=~".+"} |= "status:error" |= "limiting requests, excess:" [5m])) > 1000 - hide: false - intervalMs: 1000 - maxDataPoints: 43200 - queryType: instant - refId: A - noDataState: OK - execErrState: KeepLast - for: 5m - annotations: - description: 'More than 1000 "limiting requests, excess" errors detected in service {{`{{ $labels.app }}`}} (cluster: {{`{{ $labels.clusters }}`}}) within the last 5 minutes.' - summary: Nginx is logging excessive " limiting requests, excess:" - labels: {} - isPaused: false - notification_settings: - receiver: Slack - contactpoints.yaml: - secret: - apiVersion: 1 - contactPoints: - - orgId: 1 - name: slack - receivers: - - uid: first_uid - type: Slack - settings: - url: https://hooks.slack.com/services/XXXXXXXXXX - group: slack - summary: | - {{ `{{ include "default.message" . 
}}` }} - - -loki: - # -- Deploy Loki if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed#values) for full values reference. - enabled: true - -# -- Mimir chart values. Resources are set to a minimum by default. -mimir: - # -- Deploy Mimir if enabled. See [upstream values.yaml](https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml) for full values reference. - enabled: true - alertmanager: - resources: - requests: - cpu: 20m - compactor: - resources: - requests: - cpu: 20m - distributor: - resources: - requests: - cpu: 20m - ingester: - replicas: 2 - zoneAwareReplication: - enabled: false - resources: - requests: - cpu: 20m - overrides_exporter: - resources: - requests: - cpu: 20m - querier: - replicas: 1 - resources: - requests: - cpu: 20m - query_frontend: - resources: - requests: - cpu: 20m - query_scheduler: - replicas: 1 - resources: - requests: - cpu: 20m - ruler: - resources: - requests: - cpu: 20m - store_gateway: - zoneAwareReplication: - enabled: false - resources: - requests: - cpu: 20m - minio: - resources: - requests: - cpu: 20m - rollout_operator: - resources: - requests: - cpu: 20m - -tempo: - # -- Deploy Tempo if enabled. See [upstream readme](https://github.com/grafana/helm-charts/blob/main/charts/tempo-distributed/README.md#values) for full values reference. - enabled: true - ingester: - replicas: 3 - \ No newline at end of file diff --git a/helm/manifestservice/README.md b/helm/manifestservice/README.md index 4236568c..11fc1f39 100644 --- a/helm/manifestservice/README.md +++ b/helm/manifestservice/README.md @@ -85,5 +85,3 @@ A Helm chart for Kubernetes | volumeMounts | list | `[{"mountPath":"/var/gen3/config/","name":"config-volume","readOnly":true}]` | Volumes to mount to the container. | | volumes | list | `[{"name":"config-volume","secret":{"secretName":"manifestservice-g3auto"}}]` | Volumes to attach to the container. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/metadata/README.md b/helm/metadata/README.md index 294abb91..c9553ba9 100644 --- a/helm/metadata/README.md +++ b/helm/metadata/README.md @@ -124,5 +124,3 @@ A Helm chart for gen3 Metadata Service | useAggMds | bool | `"True"` | Set to true to aggregate metadata from multiple other Metadata Service instances. | | volumeMounts | list | `[{"mountPath":"/src/.env","name":"config-volume-g3auto","readOnly":true,"subPath":"metadata.env"},{"mountPath":"/aggregate_config.json","name":"config-volume","readOnly":true,"subPath":"aggregate_config.json"},{"mountPath":"/metadata.json","name":"config-manifest","readOnly":true,"subPath":"json"}]` | Volumes to mount to the container. | ----------------------------------------------- -Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0) diff --git a/helm/observability/Chart.yaml b/helm/observability/Chart.yaml new file mode 100644 index 00000000..67ac2013 --- /dev/null +++ b/helm/observability/Chart.yaml @@ -0,0 +1,31 @@ +apiVersion: v2 +name: lgtma-chart +description: A Helm chart for deploying the LGTM stack with additional resources + +# A chart can be either an 'application' or a 'library' chart. +# +# Application charts are a collection of templates that can be packaged into versioned archives +# to be deployed. 
+# +# Library charts provide useful utilities or functions for the chart developer. They're included as +# a dependency of application charts to inject those utilities and functions into the rendering +# pipeline. Library charts do not define any templates and therefore cannot be deployed. +type: application + +# This is the chart version. This version number should be incremented each time you make changes +# to the chart and its templates, including the app version. +# Versions are expected to follow Semantic Versioning (https://semver.org/) +version: 0.1.0 + +# This is the version number of the application being deployed. This version number should be +# incremented each time you make changes to the application. Versions are not expected to +# follow Semantic Versioning. They should reflect the version the application is using. +# It is recommended to use it with quotes. +appVersion: "1.0.0" + +# Dependencies +dependencies: + - name: lgtm-distributed + version: "2.1.0" + alias: lgtm + repository: "https://grafana.github.io/helm-charts" \ No newline at end of file diff --git a/helm/observability/SETUP.md b/helm/observability/SETUP.md new file mode 100644 index 00000000..206a71fb --- /dev/null +++ b/helm/observability/SETUP.md @@ -0,0 +1,298 @@ +# Observability Helm Chart + +## Overview + +This Helm chart provides an all-in-one solution for deploying Mimir, Loki, and Grafana to your Kubernetes cluster, enabling a complete observability stack for metrics, logs, and visualization. + +### Grafana: +A leading open-source platform for data visualization and monitoring. Grafana allows you to create rich, interactive dashboards from a variety of data sources, making it easy to analyze metrics and logs from your systems. + +### Mimir: +Grafana Mimir is a highly scalable time-series database optimized for storing and querying metrics. It enables powerful alerting and querying for real-time monitoring of your infrastructure and applications. + +### Loki: +Grafana Loki is a log aggregation system designed to efficiently collect, store, and query logs from your applications. It works seamlessly with Grafana, providing an integrated way to visualize logs alongside metrics. + +By deploying this Helm chart, you'll set up these three components together, allowing you to monitor your systems and applications comprehensively with metrics from Mimir, logs from Loki, and dashboards and alerts in Grafana. +## General Architecture + +The Alloy Helm chart can be deployed across one or more environments or clusters. In this setup, Loki and Mimir are configured with internal ingress resources, enabling Alloy to send metrics and logs securely via VPC peering connections. Both Loki and Mimir write the ingested data to Amazon S3 for scalable and durable storage. This data can be queried and visualized through Grafana, which is hosted behind an internet-facing ingress. Access to Grafana can be restricted using CIDR ranges defined through the ALB ingress annotation: alb.ingress.kubernetes.io/inbound-cidrs: "cidrs". Additionally, the chart supports SAML authentication for Grafana, configured through the grafana.ini field, ensuring secure user access. + +![Grafana Architecture](image.png) + +### Fips compliant images + +Gen3 provides FIPS-compliant images, which are set as the default in the values file for Grafana, Mimir, and Loki. These images are self-hosted and maintained by the Gen3 platform team, ensuring secure and compliant operations. 
While the platform team manages image upgrades, the service versions will be updated as needed to align with operational requirements and best practices. + +### Helm Chart Links +The links below will take you to the Grafana LGTM chart, as well as the Grafana, Loki, and Mimir charts, providing a comprehensive list of configurable options to help you further customize your setup. +#### Link to lgtm Helm chart +- [LGTM Helm Chart](https://github.com/grafana/helm-charts/tree/main/charts/lgtm-distributed) +#### Full Configuration Options for all Components +- [Grafana](https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml) +- [Loki](https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml) +- [Mimir](https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml) + +### Affinity Rules + +The affinity rule in the values.yaml file controls pod scheduling to specific nodes or zones. By default, pods are restricted to nodes in us-east-1a using a node label (topology.kubernetes.io/zone). + +Customize these rules to align with your cluster’s zones or labels to ensure pods can schedule properly. Mismatched configurations can lead to scheduling failures. + +```yaml + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a +``` + +### IRSA Role Setup + +This Helm chart automatically creates a service account named "observability" for use with Loki and Mimir. To ensure proper access to the storage buckets holding Loki and Mimir data, you’ll need to associate an AWS IAM Role with this service account. Configure the role with the necessary permissions to access the relevant S3 buckets, and then provide the role’s ARN in the appropriate section of your values.yaml file. + +```yaml +lgtm: + # -- (map) Configuration for IRSA role to use with service accounts. + role: + # -- (string) The arn of the aws role to associate with the service account that will be used for Loki and Mimir. + # Documentation on IRSA setup https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html + arn: +``` + +## Configuring Grafana + +When configuring Grafana, update the hosts section of the values.yaml file to match the hostname you plan to use. For example, replace "grafana.example.com" with your desired hostname. + +### Ingress + +Grafana requires an internet-facing ingress so that users can reach its dashboards, alerts, and other features from outside the cluster. It is highly recommended that you uncomment and adjust the annotations provided for AWS ALB (Application Load Balancer) to fit your environment (if deploying via AWS). These annotations will help ensure proper configuration of the load balancer, SSL certificates, and other key settings. For instance, make sure to replace the placeholder values such as "cert arn", "ssl policy", and "environment name" with your specific details. Access to Grafana can be restricted using CIDR ranges defined through the ALB ingress annotation: alb.ingress.kubernetes.io/inbound-cidrs: "cidrs". + +```yaml +grafana: + ingress: + # -- (bool) Enable or disable ingress for Grafana.
+ enabled: true + # -- (map) Annotations for Grafana ingress. + annotations: {} + ## Recommended annotations for AWS ALB (Application Load Balancer). + # alb.ingress.kubernetes.io/ssl-redirect: '443' + # alb.ingress.kubernetes.io/certificate-arn: + # alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]' + # alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600 + # alb.ingress.kubernetes.io/scheme: internet-facing + # alb.ingress.kubernetes.io/ssl-policy: + # alb.ingress.kubernetes.io/tags: Environment= + # alb.ingress.kubernetes.io/target-type: 'ip' + # alb.ingress.kubernetes.io/inbound-cidrs: + # -- (list) Hostname(s) for Grafana ingress. + hosts: + - grafana.example.com + # -- (string) Ingress class name to be used (e.g., 'alb' for AWS Application Load Balancer). + ingressClassName: "alb" +``` + +### Built-in Gen3 Alerts + +This Helm chart comes equipped with built-in Gen3 alerts, defined in the 'alerting' section of the values.yaml. These alerts enable you to immediately leverage your logs and metrics as soon as Grafana is up and running. + +### Built-in Gen3 Dashboards + +We'll soon be releasing Gen3 dashboards, providing users with Gen3-specific visualizations. Please check back here to see if they have been released. + +## Configuring Mimir + +When configuring Mimir, update the hosts section of the values.yaml file to match the hostname you plan to use. For example, replace "mimir.example.com" with your desired hostname. + +### Ingress + +Mimir requires an internal ingress so that Alloy and other in-network clients can reach its write and query endpoints, for example over a VPC peering connection. It is highly recommended that you uncomment and adjust the annotations provided for AWS ALB (Application Load Balancer) to fit your environment (if deploying via AWS). These annotations will help ensure proper configuration of the load balancer, SSL certificates, and other key settings. For instance, make sure to replace the placeholder values such as "cert arn", "ssl policy", and "environment name" with your specific details. + +```yaml +mimir: + ingress: + # -- (map) Annotations to add to mimir ingress. + annotations: {} + ## Recommended annotations for AWS ALB (Application Load Balancer). + # alb.ingress.kubernetes.io/certificate-arn: + # alb.ingress.kubernetes.io/ssl-redirect: '443' + # alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]' + # alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600 + # alb.ingress.kubernetes.io/scheme: internal + # alb.ingress.kubernetes.io/ssl-policy: + # alb.ingress.kubernetes.io/tags: Environment= + # alb.ingress.kubernetes.io/target-type: ip + # -- (bool) Enable or disable mimir ingress. + enabled: true + # -- (string) Class name for ingress. + ingressClassName: "alb" + # -- (map) Additional paths to add to the ingress. + paths: + # -- (list) Additional paths to add to the query frontend. + query-frontend: + - path: /prometheus/api/v1/query + # -- (list) hostname for mimir ingress. + hosts: + - mimir.example.com +``` + +### Storage Configuration + +The structuredConfig section in Mimir’s configuration defines how backend storage is set up to persist metrics and time-series data. This configuration ensures that data is safely stored and retrievable over time, even if Mimir instances restart or scale. + +If you are utilizing Amazon S3 for storage, make sure to uncomment "bucket_name" and input a value. + +```yaml +mimir: + # -- (map) Structured configuration settings for mimir.
+ structuredConfig: + common: + storage: + # -- (string) Backend storage configuration. For example, s3 for AWS S3 storage. + backend: s3 + s3: + # -- (string) The S3 endpoint to use for storage. Ensure this matches your region. + endpoint: s3.us-east-1.amazonaws.com + # -- (string) AWS region where your S3 bucket is located. + region: us-east-1 + # # -- (string) Name of the S3 bucket used for storage. + # bucket_name: +``` + +### Mimir Components +Mimir is a high-performance time-series database, typically used for storing and querying metrics. +1. **Alertmanager** + - **Pods**: `grafana-mimir-alertmanager-*` + - **Purpose**: Manages alert notifications and routing. + - **Function**: Sends alerts to different channels like email, Slack, etc., based on defined rules. + +2. **Compactor** + - **Pods**: `grafana-mimir-compactor-*` + - **Purpose**: Compacts time-series data to optimize storage. + - **Function**: Periodically reduces the size of stored metrics by merging smaller chunks. + +3. **Distributor** + - **Pods**: `grafana-mimir-distributor-*` + - **Purpose**: Accepts incoming metric data and distributes it to ingesters. + - **Function**: Acts as a load balancer for metric ingestion. + +4. **Ingester** + - **Pods**: `grafana-mimir-ingester-*` + - **Purpose**: Temporarily holds and processes incoming metric data. + - **Function**: Ingesters store time-series data in memory before flushing to long-term storage. + +5. **Querier** + - **Pods**: `grafana-mimir-querier-*` + - **Purpose**: Handles metric queries. + - **Function**: Retrieves time-series data from ingesters and long-term storage for queries. + +6. **Query Frontend** + - **Pods**: `grafana-mimir-query-frontend-*` + - **Purpose**: Coordinates and optimizes query execution. + - **Function**: Distributes query workloads to ensure performance and efficiency. + +7. **Query Scheduler** + - **Pods**: `grafana-mimir-query-scheduler-*` + - **Purpose**: Schedules query jobs across queriers. + - **Function**: Ensures balanced query processing across components. + +8. **Ruler** + - **Pods**: `grafana-mimir-ruler-*` + - **Purpose**: Evaluates recording and alerting rules. + - **Function**: Generates time-series data or alerts based on predefined rules. + +9. **Store Gateway** + - **Pods**: `grafana-mimir-store-gateway-*` + - **Purpose**: Provides access to long-term storage. + - **Function**: Optimizes retrieval of historical data from object stores. + +## Configuring Loki + +When configuring Loki, update the hosts section of the values.yaml file to match the hostname you plan to use. For example, replace "loki.example.com" with your desired hostname. + +### Ingress + +Loki requires an internal ingress so that Alloy and other in-network clients can push logs to it and query it, for example over a VPC peering connection. It is highly recommended that you uncomment and adjust the annotations provided for AWS ALB (Application Load Balancer) to fit your environment (if deploying via AWS). These annotations will help ensure proper configuration of the load balancer, SSL certificates, and other key settings. For instance, make sure to replace the placeholder values such as "cert arn", "ssl policy", and "environment name" with your specific details. + +```yaml +loki: + ingress: + # -- (map) Annotations to add to loki ingress. + annotations: {} + ## Recommended annotations for AWS ALB (Application Load Balancer).
+ # alb.ingress.kubernetes.io/certificate-arn: + # alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]' + # alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600 + # alb.ingress.kubernetes.io/scheme: internal + # alb.ingress.kubernetes.io/ssl-policy: + # alb.ingress.kubernetes.io/ssl-redirect: '443' + # alb.ingress.kubernetes.io/tags: Environment= + # alb.ingress.kubernetes.io/target-type: ip + # -- (bool) Enable or disable loki ingress. + enabled: true + # -- (string) Class name for ingress. + ingressClassName: "alb" + # -- (list) Hosts for loki ingress. + hosts: + # -- (string) Hostname for loki ingress. + - host: loki.example.com +``` + +### Storage Configuration + +The structuredConfig section in Loki’s configuration defines how backend storage is set up to persist log data. This configuration ensures that logs are safely stored and retrievable over time, even if Loki instances restart or scale. + +If you are utilizing Amazon S3 for storage, make sure to uncomment "bucketnames" and input a value. + +```yaml +loki: + # -- (map) Structured configuration settings for Loki. + structuredConfig: + common: + # -- (string) Path prefix where Loki stores data. + path_prefix: /var/loki + storage: + # -- (null) Filesystem storage is disabled. + filesystem: null + s3: + # -- (string) AWS region for S3 storage. + region: us-east-1 + # # -- (string) S3 bucket names for Loki storage. + # bucketnames: +``` + +### Loki Components +Loki is used for log aggregation, querying, and management. Each Loki component has a specialized role in the log pipeline. +1. **Distributor** + - **Pods**: `grafana-loki-distributor-*` + - **Purpose**: Accepts log entries and forwards them to ingesters. + - **Function**: It load-balances logs from sources and ensures efficient distribution to ingesters. + +2. **Gateway** + - **Pods**: `grafana-loki-gateway-*` + - **Purpose**: Acts as an API gateway or entry point for requests. + - **Function**: Can be used for proxying queries to the appropriate backend components. + +3. **Ingester** + - **Pods**: `grafana-loki-ingester-*` + - **Purpose**: Receives and stores log entries in chunks. + - **Function**: Ingesters temporarily hold logs in memory and periodically flush them to storage (like S3 or other object stores). + +4. **Querier** + - **Pods**: `grafana-loki-querier-*` + - **Purpose**: Handles log queries from users. + - **Function**: Retrieves logs from ingesters and long-term storage for querying purposes. + +5. **Query Frontend** + - **Pods**: `grafana-loki-query-frontend-*` + - **Purpose**: Distributes and coordinates queries. + - **Function**: Splits large queries into smaller ones for faster execution by the queriers. 
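+
+## Example: Minimal values.yaml Overrides
+
+The sections above show each piece of the configuration in isolation. As a convenience, the sketch below collects the values most deployments change in one place: the IRSA role ARN, the Grafana, Mimir, and Loki hostnames, and the S3 bucket names. The ARN, hostnames, and bucket names are placeholders; replace them with the values for your environment, and keep any other overrides (ingress annotations, resources, affinity) alongside them in the same file.
+
+```yaml
+lgtm:
+  role:
+    # Placeholder ARN for the IAM role bound to the "observability" service account.
+    arn: arn:aws:iam::111122223333:role/observability-irsa
+  grafana:
+    ingress:
+      hosts:
+        - grafana.example.com
+    env:
+      GF_SERVER_ROOT_URL: "https://grafana.example.com"
+  mimir:
+    ingress:
+      hosts:
+        - mimir.example.com
+    mimir:
+      structuredConfig:
+        common:
+          storage:
+            s3:
+              # Placeholder bucket holding Mimir blocks, ruler, and alertmanager data.
+              bucket_name: example-mimir-data
+  loki:
+    ingress:
+      hosts:
+        - host: loki.example.com
+    loki:
+      structuredConfig:
+        common:
+          storage:
+            s3:
+              # Placeholder bucket holding Loki log chunks.
+              bucketnames: example-loki-data
+```
+
+Any setting not listed here falls back to the defaults shipped in this chart's values.yaml.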
\ No newline at end of file diff --git a/helm/observability/image.png b/helm/observability/image.png new file mode 100644 index 00000000..7ed5d6ac Binary files /dev/null and b/helm/observability/image.png differ diff --git a/helm/observability/templates/observability-sa.yaml b/helm/observability/templates/observability-sa.yaml new file mode 100644 index 00000000..14c97409 --- /dev/null +++ b/helm/observability/templates/observability-sa.yaml @@ -0,0 +1,7 @@ +apiVersion: v1 +automountServiceAccountToken: true +kind: ServiceAccount +metadata: + annotations: + eks.amazonaws.com/role-arn: {{ .Values.lgtm.role.arn | quote }} + name: observability \ No newline at end of file diff --git a/helm/observability/values.yaml b/helm/observability/values.yaml new file mode 100644 index 00000000..d2cc2cfb --- /dev/null +++ b/helm/observability/values.yaml @@ -0,0 +1,1108 @@ +--- +lgtm: + # -- (map) Configuration for IRSA role to use with service accounts. + role: + # -- (string) The arn of the aws role to associate with the service account that will be used for Loki and Mimir. + # Documentation on IRSA setup https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html + arn: + + # -- (map) Tempo configuration (currently disabled). + tempo: + # -- (bool) Enable or disable tempo. + enabled: false + + # -- (map) Mimir configuration. + mimir: + # -- (map) Docker image information. + image: + # -- (string) The Docker image repository for mimir. + repository: quay.io/cdis/mimir + # -- (string) The Docker image tag for the mimir. + tag: master + # -- (map) Mimir ingress configuration. + ingress: + # -- (map) Annotations to add to mimir ingress. + annotations: {} + ## Recommended annotations for AWS ALB (Application Load Balancer). + # alb.ingress.kubernetes.io/certificate-arn: + # alb.ingress.kubernetes.io/ssl-redirect: '443' + # alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]' + # alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600 + # alb.ingress.kubernetes.io/scheme: internal + # alb.ingress.kubernetes.io/ssl-policy: + # alb.ingress.kubernetes.io/tags: Environment= + # alb.ingress.kubernetes.io/target-type: ip + # -- (bool) Enable or disable mirmir ingress. + enabled: true + # -- (string) Class name for ingress. + ingressClassName: "alb" + # -- (map) Additional paths to add to the ingress. + paths: + # -- (list) Additional paths to add to the query frontend. + query-frontend: + - path: /prometheus/api/v1/query + # -- (list) hostname for mimir ingress. + hosts: + - mimir.example.com + + # -- (map) minio configuration. + minio: + # -- (bool) Enable or disable minio. + enabled: false + + # -- (map) Rollout Operator configuration. + rollout_operator: + # -- (map) Docker image information. + image: + # -- (string) The Docker image repository for the rollout-operator. + repository: quay.io/cdis/rollout-operator + # -- (string) The Docker image tag for the rollout-operator. + tag: master + serviceAccount: + # -- (bool) Whether to create a service account or not. In case 'create' is false, do set 'name' to an existing service account name. The "observability" SA will be created by default via Helm. + create: false + # -- (string) Override for the generated service account name. + name: observability + + mimir: + # -- (map) Structured configuration settings for mimir. + structuredConfig: + limits: + # -- (int) Maximum number of global series allowed per user. Set to '0' for unlimited. 
+ max_global_series_per_user: 0 + # -- (int) The rate limit for ingestion, measured in samples per second. + ingestion_rate: 10000000 + common: + storage: + # -- (string) Backend storage configuration. For example, s3 for AWS S3 storage. + backend: s3 + s3: + # -- (string) The S3 endpoint to use for storage. Ensure this matches your region. + endpoint: s3.us-east-1.amazonaws.com + # -- (string) AWS region where your S3 bucket is located. + region: us-east-1 + # # -- (string) Name of the S3 bucket used for storage. + # bucket_name: + blocks_storage: + # -- (string) Prefix used for storing blocks data. + storage_prefix: blocks + alertmanager_storage: + # -- (string) Prefix used for storing Alertmanager data. + storage_prefix: alertmanager + ruler_storage: + # -- (string) Prefix used for storing ruler data. + storage_prefix: ruler + query_scheduler: + # -- (string) Mode for service discovery in the query scheduler. Set to 'dns' for DNS-based service discovery. + service_discovery_mode: "dns" + + alertmanager: + # -- (map) Configuration for persistent volume in Alertmanager. + persistentVolume: + # -- (bool) Enable or disable the persistent volume for Alertmanager. Set to 'true' to enable, 'false' to disable. + enabled: true + # -- (int) Number of replicas for Alertmanager. Determines how many instances of Alertmanager to run. + replicas: 3 + # -- (map) Affinity rules for scheduling Alertmanager pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + resources: + # -- (map) Resource limits for Alertmanager pods. + limits: + # -- (string) Memory limit for Alertmanager pods. + memory: 2Gi + # -- (map) Resource requests for Alertmanager pods. + requests: + # -- (string) CPU request for Alertmanager pods. Determines how much CPU is guaranteed for the pod. + cpu: 1 + # -- (string) Memory request for Alertmanager pods. Determines how much memory is guaranteed for the pod. + memory: 1Gi + # -- (map) Configuration for deploying Alertmanager as a StatefulSet. + statefulSet: + # -- (bool) Enable or disable the StatefulSet deployment for Alertmanager. Set to 'true' to enable, 'false' to disable. + enabled: true + + compactor: + # -- (map) Affinity rules for scheduling compactor pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (map) Persistent volume configuration for the compactor component. + persistentVolume: + # -- (string) Size of the persistent volume to be used by the compactor. + size: 50Gi + resources: + # -- (map) Resource limits for the compactor component. + limits: + # -- (string) Memory limit for the compactor pods. + memory: 3Gi + # -- (map) Resource requests for the compactor component. 
+ requests: + # -- (string) CPU request for the compactor pods. Determines how much CPU is guaranteed for the pod. + cpu: 1 + # -- (string) Memory request for the compactor pods. Determines how much memory is guaranteed for the pod. + memory: 2Gi + + distributor: + # -- (map) Affinity rules for scheduling distributor pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (int) Number of replicas for the distributor component. Determines how many instances to run. + replicas: 3 + resources: + # -- (map) Resource limits for the distributor component. + limits: + # -- (string) Memory limit for the distributor pods. + memory: 12Gi + # -- (map) Resource requests for the distributor component. + requests: + # -- (string) CPU request for the distributor pods. Determines how much CPU is guaranteed for the pod. + cpu: 2 + # -- (string) Memory request for the distributor pods. Determines how much memory is guaranteed for the pod. + memory: 8Gi + + ingester: + # -- (map) Persistent volume configuration for the ingester component. + persistentVolume: + # -- (string) Size of the persistent volume to be used by the ingester. + size: 50Gi + # -- (int) Number of replicas for the ingester component. Determines how many instances to run. + replicas: 5 + resources: + # -- (map) Resource limits for the ingester component. + limits: + # -- (string) Memory limit for the ingester pods. + memory: 12Gi + # -- (map) Resource requests for the ingester component. + requests: + # -- (string) CPU request for the ingester pods. Determines how much CPU is guaranteed for the pod. + cpu: 3.5 + # -- (string) Memory request for the ingester pods. Determines how much memory is guaranteed for the pod. + memory: 8Gi + # -- (map) Topology spread constraints for the ingester component. Empty by default. + topologySpreadConstraints: {} + affinity: + # -- (map) Affinity rules for scheduling ingester pods. + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (map) Zone-aware replication settings. Helps distribute data across zones. + zoneAwareReplication: + # -- (string) Topology key used for zone-aware replication. + topologyKey: 'kubernetes.io/hostname' + + overrides_exporter: + # -- (map) Affinity rules for scheduling overrides_exporter pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. 
+ operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (int) Number of replicas for the overrides_exporter component. Determines how many instances to run. + replicas: 1 + resources: + # -- (map) Resource limits for the overrides_exporter component. + limits: + # -- (string) Memory limit for the overrides_exporter pods. + memory: 128Mi + # -- (map) Resource requests for the overrides_exporter component. + requests: + # -- (string) CPU request for the overrides_exporter pods. Determines how much CPU is guaranteed for the pod. + cpu: 100m + # -- (string) Memory request for the overrides_exporter pods. Determines how much memory is guaranteed for the pod. + memory: 128Mi + + querier: + # -- (map) Affinity rules for scheduling querier pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (int) Number of replicas for the querier component. Determines how many instances to run. + replicas: 3 + resources: + # -- (map) Resource limits for the querier component. + limits: + # -- (string) Memory limit for the querier pods. + memory: 8Gi + # -- (map) Resource requests for the querier component. + requests: + # -- (string) CPU request for the querier pods. Determines how much CPU is guaranteed for the pod. + cpu: 2 + # -- (string) Memory request for the querier pods. Determines how much memory is guaranteed for the pod. + memory: 6Gi + + query_scheduler: + # -- (map) Affinity rules for scheduling query_scheduler pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + + query_frontend: + # -- (map) Affinity rules for scheduling query_frontend pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (int) Number of replicas for the query_frontend component. Determines how many instances to run. + replicas: 2 + resources: + # -- (map) Resource limits for the query_frontend component. + limits: + # -- (string) Memory limit for the query_frontend pods. + memory: 3Gi + # -- (map) Resource requests for the query_frontend component. + requests: + # -- (string) CPU request for the query_frontend pods. Determines how much CPU is guaranteed for the pod. + cpu: 2 + # -- (string) Memory request for the query_frontend pods. 
Determines how much memory is guaranteed for the pod. + memory: 2Gi + + + ruler: + # -- (map) Affinity rules for scheduling ruler pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (int) Number of replicas for the ruler component. Determines how many instances to run. + replicas: 2 + resources: + # -- (map) Resource limits for the ruler component. + limits: + # -- (string) Memory limit for the ruler pods. + memory: 5Gi + # -- (map) Resource requests for the ruler component. + requests: + # -- (string) CPU request for the ruler pods. Determines how much CPU is guaranteed for the pod. + cpu: 1 + # -- (string) Memory request for the ruler pods. Determines how much memory is guaranteed for the pod. + memory: 4Gi + + store_gateway: + # -- (map) Affinity rules for scheduling store_gateway pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (map) Persistent volume configuration for the store_gateway component. + persistentVolume: + # -- (string) Size of the persistent volume to be used by the store_gateway. + size: 50Gi + # -- (int) Number of replicas for the store_gateway component. Determines how many instances to run. + replicas: 2 + resources: + # -- (map) Resource limits for the store_gateway component. + limits: + # -- (string) Memory limit for the store_gateway pods. + memory: 8Gi + # -- (map) Resource requests for the store_gateway component. + requests: + # -- (string) CPU request for the store_gateway pods. Determines how much CPU is guaranteed for the pod. + cpu: 1 + # -- (string) Memory request for the store_gateway pods. Determines how much memory is guaranteed for the pod. + memory: 6Gi + # -- (map) Topology spread constraints for the store_gateway component. Empty by default. + topologySpreadConstraints: {} + # -- (map) Zone-aware replication settings. Helps distribute data across zones. + zoneAwareReplication: + # -- (string) Topology key used for zone-aware replication. + topologyKey: 'kubernetes.io/hostname' + + nginx: + # -- (string) Affinity rules for scheduling nginx pods. Passed in as a multiline string. + affinity: | + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + image: + # -- (string) Container image registry for nginx. 
+ registry: quay.io/nginx + # -- (string) Repository for nginx unprivileged image. + repository: nginx-unprivileged + # -- (int) Number of replicas for the nginx component. Determines how many instances to run. + replicas: 3 + resources: + # -- (map) Resource limits for the nginx component. + limits: + # -- (string) Memory limit for the nginx pods. + memory: 731Mi + # -- (map) Resource requests for the nginx component. + requests: + # -- (string) CPU request for the nginx pods. Determines how much CPU is guaranteed for the pod. + cpu: 1 + # -- (string) Memory request for the nginx pods. Determines how much memory is guaranteed for the pod. + memory: 512Mi + + gateway: + # -- (map) Affinity rules for scheduling gateway pods. + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (int) Number of replicas for the gateway component. Determines how many instances to run. + replicas: 3 + resources: + # -- (map) Resource limits for the gateway component. + limits: + # -- (string) Memory limit for the gateway pods. + memory: 731Mi + # -- (map) Resource requests for the gateway component. + requests: + # -- (string) CPU request for the gateway pods. Determines how much CPU is guaranteed for the pod. + cpu: 1 + # -- (string) Memory request for the gateway pods. Determines how much memory is guaranteed for the pod. + memory: 512Mi + + + # -- (map) Loki configuration. + loki: + # -- (map) Persistence settings for loki. + persistence: + # -- (bool) Enable or disable persistence. + enabled: true + # -- (string) Service account configuration for loki. + serviceAccount: + # -- (string) Service account to use (will be created by default via this helm chart). + name: observability + gateway: + # -- (string) Affinity rules for scheduling gateway pods. Passed in as a multiline string. + affinity: | + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/zone + operator: In + values: + - us-east-1a + # -- (map) Loki ingress configuration. + ingress: + # -- (map) Annotations to add to loki ingress. + annotations: {} + ## Recommended annotations for AWS ALB (Application Load Balancer). + # alb.ingress.kubernetes.io/certificate-arn: + # alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]' + # alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600 + # alb.ingress.kubernetes.io/scheme: internal + # alb.ingress.kubernetes.io/ssl-policy: + # alb.ingress.kubernetes.io/ssl-redirect: '443' + # alb.ingress.kubernetes.io/tags: Environment= + # alb.ingress.kubernetes.io/target-type: ip + # -- (bool) Enable or disable loki ingress. + enabled: true + # -- (string) Class name for ingress. + ingressClassName: "alb" + # -- (list) Hosts for loki ingress. + hosts: + # -- (string) Hostname for loki ingress. + - host: loki.example.com + paths: + # New data structure introduced + - path: / + # Newly added optional property + pathType: Prefix + + # -- (map) Scaling and configuring loki querier. + querier: + # -- (map) Resource requests and limits for querier. 
+ resources: + # -- (map) Resource limits for the querier component. + limits: + # -- (string) Memory limit for the querier pods. + memory: 6Gi + # -- (map) Resource requests for the querier component. + requests: + # -- (string) CPU request for the querier pods. Determines how much CPU is guaranteed for the pod. + cpu: 2 + # -- (string) Memory request for the querier pods. Determines how much memory is guaranteed for the pod. + memory: 4Gi + # -- (string) Affinity rules for scheduling querier pods. Passed in as a multiline string. + affinity: | + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + + # -- (map) Scaling and configuring loki queryFrontend. + queryFrontend: + # -- (map) Resource requests and limits for queryFrontend. + resources: + # -- (map) Resource limits for the queryFrontend component. + limits: + # -- (string) Memory limit for the queryFrontend pods. + memory: 6Gi + # -- (map) Resource requests for the queryFrontend component. + requests: + # -- (string) CPU request for the queryFrontend pods. Determines how much CPU is guaranteed for the pod. + cpu: 2 + # -- (string) Memory request for the queryFrontend pods. Determines how much memory is guaranteed for the pod. + memory: 4Gi + # -- (map) Affinity rules for scheduling queryFrontend pods. Passed in as a multiline string. + affinity: | + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + + # -- (map) Scaling and configuring loki distributor. + distributor: + # -- (map) Affinity rules for scheduling distributor pods. Passed in as a multiline string. + affinity: | + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone. + - key: topology.kubernetes.io/zone + # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values. + operator: In + # -- (list) List of values for the node selector, representing allowed zones. + values: + - us-east-1a + # -- (int) Number of replicas for the distributor component. Determines how many instances to run. + replicas: 3 + # -- (int) Maximum number of unavailable replicas allowed during an update. + maxUnavailable: 2 + resources: + # -- (map) Resource limits for the distributor component. + limits: + # -- (string) Memory limit for the distributor pods. + memory: 6Gi + # -- (map) Resource requests for the distributor component. + requests: + # -- (string) CPU request for the distributor pods. Determines how much CPU is guaranteed for the pod. + cpu: 2 + # -- (string) Memory request for the distributor pods. Determines how much memory is guaranteed for the pod. 
+
+
+  # -- (map) Scaling and configuring loki ingester.
+  ingester:
+    # -- (string) Affinity rules for scheduling ingester pods. Passed in as a multiline string.
+    affinity: |
+      nodeAffinity:
+        requiredDuringSchedulingIgnoredDuringExecution:
+          nodeSelectorTerms:
+            - matchExpressions:
+                # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone.
+                - key: topology.kubernetes.io/zone
+                  # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values.
+                  operator: In
+                  # -- (list) List of values for the node selector, representing allowed zones.
+                  values:
+                    - us-east-1a
+    # -- (map) Persistent volume configuration for the ingester component.
+    persistentVolume:
+      # -- (string) Size of the persistent volume to be used by the ingester.
+      size: 50Gi
+    # -- (int) Number of replicas for the ingester component. Determines how many instances to run.
+    replicas: 3
+    # -- (int) Maximum number of unavailable replicas allowed during an update.
+    maxUnavailable: 2
+    resources:
+      # -- (map) Resource limits for the ingester component.
+      limits:
+        # -- (string) Memory limit for the ingester pods.
+        memory: 12Gi
+      # -- (map) Resource requests for the ingester component.
+      requests:
+        # -- (string) CPU request for the ingester pods. Determines how much CPU is guaranteed for the pod.
+        cpu: 3.5
+        # -- (string) Memory request for the ingester pods. Determines how much memory is guaranteed for the pod.
+        memory: 8Gi
+
+
+  # -- (map) Configuration for Loki itself (image, schema, and storage).
+  loki:
+    # -- (map) Loki image details.
+    image:
+      # -- (string) Container image registry for Loki.
+      registry: quay.io/cdis
+      # -- (string) Repository for the Loki image.
+      repository: loki
+      # -- (string) Tag for the Loki image version.
+      tag: master
+
+    # -- (map) Schema configuration for Loki.
+    schemaConfig:
+      configs:
+        - from: 2024-04-01
+          # -- (string) Storage engine used by Loki.
+          store: tsdb
+          # -- (string) Object store for Loki data (e.g., S3).
+          object_store: s3
+          # -- (string) Schema version for Loki.
+          schema: v13
+          # -- (map) Index configuration for Loki.
+          index:
+            # -- (string) Prefix for the Loki index.
+            prefix: loki_index_
+            # -- (string) Index rotation period for Loki, in hours.
+            period: 24h
+    # -- (map) Structured configuration settings for Loki.
+    structuredConfig:
+      server:
+        # -- (string) Log level for Loki server. Options include 'info', 'debug', etc.
+        log_level: debug
+      limits_config:
+        # -- (int) Maximum number of series that can be queried at once.
+        max_query_series: 30000
+        # -- (int) Maximum number of streams a single user can have.
+        max_streams_per_user: 100000
+        # -- (int) Maximum number of log entries per query.
+        max_entries_limit_per_query: 100000000
+      common:
+        # -- (string) Path prefix where Loki stores data.
+        path_prefix: /var/loki
+        storage:
+          # -- (null) Filesystem storage is disabled.
+          filesystem: null
+          s3:
+            # -- (string) AWS region for S3 storage.
+            region: us-east-1
+            # # -- (string) S3 bucket names for Loki storage.
+            # bucketnames:
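+            ## Example (hypothetical bucket name) of what this could look like once a
+            ## bucket exists in your account; the value below is only a placeholder:
+            # bucketnames: example-environment-loki-chunks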
+
+# -- (map) Grafana configuration.
+grafana:
+  # -- (bool) Deploy Grafana if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/grafana#configuration) for full values reference.
+  enabled: true
+  # -- (map) Affinity rules for scheduling Grafana pods.
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              # -- (string) Node label key for affinity. Ensures pods are scheduled on nodes in the specified zone.
+              - key: topology.kubernetes.io/zone
+                # -- (string) Operator to apply to the node selector. 'In' means the node must match one of the values.
+                operator: In
+                # -- (list) List of values for the node selector, representing allowed zones.
+                values:
+                  - us-east-1a
+  # -- (map) Init container to chown data directories for Grafana.
+  initChownData:
+    image:
+      # -- (string) Container image registry for the init container.
+      registry: quay.io/cdis
+      # -- (string) Repository for the busybox image.
+      repository: busybox
+      # -- (string) Tag for the busybox image version.
+      tag: 1.32.0
+  # -- (map) Image used to download Grafana dashboards.
+  downloadDashboardsImage:
+    # -- (string) Container image registry for the dashboard download image.
+    registry: quay.io/curl
+    # -- (string) Repository for the curl image.
+    repository: curl
+    # -- (string) Tag for the curl image version.
+    tag: 8.8.0
+
+  # -- (string) Reference a secret for environment variables.
+  envFromSecret:
+  ingress:
+    # -- (bool) Enable or disable ingress for Grafana.
+    enabled: true
+    # -- (map) Annotations for Grafana ingress.
+    annotations: {}
+
+    ## Recommended annotations for AWS ALB (Application Load Balancer).
+    # alb.ingress.kubernetes.io/ssl-redirect: '443'
+    # alb.ingress.kubernetes.io/certificate-arn:
+    # alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
+    # alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600
+    # alb.ingress.kubernetes.io/scheme: internet-facing
+    # alb.ingress.kubernetes.io/ssl-policy:
+    # alb.ingress.kubernetes.io/tags: Environment=
+    # alb.ingress.kubernetes.io/target-type: 'ip'
+    # alb.ingress.kubernetes.io/inbound-cidrs:
+    # -- (list) Hostname(s) for Grafana ingress.
+    hosts:
+      - grafana.example.com
+    # -- (string) Ingress class name to be used (e.g., 'alb' for AWS Application Load Balancer).
+    ingressClassName: "alb"
+    tls:
+      # -- (list) TLS configuration for the ingress. Reference to a secret that contains the TLS certificate.
+      - secretName: aws-load-balancer-tls
+  # -- (map) Persistence configuration for Grafana.
+  persistence:
+    # -- (bool) Enable or disable persistence for Grafana data.
+    enabled: true
+
+  # -- (map) Image configuration for Grafana.
+  image:
+    # -- (string) Container image registry for Grafana.
+    registry: quay.io/cdis
+    # -- (string) Repository for the Grafana image.
+    repository: grafana
+    # -- (string) Pull policy for the Grafana image (e.g., 'Always').
+    pullPolicy: Always
+    # -- (string) Tag for the Grafana image version.
+    tag: master
+
+  # -- (map) Environment variables for Grafana.
+  env:
+    # -- (string) Root URL configuration for the Grafana server.
+    GF_SERVER_ROOT_URL: "https://grafana.example.com"
+
+  # -- (map) Configuration for dashboard providers in Grafana.
+  dashboardProviders:
+    dashboardproviders.yaml:
+      # -- (int) API version for dashboard provider configuration.
+      apiVersion: 1
+      # -- (list) List of dashboard providers.
+      providers:
+        - name: 'grafana-dashboards-kubernetes'
+          # -- (int) Organization ID in Grafana.
+          orgId: 1
+          # -- (string) Folder where the dashboards will be placed in Grafana.
+          folder: 'Kubernetes'
+          # -- (string) Type of dashboard provider, usually 'file'.
+          type: file
+          # -- (bool) Prevent deletion of the provided dashboards.
+          disableDeletion: true
+          # -- (bool) Allow editing of the dashboards.
+          editable: true
+          # -- (map) Options for the dashboard provider.
+          options:
+            # -- (string) Path to the dashboard files.
+            path: /var/lib/grafana/dashboards/grafana-dashboards-kubernetes
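+
+  ## The provider name above ('grafana-dashboards-kubernetes') must match the key used
+  ## under 'dashboards:' below. Additional dashboards can be added under that key in the
+  ## same shape, for example (hypothetical entry; the URL is a placeholder for one of the
+  ## Gen3 dashboards from https://github.com/uc-cdis/grafana-dashboards):
+  ##   my-gen3-dashboard:
+  ##     url: https://raw.githubusercontent.com/uc-cdis/grafana-dashboards/master/<dashboard>.json
+  ##     token: ''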
+
+  # -- (map) Dashboards configuration. URLs to fetch specific Kubernetes-related Grafana dashboards.
+  # Gen3-specific dashboards can be found at https://github.com/uc-cdis/grafana-dashboards.
+  dashboards:
+    grafana-dashboards-kubernetes:
+      k8s-system-api-server:
+        # -- (string) URL to the dashboard JSON file for the Kubernetes API server.
+        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-api-server.json
+        # -- (string) Authentication token for accessing the dashboard URL (optional).
+        token: ''
+      k8s-system-coredns:
+        # -- (string) URL to the dashboard JSON file for CoreDNS in Kubernetes.
+        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-coredns.json
+        token: ''
+      k8s-views-global:
+        # -- (string) URL to the dashboard JSON file for global views in Kubernetes.
+        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-global.json
+        token: ''
+      k8s-views-namespaces:
+        # -- (string) URL to the dashboard JSON file for Kubernetes namespace views.
+        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-namespaces.json
+        token: ''
+      k8s-views-nodes:
+        # -- (string) URL to the dashboard JSON file for Kubernetes node views.
+        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-nodes.json
+        token: ''
+      k8s-views-pods:
+        # -- (string) URL to the dashboard JSON file for Kubernetes pod views.
+        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-pods.json
+        token: ''
+
+  grafana.ini:
+    # -- (map) Okta authentication settings in Grafana.
+    auth.okta:
+      # -- (bool) Enable or disable Okta authentication.
+      enabled: true
+      # -- (string) Icon used for Okta in the Grafana UI.
+      icon: okta
+      # -- (bool) Allow users to sign up automatically using Okta.
+      allow_sign_up: true
+      # -- (bool) Automatically log in users using Okta when visiting Grafana.
+      auto_login: true
+      # # -- (string) Okta client ID.
+      # client_id:
+      # # -- (string) Okta client secret.
+      # client_secret:
+      # # -- (string) Okta authorization URL.
+      # auth_url:
+      # # -- (string) Okta token URL.
+      # token_url:
+      # # -- (string) Okta API URL.
+      # api_url:
+    # -- (map) User configuration settings in Grafana.
+    users:
+      # -- (string) Auto-assign the specified role to new users upon login. Options: Viewer, Editor, Admin.
+      auto_assign_org_role: Editor
+    # -- (map) Logging configuration in Grafana.
+    log:
+      # -- (string) Logging level for Grafana. Options: debug, info, warn, error.
+      level: debug
+    # -- (map) Server configuration in Grafana.
+    server:
+      # -- (string) Domain name for the Grafana server.
+      domain: grafana.example.com
+      # -- (string) Root URL for Grafana, using the domain name.
+      root_url: "https://%(domain)s/"
+    # -- (map) Feature toggles in Grafana.
+    feature_toggles:
+      # -- (bool) Enable Single Sign-On (SSO) settings API.
+      ssoSettingsApi: true
+      # -- (bool) Enable support for transformations using variables in Grafana.
+      transformationsVariableSupport: true
+      # -- (string) Feature toggles to enable, matching the toggles set above.
+      enable: ssoSettingsApi transformationsVariableSupport
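+
+    ## For reference, the Grafana chart renders the grafana.ini map above into INI form,
+    ## roughly like the sketch below (abridged):
+    ##   [auth.okta]
+    ##   enabled = true
+    ##   ...
+    ##   [feature_toggles]
+    ##   ssoSettingsApi = true
+    ##   transformationsVariableSupport = true
+    ##   enable = ssoSettingsApi transformationsVariableSupport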
+
+  # -- (map) Gen3 built-in alerting configuration in Grafana.
+  alerting:
+    # -- (map) Alerting rules provisioning file.
+    rules.yaml:
+      # -- (int) API version for the alerting rules configuration.
+      apiVersion: 1
+      # -- (list) Groups of alerting rules.
+      groups:
+        - orgId: 1
+          # -- (string) Name of the alert group.
+          name: Alerts
+          # -- (string) Folder where the alerts will be placed in Grafana.
+          folder: Alerts
+          # -- (string) Interval at which the alert rules are evaluated.
+          interval: 5m
+          # -- (list) List of alerting rules to be defined (add specific rules here).
+          rules:
+            - uid: edwb8zgcvq96oc
+              title: HTTP 500 errors detected
+              condition: A
+              data:
+                - refId: A
+                  queryType: instant
+                  relativeTimeRange:
+                    from: 600
+                    to: 0
+                  datasourceUid: loki
+                  model:
+                    datasource:
+                      type: loki
+                      uid: loki
+                    editorMode: code
+                    expr: sum by (cluster) (count_over_time({cluster=~".+"} | json | http_status_code="500" [1h])) > 0
+                    hide: false
+                    intervalMs: 1000
+                    maxDataPoints: 43200
+                    queryType: instant
+                    refId: A
+              noDataState: OK
+              execErrState: KeepLast
+              for: 5m
+              annotations:
+                summary: 'Alert: HTTP 500 errors detected in the environment: {{`{{ $labels.cluster }}`}}'
+              labels: {}
+              isPaused: false
+              notification_settings:
+                receiver: Slack
+            - uid: adwb9vhb7irr4b
+              title: Error Logs Detected in Usersync Job
+              condition: A
+              data:
+                - refId: A
+                  queryType: instant
+                  relativeTimeRange:
+                    from: 600
+                    to: 0
+                  datasourceUid: loki
+                  model:
+                    datasource:
+                      type: loki
+                      uid: loki
+                    editorMode: code
+                    expr: sum by (cluster, namespace) (count_over_time({ app="gen3job", job_name=~"usersync-.*"} |= "ERROR - could not revoke policies from user `N/A`" [5m])) > 1
+                    hide: false
+                    intervalMs: 1000
+                    maxDataPoints: 43200
+                    queryType: instant
+                    refId: A
+              noDataState: OK
+              execErrState: KeepLast
+              for: 5m
+              annotations:
+                description: Error in usersync job detected in cluster {{`{{ $labels.cluster }}`}}, namespace {{`{{ $labels.namespace }}`}}.
+                summary: Error Logs Detected in Usersync Job
+              labels: {}
+              isPaused: false
+              notification_settings:
+                receiver: Slack
+            - uid: ddwbc12l6wc8wf
+              title: Hatchery panic in {{`{{ env.name }}`}}
+              condition: A
+              data:
+                - refId: A
+                  queryType: instant
+                  relativeTimeRange:
+                    from: 600
+                    to: 0
+                  datasourceUid: loki
+                  model:
+                    datasource:
+                      type: loki
+                      uid: loki
+                    editorMode: code
+                    expr: sum by (cluster) (count_over_time({app="hatchery"} |= "panic" [5m])) > 1
+                    hide: false
+                    intervalMs: 1000
+                    maxDataPoints: 43200
+                    queryType: instant
+                    refId: A
+              noDataState: OK
+              execErrState: KeepLast
+              for: 5m
+              annotations:
+                description: Panic detected in app {{`{{ $labels.app }}`}} within cluster {{`{{ $labels.cluster }}`}}.
+                summary: Hatchery panic
+              labels: {}
+              isPaused: false
+              notification_settings:
+                receiver: Slack
+            - uid: cdwbcbphz1zb4a
+              title: HTTP status code 431
+              condition: A
+              data:
+                - refId: A
+                  queryType: instant
+                  relativeTimeRange:
+                    from: 600
+                    to: 0
+                  datasourceUid: loki
+                  model:
+                    datasource:
+                      type: loki
+                      uid: loki
+                    editorMode: code
+                    expr: sum(count_over_time({cluster=~".+"} | json | http_status_code="431" [5m])) >= 2
+                    hide: false
+                    intervalMs: 1000
+                    maxDataPoints: 43200
+                    queryType: instant
+                    refId: A
+              noDataState: OK
+              execErrState: KeepLast
+              for: 5m
+              annotations:
+                description: Detected 431 HTTP status codes in the logs within the last 5 minutes.
+                summary: HTTP status code 431
+              labels: {}
+              isPaused: false
+              notification_settings:
+                receiver: Slack
+            - uid: bdwbck1lgwdfka
+              title: Indexd is getting an excessive amount of traffic
+              condition: A
+              data:
+                - refId: A
+                  queryType: instant
+                  relativeTimeRange:
+                    from: 600
+                    to: 0
+                  datasourceUid: loki
+                  model:
+                    datasource:
+                      type: loki
+                      uid: loki
+                    editorMode: code
+                    expr: sum by (cluster) (count_over_time({cluster=~".+", app="indexd", status="info"} [5m])) > 50000
+                    hide: false
+                    intervalMs: 1000
+                    maxDataPoints: 43200
+                    queryType: instant
+                    refId: A
+              noDataState: OK
+              execErrState: KeepLast
+              for: 5m
+              annotations:
+                description: High number of info status logs detected in the indexd service in cluster {{`{{ $labels.cluster }}`}}.
+                summary: Indexd is getting an excessive amount of traffic
+              labels: {}
+              isPaused: false
+              notification_settings:
+                receiver: Slack
+            - uid: fdwbe5t439zpcd
+              title: Karpenter Resource Mismatch
+              condition: A
+              data:
+                - refId: A
+                  queryType: instant
+                  relativeTimeRange:
+                    from: 600
+                    to: 0
+                  datasourceUid: loki
+                  model:
+                    datasource:
+                      type: loki
+                      uid: loki
+                    editorMode: code
+                    expr: |
+                      sum by (cluster) (count_over_time({namespace="karpenter", cluster=~".+"} |= "ERROR" |= "not found" |= "getting providerRef" [5m])) > 10
+                    hide: false
+                    intervalMs: 1000
+                    maxDataPoints: 43200
+                    queryType: instant
+                    refId: A
+              noDataState: OK
+              execErrState: KeepLast
+              for: 5m
+              annotations:
+                description: More than 10 errors detected in the karpenter namespace in cluster {{`{{ $labels.cluster }}`}} related to providerRef not found.
+                summary: Karpenter Resource Mismatch
+              labels: {}
+              isPaused: false
+              notification_settings:
+                receiver: Slack
+            - uid: fdwbeuftc7400c
+              title: Nginx is logging excessive " limiting requests, excess:"
+              condition: A
+              data:
+                - refId: A
+                  queryType: instant
+                  relativeTimeRange:
+                    from: 600
+                    to: 0
+                  datasourceUid: loki
+                  model:
+                    datasource:
+                      type: loki
+                      uid: loki
+                    editorMode: code
+                    expr: sum by (app, cluster) (count_over_time({app=~".+", cluster=~".+"} |= "status:error" |= "limiting requests, excess:" [5m])) > 1000
+                    hide: false
+                    intervalMs: 1000
+                    maxDataPoints: 43200
+                    queryType: instant
+                    refId: A
+              noDataState: OK
+              execErrState: KeepLast
+              for: 5m
+              annotations:
+                description: 'More than 1000 "limiting requests, excess" errors detected in service {{`{{ $labels.app }}`}} (cluster: {{`{{ $labels.cluster }}`}}) within the last 5 minutes.'
+                summary: Nginx is logging excessive " limiting requests, excess:"
+              labels: {}
+              isPaused: false
+              notification_settings:
+                receiver: Slack
+    contactpoints.yaml:
+      secret:
+        apiVersion: 1
+        contactPoints:
+          - orgId: 1
+            name: Slack
+            receivers:
+              - uid: first_uid
+                type: Slack
+                settings:
+                  url: https://hooks.slack.com/services/XXXXXXXXXX
+                  group: slack
+                  summary: |
+                    {{ `{{ include "default.message" . }}` }}
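+        ## The webhook URL above is a placeholder. One way to supply the real URL without
+        ## committing it here is an environment-specific values file layered on top of this
+        ## one, e.g. (sketch; lists are replaced wholesale, so restate the full contact point):
+        ##   grafana:
+        ##     alerting:
+        ##       contactpoints.yaml:
+        ##         secret:
+        ##           apiVersion: 1
+        ##           contactPoints:
+        ##             - orgId: 1
+        ##               name: Slack
+        ##               receivers:
+        ##                 - uid: first_uid
+        ##                   type: Slack
+        ##                   settings:
+        ##                     url: <real Slack incoming-webhook URL>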
\ No newline at end of file
diff --git a/helm/peregrine/README.md b/helm/peregrine/README.md
index 8d1e3674..8d9884c5 100644
--- a/helm/peregrine/README.md
+++ b/helm/peregrine/README.md
@@ -102,5 +102,3 @@ A Helm chart for gen3 Peregrine service
 | volumeMounts | list | `[{"mountPath":"/var/www/peregrine/settings.py","name":"config-volume","readOnly":true,"subPath":"settings.py"}]` | Volumes to mount to the container. |
 | volumes | list | `[{"emptyDir":{},"name":"shared-data"},{"name":"config-volume","secret":{"secretName":"peregrine-secret"}}]` | Volumes to attach to the container. |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/pidgin/README.md b/helm/pidgin/README.md
index 06e094e3..21914338 100644
--- a/helm/pidgin/README.md
+++ b/helm/pidgin/README.md
@@ -82,5 +82,3 @@ A Helm chart for gen3 Pidgin Service
 | strategy.rollingUpdate.maxSurge | int | `1` | Number of additional replicas to add during rollout. |
 | strategy.rollingUpdate.maxUnavailable | int | `0` | Maximum amount of pods that can be unavailable during the update. |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/portal/README.md b/helm/portal/README.md
index 91329bc4..daafacfc 100644
--- a/helm/portal/README.md
+++ b/helm/portal/README.md
@@ -101,5 +101,3 @@ A Helm chart for gen3 data-portal
 | strategy.rollingUpdate.maxUnavailable | int | `"25%"` | Maximum amount of pods that can be unavailable during the update. |
 | tolerations | list | `[]` | Tolerations to apply to the pod |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/requestor/README.md b/helm/requestor/README.md
index da178289..85792b12 100644
--- a/helm/requestor/README.md
+++ b/helm/requestor/README.md
@@ -117,5 +117,3 @@ A Helm chart for gen3 Requestor Service
 | strategy.rollingUpdate.maxUnavailable | int | `0` | Maximum amount of pods that can be unavailable during the update. |
 | volumeMounts | list | `nil` | Volumes to mount to the container. |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/revproxy/README.md b/helm/revproxy/README.md
index 8d4bb54e..59baa504 100644
--- a/helm/revproxy/README.md
+++ b/helm/revproxy/README.md
@@ -104,5 +104,3 @@ A Helm chart for gen3 revproxy
 | tolerations | list | `[]` | Tolerations to use for the deployment. |
 | userhelperEnabled | bool | `false` | |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/sheepdog/README.md b/helm/sheepdog/README.md
index 1f744fa3..afbdd189 100644
--- a/helm/sheepdog/README.md
+++ b/helm/sheepdog/README.md
@@ -110,5 +110,3 @@ A Helm chart for gen3 Sheepdog Service
 | terminationGracePeriodSeconds | int | `50` | sheepdog transactions take forever - try to let the complete before termination |
 | volumeMounts | list | `[{"mountPath":"/var/www/sheepdog/settings.py","name":"config-volume","readOnly":true,"subPath":"settings.py"}]` | Volumes to mount to the container. |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/sower/README.md b/helm/sower/README.md
index 62fd6a20..9644ad2e 100644
--- a/helm/sower/README.md
+++ b/helm/sower/README.md
@@ -181,5 +181,3 @@ A Helm chart for gen3 sower
 | volumeMounts | list | `[{"mountPath":"/sower_config.json","name":"sower-config","readOnly":true,"subPath":"sower_config.json"}]` | Volumes to mount to the container. |
 | volumes | list | `[{"configMap":{"items":[{"key":"json","path":"sower_config.json"}],"name":"manifest-sower"},"name":"sower-config"}]` | Volumes to attach to the container. |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/ssjdispatcher/README.md b/helm/ssjdispatcher/README.md
index 53df78fc..3bb1ab0a 100644
--- a/helm/ssjdispatcher/README.md
+++ b/helm/ssjdispatcher/README.md
@@ -112,5 +112,3 @@ A Helm chart for gen3 ssjdispatcher
 | volumeMounts | list | `[{"mountPath":"/credentials.json","name":"ssjdispatcher-creds-volume","readOnly":true,"subPath":"credentials.json"}]` | Volumes to mount to the container. |
 | volumes | list | `[{"name":"ssjdispatcher-creds-volume","secret":{"secretName":"ssjdispatcher-creds"}}]` | Volumes to attach to the container. |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)
diff --git a/helm/test.yaml b/helm/test.yaml
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/helm/test.yaml
@@ -0,0 +1 @@
+
diff --git a/helm/wts/README.md b/helm/wts/README.md
index 3e545b7e..f755b799 100644
--- a/helm/wts/README.md
+++ b/helm/wts/README.md
@@ -105,5 +105,3 @@ A Helm chart for gen3 workspace token service
 | serviceAccount.name | string | `""` | The name of the service account to use. If not set and create is true, a name is generated using the fullname template |
 | tolerations | list | `[]` | Tolerations for the pods |
 
------------------------------------------------
-Autogenerated from chart metadata using [helm-docs v1.11.0](https://github.com/norwoodj/helm-docs/releases/v1.11.0)