add alerts for alloy-metrics #1417

Merged
30 commits merged on Nov 12, 2024

Commits
96364d1
add sensible alerts for alloy
QuentinBisson Oct 29, 2024
9c2f655
wip - add ongoing alerts
QuentinBisson Oct 29, 2024
2f24521
Merge branch 'main' into alloy-monitoring
QuentinBisson Oct 29, 2024
55078dd
add dashboard annotation
QuentinBisson Oct 30, 2024
c111250
Merge branch 'main' into alloy-monitoring
QuentinBisson Oct 30, 2024
759ae7f
Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/…
QuentinBisson Oct 30, 2024
9cee93a
Update prometheus.rules.yml
QuentinBisson Oct 30, 2024
14d67c3
Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/…
QuentinBisson Oct 30, 2024
b9c1dea
Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/…
QuentinBisson Oct 30, 2024
09929b0
Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/…
QuentinBisson Oct 30, 2024
40452a5
add missing tests
QuentinBisson Oct 30, 2024
fbc9c8d
change based on ops-recipes
QuentinBisson Nov 4, 2024
a9713a5
Merge branch 'main' into alloy-monitoring
QuentinBisson Nov 4, 2024
9e72664
Clean up some rules a bit
QuentinBisson Nov 5, 2024
b7d53b3
Update CHANGELOG.md
QuentinBisson Nov 5, 2024
1d49161
Update helm-operations.rules.yml
QuentinBisson Nov 5, 2024
868779e
Update systemd.rules.yml
QuentinBisson Nov 5, 2024
c15aab7
Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/…
QuentinBisson Nov 5, 2024
393738d
Update helm/prometheus-rules/templates/platform/atlas/alerting-rules/…
QuentinBisson Nov 5, 2024
54f9f72
Update helm/prometheus-rules/templates/platform/honeybadger/alerting-…
QuentinBisson Nov 5, 2024
bb9abda
Update helm/prometheus-rules/templates/platform/honeybadger/alerting-…
QuentinBisson Nov 5, 2024
068d45d
Update test/tests/providers/global/platform/honeybadger/alerting-rule…
QuentinBisson Nov 5, 2024
79dbfda
Merge branch 'alloy-monitoring' into alloy-monitoring-monitoring
QuentinBisson Nov 5, 2024
2f9c07c
add alerts for alloy-metrics
QuentinBisson Nov 5, 2024
263f93b
Merge branch 'alloy-monitoring-monitoring' into alerts-for-alloy-metrics
QuentinBisson Nov 5, 2024
70095ef
Merge branch 'main' into alerts-for-alloy-metrics
QuentinBisson Nov 7, 2024
4f9e241
improve monitoring agent down tests
QuentinBisson Nov 7, 2024
553a1a4
improve monitoring agent shards not satisfied tests
QuentinBisson Nov 7, 2024
c7b460b
Update test/tests/providers/global/platform/atlas/alerting-rules/allo…
QuentinBisson Nov 7, 2024
b476eac
Update test/tests/providers/global/platform/atlas/alerting-rules/allo…
QuentinBisson Nov 7, 2024

Changes from all commits

6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `LoggingAgentDown` to be alerted when the logging agent is down.
- `LogForwardingErrors` to be alerted when the `loki.write` component is failing.
- `LogReceivingErrors` to be alerted when the `loki.source.api` components of the gateway are failing.
- `MonitoringAgentDown` to be alerted when the monitoring agent is down.
- `MonitoringAgentShardsNotSatisfied` to be alerted when the monitoring agent is missing any number of desired shards.

### Changed

@@ -23,6 +25,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `observability-gateway`
- Move all `grafana-cloud` related alerts to their own file.
- Move all alloy related alerts to the alloy alert file.
- Rename and move the following alerts as they are not specific to Prometheus:
- `PrometheusCriticalJobScrapingFailure` => `CriticalJobScrapingFailure`
- `PrometheusJobScrapingFailure` => `JobScrapingFailure`
- `PrometheusFailsToCommunicateWithRemoteStorageAPI` => `MetricForwardingErrors`

## [4.23.0] - 2024-10-30

@@ -1,5 +1,5 @@
# This file describes common alloy alerting rules
# For alerts regarding monitoring and logging agents, please go to the respective files (logging.rules.yml and monitoring.rules.yml).
# For alerts regarding the monitoring pipeline and the logging pipeline, please go to the respective files (logging-pipeline.rules.yml and monitoring-pipeline.rules.yml).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -91,3 +91,103 @@ spec:
cancel_if_cluster_status_updating: "true"
cancel_if_node_unschedulable: "true"
cancel_if_node_not_ready: "true"
- name: alloy.metrics
rules:
# This alert pages if monitoring-agent fails to send samples to its remote write endpoint.
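# The first count() selects clusters whose CAPI ControlPlaneReady condition is true
# (label_replace copies the CAPI "name" label into "cluster_id" so the result can be joined on
# that label); "unless on (cluster_id)" then removes every cluster that still has at least one
# up{job="alloy-metrics"} target reporting up, leaving only clusters whose control plane is
# ready but whose monitoring agent is not sending anything.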
- alert: MonitoringAgentDown
annotations:
description: '{{`Monitoring agent fails to send samples.`}}'
summary: Monitoring agent fails to send samples to remote write endpoint.
opsrecipe: alloy/#monitoring-agent-down
dashboard: promRW001/prometheus-remote-write
expr: |-
count(
label_replace(
capi_cluster_status_condition{type="ControlPlaneReady", status="True"},
"cluster_id",
"$1",
"name",
"(.*)"
) == 1
) by (cluster_id, installation, pipeline, provider) > 0
unless on (cluster_id) (
count(up{job="alloy-metrics"} > 0) by (cluster_id)
)
for: 20m
labels:
area: platform
severity: page
team: atlas
topic: observability
inhibit_monitoring_agent_down: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_has_no_workers: "true"
## Same as MonitoringAgentDown, but triggers inhibition earlier and does not page.
- alert: InhibitionMonitoringAgentDown
annotations:
description: '{{`Monitoring agent fails to send samples.`}}'
summary: Monitoring agent fails to send samples to remote write endpoint.
opsrecipe: alloy/#monitoring-agent-down
dashboard: promRW001/prometheus-remote-write
expr: |-
count(
label_replace(
capi_cluster_status_condition{type="ControlPlaneReady", status="True"},
"cluster_id",
"$1",
"name",
"(.*)"
) == 1
) by (cluster_id, installation, pipeline, provider) > 0
unless on (cluster_id) (
count(up{job="alloy-metrics"} > 0) by (cluster_id)
)
for: 2m
labels:
area: platform
severity: none
team: atlas
topic: observability
inhibit_monitoring_agent_down: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
## This alert pages if any of the monitoring-agent shards is not running.
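## The expression is the number of desired replicas of the alloy-metrics StatefulSet minus the
## number of ready replicas; any positive difference means at least one shard is not running.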
- alert: MonitoringAgentShardsNotSatisfied

[Review thread]
Contributor: There's no unit test for this one and its associated inhibition?
Contributor (author): I might have forgotten them
Contributor (author): I added alerts for those :)

annotations:
description: '{{`At least one of the monitoring agent shards is missing.`}}'
summary: Monitoring agent is missing shards.
opsrecipe: alloy/#monitoring-agent-down
expr: |-
kube_statefulset_status_replicas{statefulset="alloy-metrics"}
- kube_statefulset_status_replicas_ready{statefulset="alloy-metrics"}
> 0
for: 40m
labels:
area: platform
severity: page
team: atlas
topic: observability
inhibit_monitoring_agent_down: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_outside_working_hours: "true"
## Same as MonitoringAgentShardsNotSatisfied but triggers inhibition earlier, and does not page.
- alert: InhibitionMonitoringAgentShardsNotSatisfied
annotations:
description: '{{`At least one of the monitoring agent shards is missing.`}}'
summary: Monitoring agent is missing shards.
opsrecipe: alloy/#monitoring-agent-down
expr: |-
kube_statefulset_status_replicas{statefulset="alloy-metrics"}
- kube_statefulset_status_replicas_ready{statefulset="alloy-metrics"}
> 0
for: 2m
labels:
area: platform
severity: none
team: atlas
topic: observability
inhibit_monitoring_agent_down: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
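
The review thread above asks about unit tests for the shard alerts. A minimal sketch of such a test in promtool's "test rules" format is shown below; the rule file name, label values, and evaluation time are illustrative assumptions, and the repository's actual tests live under the truncated test/tests/providers/global/platform/atlas/alerting-rules/… paths referenced in the commits.

# Hypothetical promtool unit test for MonitoringAgentShardsNotSatisfied (sketch only).
rule_files:
  - alloy.rules.yml  # assumed rendered rule file name
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 3 desired shards but only 2 ready, held well past the alert's "for: 40m".
      - series: 'kube_statefulset_status_replicas{statefulset="alloy-metrics", cluster_id="test01", installation="myinstall"}'
        values: "3x60"
      - series: 'kube_statefulset_status_replicas_ready{statefulset="alloy-metrics", cluster_id="test01", installation="myinstall"}'
        values: "2x60"
    alert_rule_test:
      - alertname: MonitoringAgentShardsNotSatisfied
        eval_time: 50m
        exp_alerts:
          - exp_labels:
              area: platform
              severity: page
              team: atlas
              topic: observability
              inhibit_monitoring_agent_down: "true"
              cancel_if_cluster_status_creating: "true"
              cancel_if_cluster_status_deleting: "true"
              cancel_if_outside_working_hours: "true"
              statefulset: alloy-metrics
              cluster_id: test01
              installation: myinstall

Because the difference of the two input series stays above zero for longer than the rule's 40-minute "for" duration, the alert is expected to be firing at the 50-minute evaluation point; the earlier InhibitionMonitoringAgentShardsNotSatisfied variant could be checked the same way with an eval_time just past its 2-minute window.
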
@@ -0,0 +1,80 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
{{- include "labels.common" . | nindent 4 }}
name: monitoring-pipeline.rules
namespace: {{ .Values.namespace }}
spec:
groups:
- name: monitoring-pipeline
rules:
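# Pages when remote write has been unhealthy for an hour: samples are failing at a rate
# above 0.1/s, no samples are being sent at all, or metadata writes keep being retried.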
- alert: MetricForwardingErrors
annotations:
description: '{{`Monitoring agent can''t communicate with Remote Storage API at {{ $labels.url }}.`}}'
opsrecipe: monitoring-pipeline/
dashboard: promRW001/prometheus-remote-write
expr: |-
rate(prometheus_remote_storage_samples_failed_total[10m]) > 0.1
or rate(prometheus_remote_storage_samples_total[10m]) == 0
or rate(prometheus_remote_storage_metadata_retried_total[10m]) > 0
for: 1h
labels:
area: platform
cancel_if_outside_working_hours: "true"
severity: page
team: atlas
topic: observability
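# Notifies (without paging) when every target of a scrape job has been failing for a day,
# i.e. the ratio of down targets to all targets in the job reaches 1.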
- alert: JobScrapingFailure
annotations:
dashboard: servicemonitors-details/servicemonitors-details
description: '{{`Monitoring agents for cluster {{$labels.installation}}/{{$labels.cluster_id}} have failed to scrape all targets in the {{$labels.job}} job.`}}'
summary: Monitoring agent failed to scrape all targets in a job.
opsrecipe: monitoring-job-scraping-failure/
expr: |-
(
count(up == 0) by (job, installation, cluster_id, provider, pipeline)
/
count(up) by (job, installation, cluster_id, provider, pipeline)
) >= 1
for: 1d
labels:
area: platform
severity: notify
team: atlas
topic: observability
cancel_if_outside_working_hours: "true"
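# Pages when every target of a critical job (apiserver, kube-controller-manager, kube-scheduler,
# node-exporter, kube-state-metrics or kubelet) has been failing to scrape for 3 days.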
- alert: CriticalJobScrapingFailure
annotations:
dashboard: servicemonitors-details/servicemonitors-details
description: '{{`Monitoring agents for cluster {{$labels.installation}}/{{$labels.cluster_id}} have failed to scrape all targets in the {{$labels.job}} job.`}}'
summary: Monitoring agent failed to scrape all targets in a job.
opsrecipe: monitoring-job-scraping-failure/
## We ignore bastion hosts node exporters
expr: |-
(
count(
(
up{job=~".*(apiserver|kube-controller-manager|kube-scheduler|node-exporter|kube-state-metrics).*"}
or
up{job="kubelet", metrics_path="/metrics"}
) == 0
) by (job, installation, cluster_id, provider, pipeline)
/
count(
up{job=~".*(apiserver|kube-controller-manager|kube-scheduler|node-exporter|kube-state-metrics).*"}
or
up{job="kubelet", metrics_path="/metrics"}
) by (job, installation, cluster_id, provider, pipeline)
) >= 1
for: 3d
labels:
area: platform
severity: page
team: atlas
topic: observability
cancel_if_outside_working_hours: "true"
cancel_if_cluster_is_not_running_monitoring_agent: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"

@@ -1,3 +1,4 @@
# TODO(@giantswarm/team-atlas): revisit once vintage is gone
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -26,19 +27,6 @@ spec:
severity: page
team: atlas
topic: observability
- alert: PrometheusFailsToCommunicateWithRemoteStorageAPI
annotations:
description: '{{`Prometheus can''t communicate with Remote Storage API at {{ $labels.url }}.`}}'
opsrecipe: prometheus-cant-communicate-with-remote-storage-api/
dashboard: promRW001/prometheus-remote-write
expr: rate(prometheus_remote_storage_samples_failed_total[10m]) > 0.1 or rate(prometheus_remote_storage_samples_total[10m]) == 0 or rate(prometheus_remote_storage_metadata_retried_total[10m]) > 0
for: 1h
labels:
area: platform
cancel_if_outside_working_hours: "true"
severity: page
team: atlas
topic: observability
- alert: PrometheusRuleFailures
annotations:
description: {{`Prometheus {{$labels.installation}}/{{$labels.cluster_id}} has failed to evaluate rule(s) {{ printf "%.2f" $value }} time(s).`}}
@@ -52,48 +40,3 @@ spec:
team: atlas
topic: observability
cancel_if_outside_working_hours: "true"
- alert: PrometheusJobScrapingFailure
annotations:
description: {{`Prometheus {{$labels.installation}}/{{$labels.cluster_id}} has failed to scrape all targets in {{$labels.job}} job.`}}
summary: Prometheus fails to scrape all targets in a job.
opsrecipe: prometheus-job-scraping-failure/
expr: (count(up == 0) BY (job, installation, cluster_id, provider, pipeline) / count(up) BY (job, installation, cluster_id, provider, pipeline)) == 1
for: 1d
labels:
area: platform
severity: notify
team: atlas
topic: observability
cancel_if_outside_working_hours: "true"
- alert: PrometheusCriticalJobScrapingFailure
annotations:
description: {{`Prometheus {{$labels.installation}}/{{$labels.cluster_id}} has failed to scrape all targets in {{$labels.job}} job.`}}
summary: Prometheus fails to scrape all targets in a job.
opsrecipe: prometheus-job-scraping-failure/
## We ignore bastion hosts node exporters
expr: |-
(
count(
(
up{job=~"apiserver|kube-controller-manager|kube-scheduler|node-exporter|kube-state-metrics"}
or
up{job="kubelet", metrics_path="/metrics"}
) == 0
) BY (job, installation, cluster_id, provider, pipeline)
/
count(
up{job=~"apiserver|kube-controller-manager|kube-scheduler|node-exporter|kube-state-metrics"}
or
up{job="kubelet", metrics_path="/metrics"}
) BY (job, installation, cluster_id, provider, pipeline)
) == 1
for: 3d
labels:
area: platform
severity: page
team: atlas
topic: observability
cancel_if_outside_working_hours: "true"
cancel_if_cluster_is_not_running_monitoring_agent: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"