Commit

remove UpdateBucketCapacityJobTakingTooLong alert

UpdateBucketCapacityJobTakingTooLong creates noise and is not useful,
as what we truly want to monitor is multiple jobs failing in
series.

NoSuccessfulUpdateBucketCapacityJobRunIn10m already covers that case, as jobs
run one at a time, at a 4-minute interval.
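
The arithmetic behind that claim can be sketched as follows. This is a rough illustration, not part of the commit: the 10-minute window is assumed from the alert name (the rule expression is not fully shown here), the 4-minute interval comes from the message above, and the helper name is hypothetical.

```python
def scheduled_starts_in_window(window_s: int, interval_s: int) -> int:
    """Lower bound on job starts inside any window of `window_s` seconds,
    for a job that is started every `interval_s` seconds."""
    if window_s < 0 or interval_s <= 0:
        raise ValueError("window_s must be >= 0 and interval_s must be > 0")
    return window_s // interval_s


NO_SUCCESS_WINDOW_S = 10 * 60  # assumption: taken from the "...In10m" alert name
CRON_INTERVAL_S = 4 * 60       # from the commit message: one job every 4 minutes

runs = scheduled_starts_in_window(NO_SUCCESS_WINDOW_S, CRON_INTERVAL_S)
print(runs)  # 2
```

Under these assumptions, a 10-minute stretch without a successful run spans at least two scheduled runs, which is the "several runs in a row" condition the remaining alert is meant to catch.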

Issue: S3UTILS-150
Kerkesni committed Nov 15, 2023
1 parent 3f11858 commit b6bcac8
Showing 2 changed files with 0 additions and 42 deletions.
27 changes: 0 additions & 27 deletions monitoring/update-bucket-capacity-info-cronjob/alerts.test.yaml
@@ -3,33 +3,6 @@ rule_files:
   - alerts.rendered.yaml

 tests:
-  - name: Update Bucket Capacity Info CronJob Test Taking Too Long
-    interval: 1m
-    input_series:
-      - series: 'kube_job_status_start_time{job="kube-state-metrics", job_name="artesca-data-ops-update-bucket-capacity-info-1", namespace="zenko"}'
-        values: '0x9'
-      - series: 'kube_job_status_completion_time{job="kube-state-metrics", job_name="artesca-data-ops-update-bucket-capacity-info-1", namespace="zenko"}'
-        values: '_x4 240x5'
-      - series: 'kube_job_status_start_time{job="kube-state-metrics", job_name="artesca-data-ops-update-bucket-capacity-info-2", namespace="zenko"}'
-        values: '_x4 240x5'
-      - series: 'kube_job_status_completion_time{job="kube-state-metrics", job_name="artesca-data-ops-update-bucket-capacity-info-2", namespace="zenko"}'
-        values: '_x9 '
-    alert_rule_test:
-      - alertname: UpdateBucketCapacityJobTakingTooLong
-        eval_time: 4m
-        exp_alerts: []
-      - alertname: UpdateBucketCapacityJobTakingTooLong
-        eval_time: 9m
-        exp_alerts:
-          - exp_labels:
-              severity: warning
-              job_name: artesca-data-ops-update-bucket-capacity-info-1
-            exp_annotations:
-              description: |
-                Job artesca-data-ops-update-bucket-capacity-info is taking more than 240s to complete.
-                This may cause bucket capacity to be out of date and Veeam SOSAPI avalability as risk.
-              summary: update-bucket-capacity-info cronjob takes too long to finish
-
   - name: Update Bucket Capacity Info CronJob Test No Success in 10m
     interval: 1m
     input_series:
15 changes: 0 additions & 15 deletions monitoring/update-bucket-capacity-info-cronjob/alerts.yaml
@@ -16,21 +16,6 @@ x-inputs:
 groups:
   - name: update-bucket-capacity-info-cronjob/alerts.rules
     rules:
-      - alert: UpdateBucketCapacityJobTakingTooLong
-        expr: |
-          time() -
-          (sum by(job_name) (kube_job_status_failed{job_name=~"${update_bucket_capacity_info_cronjob}.*"})
-          > sum by(job_name) (kube_job_status_completion_time{job_name=~"${update_bucket_capacity_info_cronjob}.*"})
-          or sum by(job_name) (kube_job_status_completion_time{job_name=~"${update_bucket_capacity_info_cronjob}.*"}))
-          > ${update_bucket_capacity_info_job_duration_threshold}
-        labels:
-          severity: warning
-        annotations:
-          description: |
-            Job ${update_bucket_capacity_info_cronjob} is taking more than ${update_bucket_capacity_info_job_duration_threshold}s to complete.
-            This may cause bucket capacity to be out of date and Veeam SOSAPI avalability as risk.
-          summary: update-bucket-capacity-info cronjob takes too long to finish
-
       - alert: NoSuccessfulUpdateBucketCapacityJobRunIn10m
         expr: |
           time()