Add alert rules to jupyter-controller based on the KF093 spec (#402)
* Add alert rules to jupyter-controller based on the KF093 spec

* Delete charms/jupyter-controller/src/prometheus_alert_rules/unit_unavailable.rule

* fix unit tests
rgildein authored and misohu committed Oct 8, 2024
1 parent 360b207 commit 4dff6b8
Showing 3 changed files with 28 additions and 12 deletions.
@@ -0,0 +1,24 @@
+groups:
+- name: KubeflowJupyterControllerServices
+  rules:
+  - alert: KubeflowServiceDown
+    expr: up{} < 1
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      summary: "{{ $labels.juju_charm }} service is Down ({{ $labels.juju_model }}/{{ $labels.juju_unit }})"
+      description: |
+        One or more targets of {{ $labels.juju_charm }} charm are down on unit {{ $labels.juju_model }}/{{ $labels.juju_unit }}.
+        LABELS = {{ $labels }}
+
+  - alert: KubeflowServiceIsNotStable
+    expr: avg_over_time(up{}[10m]) < 0.5
+    for: 0m
+    labels:
+      severity: warning
+    annotations:
+      summary: "{{ $labels.juju_charm }} service is not stable ({{ $labels.juju_model }}/{{ $labels.juju_unit }})"
+      description: |
+        {{ $labels.juju_charm }} unit {{ $labels.juju_model }}/{{ $labels.juju_unit }} has been unreachable at least 50% of the time over the last 10 minutes.
+        LABELS = {{ $labels }}
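
Reviewer note: to make the two expressions above concrete, here is a rough Python sketch of what they check for a single target. It is not part of the commit. The function names are hypothetical, and the list of 0/1 values stands in for the `up` samples Prometheus would see in the relevant window; real evaluation happens inside Prometheus, and the `for: 5m` clause is only approximated here as "every sample in the window is below 1".

```python
from statistics import mean


def service_is_not_stable(up_samples: list[int], threshold: float = 0.5) -> bool:
    """Approximate avg_over_time(up{}[10m]) < 0.5 for one window of samples.

    up_samples is a hypothetical list of scrape results (1 = up, 0 = down).
    """
    if not up_samples:
        # No samples means no series to compare, so nothing fires.
        return False
    return mean(up_samples) < threshold


def service_down(up_samples: list[int]) -> bool:
    """Approximate `up{} < 1` with `for: 5m`: every sample in the window is down."""
    return bool(up_samples) and all(sample < 1 for sample in up_samples)


if __name__ == "__main__":
    print(service_is_not_stable([1, 0, 0, 0]))  # average 0.25
    print(service_is_not_stable([1, 1, 0, 0]))  # average exactly 0.5
    print(service_down([0, 0, 0]))
```

Because the comparison is a strict `<`, a target that was up for exactly half of the window (average 0.5) does not trigger the warning in this sketch, even though the description text says "at least 50%".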

This file was deleted.

6 changes: 4 additions & 2 deletions charms/jupyter-controller/tests/unit/test_operator.py
@@ -115,9 +115,11 @@ def test_prometheus_data_set(self, harness: Harness, mocker):
         with open("src/prometheus_alert_rules/model_errors.rule") as f:
             file_alert = yaml.safe_load(f.read())
             test_alerts.append(file_alert["alert"])
-        with open("src/prometheus_alert_rules/unit_unavailable.rule") as f:
+        with open("src/prometheus_alert_rules/KubeflowJupyterControllerServices.rules") as f:
             file_alert = yaml.safe_load(f.read())
-            test_alerts.append(file_alert["alert"])
+            # there 2 alert rules in host_resources.rules
+            for rule in file_alert["groups"][0]["rules"]:
+                test_alerts.append(rule["alert"])
 
         # alert rules
         alert_rules = json.loads(
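
Reviewer note: the test now has to read two file shapes, a single-rule file with a top-level `alert` key and a grouped file with `groups[...].rules[...]`. A small helper could hide that difference. The sketch below is a suggestion only and not part of the commit; `collect_alert_names` is a hypothetical name, and it takes an already parsed document (for example the result of `yaml.safe_load`). Unlike the test, which reads only `groups[0]`, it walks every group.

```python
def collect_alert_names(rule_doc: dict) -> list[str]:
    """Return alert names from a parsed rule file in either layout.

    Grouped layout:  {"groups": [{"name": ..., "rules": [{"alert": ...}, ...]}]}
    Single layout:   {"alert": ..., "expr": ...}
    """
    if "groups" in rule_doc:
        return [
            rule["alert"]
            for group in rule_doc["groups"]
            for rule in group.get("rules", [])
            if "alert" in rule  # skip recording rules, which have no "alert" key
        ]
    return [rule_doc["alert"]]


if __name__ == "__main__":
    grouped = {
        "groups": [
            {
                "name": "KubeflowJupyterControllerServices",
                "rules": [
                    {"alert": "KubeflowServiceDown", "expr": "up{} < 1"},
                    {"alert": "KubeflowServiceIsNotStable",
                     "expr": "avg_over_time(up{}[10m]) < 0.5"},
                ],
            }
        ]
    }
    print(collect_alert_names(grouped))
```

With such a helper, each `with open(...)` block in the test could reduce to `test_alerts.extend(collect_alert_names(file_alert))`, regardless of which layout the file uses.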