Add alerts for high osd cpu usage #2306

weirdwiz · 2023-12-06T11:00:37Z

This PR adds an alert for when an OSD is overwhelmed by a large volume of requests.

Several metrics were considered for recognising when the cluster is under load:

util% for the disks: disk utilization is not very useful for devices that support multiple queues, such as SSDs and NVMe drives. Even when the utilization values are very high, the storage system can still take on more work.
disk latency: we can look at average latency over time, but latency is highly dependent on IO size.
CPU usage: on a small cluster under high load, the CPU usage of the primary OSD jumps up; increasing the number of OSDs reduces the CPU usage.

In the end, the PR adds only CPU usage as a factor, with 35% usage as the threshold. The alert lets the user know to add more OSDs or increase the cluster size to alleviate the issue.

metrics/mixin/alerts/perf.libsonnet

weirdwiz · 2023-12-08T10:36:50Z

PTAL @travisn @umangachapagain @aruniiird

sp98 · 2023-12-08T12:35:55Z

metrics/deploy/prometheus-ocs-rules.yaml

+        description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance
+        message: High CPU usage detected in OSD container on node {{ $labels.pod}}.
+        severity_level: warning
+      expr: "pod:container_cpu_usage:sum{pod=~\"rook-ceph-osd-.*\"} > 0.35 \n"


I've set it to 35% because in all of my tests the OSD CPU usage hovered around 20%. In very overloaded environments, the OSDs' CPU usage steadily increased until it reached 30-40%.

Throughput and latency were very poor in the overloaded environment and only improved when I deployed a larger cluster (which also resulted in lower CPU usage).
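
For reference, the threshold discussed above might sit in a complete Prometheus alerting rule roughly like this. It is a minimal sketch, not the rule from the PR: the group name, the alert name `OSDHighCPUUsage`, and the `for` duration are assumptions made for illustration. Only the expression and the annotations are taken from the diff.

```yaml
groups:
  - name: osd-cpu-sketch            # hypothetical group name
    rules:
      - alert: OSDHighCPUUsage      # hypothetical alert name, not from the PR
        # Fires when an OSD pod's CPU usage goes above the 35% threshold.
        expr: 'pod:container_cpu_usage:sum{pod=~"rook-ceph-osd-.*"} > 0.35'
        for: 15m                    # assumed hold time, not stated in the PR
        labels:
          severity: warning
        annotations:
          description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance
          message: High CPU usage detected in OSD container on node {{ $labels.pod}}.
          severity_level: warning
```

A hold time like the assumed `for` value would keep short CPU spikes from firing the alert. Whether that is wanted depends on how quickly the overload should be reported.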

metrics/mixin/alerts/alerts.libsonnet

metrics/mixin/alerts/perf.libsonnet

jmolmo

After addressing suggested changes LGTM.

metrics/deploy/prometheus-ocs-rules.yaml

jmolmo

Remember that you also need to add the related runbook. (...here the annotation label; in the runbooks repo, the doc file.)

jmolmo · 2023-12-11T09:44:57Z

metrics/deploy/prometheus-ocs-rules.yaml

+      annotations:
+        description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance
+        message: High CPU usage detected in OSD container on node {{ $labels.pod}}.
+        severity_level: warning


Please remember to add the new "runbook_url" annotation with a link to the alert doc file.

This can be a follow-up PR, since this is a new requirement.
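
The requested annotation could look roughly like the following. This is only a sketch: the URL is a hypothetical placeholder, since the actual location of the alert doc file in the runbooks repo is not given in this thread.

```yaml
annotations:
  description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance
  message: High CPU usage detected in OSD container on node {{ $labels.pod}}.
  severity_level: warning
  # Hypothetical placeholder URL; replace with the real link to the alert doc file.
  runbook_url: "https://example.com/runbooks/alerts/<alert-name>.md"
```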

Signed-off-by: Divyansh Kamboj <[email protected]>

weirdwiz · 2023-12-11T10:56:33Z

updated the PR with the changes

openshift-ci · 2023-12-11T12:00:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: umangachapagain, weirdwiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [umangachapagain]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sp98 suggested changes Dec 8, 2023

openshift-ci bot assigned sp98 Dec 8, 2023

sp98 reviewed Dec 8, 2023

umangachapagain requested changes Dec 11, 2023

openshift-ci bot assigned umangachapagain Dec 11, 2023

jmolmo suggested changes Dec 11, 2023

jmolmo reviewed Dec 11, 2023

Add alerts for high osd cpu usage

83e3981

Signed-off-by: Divyansh Kamboj <[email protected]>

weirdwiz force-pushed the osd-alert branch from 84e4c9c to 83e3981 Compare December 11, 2023 10:44

umangachapagain approved these changes Dec 11, 2023

openshift-ci bot added the lgtm label Dec 11, 2023

openshift-ci bot added the approved label Dec 11, 2023

openshift-merge-bot bot merged commit c707654 into red-hat-storage:main Dec 11, 2023
15 checks passed
