Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add alerts for high osd cpu usage #2306

Merged
merged 1 commit into from
Dec 11, 2023

Conversation

weirdwiz
Copy link
Contributor

@weirdwiz weirdwiz commented Dec 6, 2023

This PR adds an alert, OSD is overwhelmed with a lot of performance request.

There were multiple metrics/values considered to recognise when the cluster is under load.

  1. util% for the disks: The value for disk utilization is not very useful for devices with support for multiple queues like ssd and nvme. even if the utilization values are very high, the storage system can "take more work on"

  2. disk latency: we can take a look at an average latency over time, but latency is highly dependent on IO_size

  3. CPU usage: on cluster's under high load, the CPU usage of the primary OSD jumps up on a small cluster, increasing the amount of OSDs reduce the cpu usage.

in the end the PR just adds CPU usage as a factor, with a 35% usage as a threshold, which let's the user know to add more OSDs/increase the cluster size to aleviate the issue

metrics/mixin/alerts/perf.libsonnet Outdated Show resolved Hide resolved
metrics/mixin/alerts/perf.libsonnet Outdated Show resolved Hide resolved
@weirdwiz
Copy link
Contributor Author

weirdwiz commented Dec 8, 2023

PTAL @travisn @umangachapagain @aruniiird

description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance
message: High CPU usage detected in OSD container on node {{ $labels.pod}}.
severity_level: warning
expr: "pod:container_cpu_usage:sum{pod=~\"rook-ceph-osd-.*\"} > 0.35 \n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 0.35?

Copy link
Contributor Author

@weirdwiz weirdwiz Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've set it to 35%, because in all of my tests the OSD cpu usage, hovered around 20%ish. in very very overloaded environments, the osds steadly increased cpu usage till 30-40%

the throughput and latency was very low during the overloaded environment, only increased when i deployed a larger cluster (also resulting in lower CPU usage)

metrics/mixin/alerts/alerts.libsonnet Outdated Show resolved Hide resolved
metrics/mixin/alerts/perf.libsonnet Outdated Show resolved Hide resolved
metrics/mixin/alerts/perf.libsonnet Outdated Show resolved Hide resolved
metrics/mixin/alerts/perf.libsonnet Outdated Show resolved Hide resolved
Copy link
Member

@jmolmo jmolmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After addressing suggested changes LGTM.

metrics/deploy/prometheus-ocs-rules.yaml Outdated Show resolved Hide resolved
metrics/deploy/prometheus-ocs-rules.yaml Outdated Show resolved Hide resolved
Copy link
Member

@jmolmo jmolmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember you need also to add the related runbook. (...here the annotation label, in runbooks repo, the doc file.)

annotations:
description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance
message: High CPU usage detected in OSD container on node {{ $labels.pod}}.
severity_level: warning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remember to add the new "runbook_url" annotation with the link to the alert doc file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a follow-up PR, since this is a new requirement.

Signed-off-by: Divyansh Kamboj <[email protected]>
@weirdwiz
Copy link
Contributor Author

updated the PR with the changes

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2023
Copy link
Contributor

openshift-ci bot commented Dec 11, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: umangachapagain, weirdwiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2023
@openshift-merge-bot openshift-merge-bot bot merged commit c707654 into red-hat-storage:main Dec 11, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants