-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add alerts for high osd cpu usage #2306
Conversation
description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance | ||
message: High CPU usage detected in OSD container on node {{ $labels.pod}}. | ||
severity_level: warning | ||
expr: "pod:container_cpu_usage:sum{pod=~\"rook-ceph-osd-.*\"} > 0.35 \n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 0.35?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've set it to 35%, because in all of my tests the OSD cpu usage, hovered around 20%ish. in very very overloaded environments, the osds steadly increased cpu usage till 30-40%
the throughput and latency was very low during the overloaded environment, only increased when i deployed a larger cluster (also resulting in lower CPU usage)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After addressing suggested changes LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember you need also to add the related runbook. (...here the annotation label, in runbooks repo, the doc file.)
annotations: | ||
description: High CPU usage in the OSD container on node {{ $labels.pod }}. Please create more OSDs to increase performance | ||
message: High CPU usage detected in OSD container on node {{ $labels.pod}}. | ||
severity_level: warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remember to add the new "runbook_url" annotation with the link to the alert doc file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a follow-up PR, since this is a new requirement.
Signed-off-by: Divyansh Kamboj <[email protected]>
updated the PR with the changes |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: umangachapagain, weirdwiz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
c707654
into
red-hat-storage:main
This PR adds an alert, OSD is overwhelmed with a lot of performance request.
There were multiple metrics/values considered to recognise when the cluster is under load.
util%
for the disks: The value for disk utilization is not very useful for devices with support for multiple queues like ssd and nvme. even if the utilization values are very high, the storage system can "take more work on"disk latency: we can take a look at an average latency over time, but latency is highly dependent on IO_size
CPU usage: on cluster's under high load, the CPU usage of the primary OSD jumps up on a small cluster, increasing the amount of OSDs reduce the cpu usage.
in the end the PR just adds CPU usage as a factor, with a 35% usage as a threshold, which let's the user know to add more OSDs/increase the cluster size to aleviate the issue