[receiver/kubeletstats] Add k8s.container.cpu.node.utilization metric #32295
It looks like we'd be introducing more combinations of utilization metrics for different resource types. I understand one of the arguments for doing the utilization calculation in this receiver has been that users want all resource usage metrics from the same receiver, and currently users need to run both the |
From my point of view, yes :)
Wouldn't that mean that we will have the same metric potentially emitted from 2 different receivers at the same time? In general, that's mostly covered already back in #24905 (comment) when the From my perspective agreeing on unambiguous metrics (even as optional) that we can provide ootb is the goal here. If we pass the responsibility to users/backends to do such calculations/namings then we kind of start losing the vendor/backend agnostic idea. Note: the |
Based on the discussions at #32295 (comment) I have pushed a change to rename the metric to @andrzej-stencel @povilasv @TylerHelmuth @dmitryax @jinja2 please take another look. |
k8s.container.cpu.node_limit_utilization
metric k8s.container.cpu.node.utilization
metric
I think it is acceptable to have some metrics be reported by both
Tbh, I don't think this pre-computed metric adds much value for k8s users. If the goal is to make sure that users have the option of comparing usage against the node's capacity, then reporting just the capacity instead of the pre-computed utilization is a more flexible option, because it allows k8s admins and users to calculate "utilization" as they see fit for their k8s setup. I can think of a few scenarios where a k8s admin would calculate usage against the node's capacity/allocatable, but these are not in line with the metric being added here. For example, when looking at a node's utilization to determine whether more capacity needs to be added, I am more likely to sum the resource requests of all pods on the node and compare that against the node's allocatable, instead of looking at the actual usage of the containers. The actual usage kind of does not matter, because even if a container is using only 1Gi out of a 10Gi request, the node has still reserved that 10Gi and it is not usable by any other container on the node. As an application developer deploying to a k8s cluster, I am more likely to look at the utilization with respect to my pod's request/limit instead of the node's capacity. And if I have not set a limit on my container, the limit of resource such a container is technically allowed to use is the |
Thanks for the additional information! I guess users are always able to apply such calculations in their back-ends if they want to, even with the metrics that are provided today. Given this, I don't see any of these ideas really being blockers for this optional metric. For reference, this one has been used in Kibana for years now, so being able to retrieve it ootb through the Collector would only be of benefit. Having the utilization calculated against a hard limit is a valuable, non-relative indicator to observe over time, build alerts/reports on top of, etc. It can also come in handy for quickly drilling down to specific workloads while investigating infra resource utilization issues. Something not to underestimate is that query aggregations like those mentioned can sometimes become "expensive", so I see value in having the option to do such pre-calculations on the edge. I would be more than happy to see any additional ones that we see as appropriate in the future. Ultimately, providing these kinds of metrics as optional is flexible enough: users who want to calculate them on their back-ends can do so, while those who want them pre-calculated on the edge will still have the option for that.
I agree with @ChrsMark that this metric is valuable. The discussion around naming and exact meaning (computing vs. node's capacity or allocatable cpu) would have been much easier if the semantic conventions were already specified for k8s. However, I don't think we should block adding new useful metrics to the collector until there are semantic conventions for them first. This is also not what we have been doing historically. Ideally, I would have each metric created by this receiver (and other receivers) document its semantic conventions status, or whether a semantic convention even exists for the metric. That would make it easier for users to reason about a specific metric's stability. Given that the receiver as a whole is in beta stability, we would like users to be aware of the fact that not all metrics in the component might be equally stable. If we'd want this to be done, that would be a separate PR that should not block this one or others. |
**Description:**
This PR adds the `k8s.pod.cpu.node.utilization` metric. Follow up from #32295 (comment) (cc @TylerHelmuth).

**Link to tracking Issue:** Related to #27885.

**Testing:** Adjusted the respective unit test to cover this metric as well.

**Documentation:** Added

Tested with a single container Pod:
![podCpu](https://github.com/open-telemetry/opentelemetry-collector-contrib/assets/11754898/9a0069c2-7077-4944-93b6-2dde00979bf3)

---------

Signed-off-by: ChrsMark <[email protected]>
Co-authored-by: Tiffany Hrabusa <[email protected]>
Co-authored-by: Tyler Helmuth <[email protected]>
…ion` metrics (#33591)

**Description:**
Similar to #32295 and #33390, this PR adds the `k8s.{container,pod}.memory.node.utilization` metrics.

**Link to tracking Issue:** #27885

**Testing:** Added unit test.

**Documentation:** Added

### Manual testing

1. Using the following target Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "8070591Ki"
      limits:
        memory: "9070591Ki"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "800M", "--vm-hang", "4"]
```
2. ![memGood](https://github.com/open-telemetry/opentelemetry-collector-contrib/assets/11754898/fae04b30-59ca-4d70-8446-f54b5a085cf7)

On a node with 32.5G of memory, the 800M container/Pod consumes `0.8/32.5=0.0246...≈0.025` of the node's memory.

---------

Signed-off-by: ChrsMark <[email protected]>
Description:
At the moment, we calculate `k8s.container.cpu_limit_utilization` as a ratio of the container's limits at opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/internal/kubelet/cpu.go (line 30 in 867d670).
Similarly, we can calculate the CPU utilization as a ratio of the whole node's allocatable CPU if we divide the usage by the total number of the node's cores.
We can retrieve this information from the Node's `Status.Capacity`.
Performance concerns
In order to get the Node's capacity, we need a call to the k8s API to fetch the Node object.
Something to consider here is the performance impact of this extra API call. We can always choose to have this metric disabled by default and clearly state in the docs that it comes with an extra API call to get the Node of the Pods.
The good thing is that the `kubeletstats` receiver targets only one node, so I believe it's a safe assumption to fetch only the current node, because all the observed Pods will belong to that single local node. Correct me if I'm missing anything here.
In addition, instead of performing the API call explicitly on every single scrape, we can use an informer and leverage its cache. I can change this patch in this direction if we agree on this.
Would love to hear others' opinions on this.
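To illustrate the caching idea, here is a minimal, stdlib-only Go sketch. It is not the receiver's implementation: `nodeCapacityCache`, `set`, and `get` are hypothetical names, and a real version would be fed by a Kubernetes informer's event handlers rather than by a direct call in `main`. The point is only that each scrape reads a locally cached value instead of calling the API server.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeCapacityCache holds the most recently observed CPU capacity of the
// local node. It is a hypothetical stand-in for an informer-backed cache.
type nodeCapacityCache struct {
	mu       sync.RWMutex
	cpuCores float64
	known    bool
}

// set is what an informer's add/update handler would call whenever the
// Node object changes.
func (c *nodeCapacityCache) set(cpuCores float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cpuCores = cpuCores
	c.known = true
}

// get is called on every scrape; it only reads local state and never
// reaches the API server.
func (c *nodeCapacityCache) get() (float64, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.cpuCores, c.known
}

func main() {
	cache := &nodeCapacityCache{}

	// Before the first Node event arrives, the capacity is unknown.
	if _, ok := cache.get(); !ok {
		fmt.Println("capacity not yet known; skip the metric for this scrape")
	}

	// Simulated Node event reporting 8 cores.
	cache.set(8)
	cores, _ := cache.get()
	fmt.Println("cached node CPU capacity (cores):", cores)
}
```

Returning a `known` flag lets the caller skip the metric until the first Node event has been observed, instead of dividing by zero.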
Todos
✅ 1) Apply this change behind a feature gate as it was indicated at #27885 (comment)
✅ 2) Use an Informer instead of direct API calls.
Link to tracking Issue:
ref: #27885
Testing:
I experimented with this approach and the results look correct. To verify this, I deployed a stress Pod on my machine to consume a target CPU of 4 cores.
The collected `container.cpu.utilization` for that Pod's container was then at 0.5, as expected, given that my machine (the node) comes with 8 cores in total.
A unit test is also included.
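The arithmetic behind this check can be sketched in a few lines of Go. This is an illustrative sketch only: `cpuNodeUtilization` is a hypothetical helper, not the function used in the receiver, and it assumes both the usage and the node capacity are expressed in cores.

```go
package main

import (
	"errors"
	"fmt"
)

// cpuNodeUtilization returns a container's CPU usage as a fraction of the
// node's CPU capacity. Both values are expressed in cores.
// The name and signature are hypothetical, for illustration only.
func cpuNodeUtilization(usageCores, nodeCapacityCores float64) (float64, error) {
	if nodeCapacityCores <= 0 {
		return 0, errors.New("node CPU capacity must be positive")
	}
	return usageCores / nodeCapacityCores, nil
}

func main() {
	// A stress container using 4 cores on an 8-core node.
	ratio, err := cpuNodeUtilization(4, 8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.2f\n", ratio) // prints 0.50
}
```

Guarding against a non-positive capacity avoids emitting `Inf` or `NaN` when the node's capacity has not been retrieved yet.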
Documentation:
Added: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32295/files#diff-8ad3b506fb1132c961e8da99b677abd31f0108e3f9ed6999dd96ad3297b51e08