Define semantic conventions for k8s metrics #1032

ChrsMark · 2024-05-13T08:58:13Z

Area(s)

area:k8s

Is your change request related to a problem? Please describe.

At the moment there are no Semantic Conventions for k8s metrics.

Describe the solution you'd like

Even if we cannot yet consider the k8s metrics stable, we can start adding the uncontroversial ones to make some progress here. This issue aims to collect the k8s metrics that already exist in the Collector and to keep track of any related work.
Below I'm providing an initial list of metrics coming from the kubeletstats and k8scluster receivers. Note that these are subject to change over time, so we should go back to the Collector to verify the current state.

cc: @open-telemetry/semconv-k8s-approvers

Describe alternatives you've considered

No response

Additional context

The list below also includes some metrics from namespaces other than k8s.*. I have left them in intentionally so that they are taken into account as well.

kubeletstats metrics

cpu metrics: #1489
memory metrics: #1490
filesystem metrics: #1488
network metrics: #1487 ✅
uptime metrics: #1486 ✅
volume metrics: #1485

k8scluster metrics

deployment metrics: #1636 ✅

cronjob metrics: #1660

k8s.cronjob.active_jobs

daemonset metrics: #1649 ✅

k8s.daemonset.current_scheduled_nodes
k8s.daemonset.desired_scheduled_nodes
k8s.daemonset.misscheduled_nodes
k8s.daemonset.ready_nodes

hpa metrics: #1644 ✅

k8s.hpa.max_replicas
k8s.hpa.min_replicas
k8s.hpa.current_replicas
k8s.hpa.desired_replicas

job metrics: #1660

k8s.job.active_pods
k8s.job.desired_successful_pods
k8s.job.failed_pods
k8s.job.max_parallel_pods
k8s.job.successful_pods

namespace metrics: #1668

k8s.namespace.phase

replicaset metrics: #1636 ✅

k8s.replicaset.desired
k8s.replicaset.available

replication_controller metrics: #1636 ✅

k8s.replication_controller.desired
k8s.replication_controller.available

statefulset metrics: #1637 ✅

k8s.statefulset.desired_pods
k8s.statefulset.ready_pods
k8s.statefulset.current_pods
k8s.statefulset.updated_pods

container metrics

k8s.container.cpu_request
k8s.container.cpu_limit
k8s.container.memory_request
k8s.container.memory_limit
k8s.container.storage_request
k8s.container.storage_limit
k8s.container.ephemeralstorage_request
k8s.container.ephemeralstorage_limit
k8s.container.restarts
k8s.container.ready

pod metrics

k8s.pod.phase
k8s.pod.status_reason

resource_quota metrics

k8s.resource_quota.hard_limit
k8s.resource_quota.used

node metrics

k8s.node.condition

related issue: open-telemetry/opentelemetry-collector-contrib#33760

Openshift metrics

openshift.clusterquota.limit
openshift.clusterquota.used
openshift.appliedclusterquota.limit
openshift.appliedclusterquota.used

Related issues

TBA

TylerHelmuth · 2024-05-13T17:38:15Z

I love the idea of moving forward with this work. According to the Collector end-user survey, k8s and the Collector are a big part of our end users' stacks, so moving the related semconvs forward is a great idea.

sirianni · 2024-05-16T19:34:39Z

In general, my team has been happy with the metrics collected by kubeletstatsreceiver and how they are modeled. They are struggling significantly with the "state" metrics that come from k8sclusterreceiver. We are coming from a Datadog background.

ChrsMark · 2024-07-12T12:43:14Z

I have updated the description to group metrics together in a meaningful way.
For kubeletstats the grouping is per resource type (cpu, memory, network etc).
For k8sclusterreceiver metrics the grouping is per K8s Resource type (pod, deployment etc).

I hope this makes the list less overwhelming, so that people willing to help can pick up a whole group and work on it. Maybe we could create standalone issues per group if that helps, link them here to simplify the list in this issue's description, and use this issue as a meta issue.

mx-psi · 2024-07-25T14:52:59Z

I removed this from the System Semantic Conventions WG since that WG does not handle Kubernetes-related semantic conventions.

ChrsMark · 2024-10-17T11:49:30Z

Reading through the k8scluster metrics section, I realize that ~~most~~ some of them would probably fit better as Resource Attributes?

However, in the Collector we emit them as metrics. For example: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/e7ebc6e1676aa661880a09b0ff93a9cccad8f011/receiver/k8sclusterreceiver/testdata/e2e/expected.yaml#L703-L709
In SemConv, however, it is already a resource attribute: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/resource/k8s.md#container

@povilasv @TylerHelmuth do you have more context on how/why in the collector those were implemented as metrics?

TylerHelmuth · 2024-10-17T17:26:41Z

@dmitryax might know

SylvainJuge · 2024-10-17T19:50:18Z

Hi! 👋

This issue was mentioned today during the Java SIG meeting as we now have the ability to capture "state metrics" which have a similar structure as defined in the Hardware semconv with the hw.status attribute: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

When browsing the definitions of k8sclusterreceiver I found at least one metric k8s.pod.phase where the value encodes the state (link).

Maybe using a similar modeling to what we use in Java and HW Semconv could be relevant here.
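To make the suggestion concrete, here is a minimal sketch of the difference between a value-encoded phase and a state-style metric. It is not taken from the Collector or from SemConv: the integer encoding is what I understand the k8sclusterreceiver documentation to describe for k8s.pod.phase (please verify it against the receiver's current docs), and the `state` attribute name and the `to_state_points` helper are hypothetical.

```python
# Assumed encoding of the value-encoded k8s.pod.phase metric
# (to be verified against the k8sclusterreceiver documentation).
PHASE_CODES = {1: "Pending", 2: "Running", 3: "Succeeded", 4: "Failed", 5: "Unknown"}


def to_state_points(phase_code, pod_name):
    """Expand one value-encoded phase sample into one data point per state.

    Each point carries the state as an attribute; the value is 1 for the
    current state and 0 for every other state.
    """
    current = PHASE_CODES[phase_code]
    return [
        {
            "name": "k8s.pod.phase",
            # "state" is a hypothetical attribute name used for illustration.
            "attributes": {"k8s.pod.name": pod_name, "state": state},
            "value": 1 if state == current else 0,
        }
        for state in PHASE_CODES.values()
    ]


# Hypothetical sample: pod "web-0" reported with phase code 2.
points = to_state_points(2, "web-0")
```

With this shape a backend can count pods per state by summing the values grouped by the state attribute, without having to know what the integer codes mean.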

ChrsMark · 2024-10-18T08:06:29Z

Thanks @SylvainJuge! I think we could use a similar modeling here for the .status, .phase and .condition ones.

povilasv · 2024-10-18T11:20:41Z

I think both a Resource attribute and a gauge metric that tracks historic usage and its state changes are useful.

A Resource attribute is basically the same thing as running kubectl describe pod and looking at the Restart Count: X field.
A metric, on the other hand, lets you see the history, tell when the restart happened, and maybe add an alert on it. So IMO both are useful.
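As a rough illustration of the "history" point, the sketch below recovers restart events from a series of gauge samples. The sample data and the `restart_events` helper are hypothetical and only show the idea; this is not how any particular backend implements alerting.

```python
def restart_events(samples):
    """Return (timestamp, new_count) for each increase in the restart count.

    `samples` is a list of (timestamp, restart_count) pairs sorted by time,
    as a gauge time series would provide them.
    """
    events = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur > prev:
            events.append((ts, cur))
    return events


# Hypothetical scrape history for one container: (unix seconds, restart count).
history = [(100, 0), (160, 0), (220, 1), (280, 1), (340, 3)]
events = restart_events(history)
```

A single attribute value only gives the latest count, while the series shows when each increase happened, which is what an alert rule would key on.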

sirianni · 2024-10-18T14:56:31Z

> Reading through the k8scluster metrics section I realize that most of them would probably fit better as Resource Attributes?

> A Resource attribute is basically the same thing as running kubectl describe pod and looking at the Restart Count: X field.

OTel defines three signals - metrics, logs, traces. Resource attributes are metadata attached to those signals. I don't understand the discussion above regarding "use resource attributes instead of metrics". Resource attributes are not first-class things that exist independent of the core signals.
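A simplified sketch of that relationship, using plain dataclasses rather than the real OTel SDK types (the class names here are hypothetical): the resource never travels on its own, it is always carried by a metric point, log record or span.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Metadata describing the entity that produced the telemetry.
    attributes: dict


@dataclass
class MetricPoint:
    resource: Resource
    name: str
    value: float


@dataclass
class LogRecord:
    resource: Resource
    body: str


# The same resource is attached to two different signals.
pod = Resource({"k8s.namespace.name": "default", "k8s.pod.name": "web-0"})
metric = MetricPoint(pod, "k8s.container.restarts", 2)
log = LogRecord(pod, "container started")
```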

ChrsMark · 2024-10-21T08:49:51Z

My confusion was mainly because of the k8s.container.restarts being a Resource Attribute in SemConv while also a metric in the Collector. (maybe I generalized wrongly here, apologies for that)

For this specific one what @povilasv mentioned makes sense since we can model container's restarts as a metric but it can also be used as an identifier. This happens already for logs parsed with the container parser where the container restart count is a Resource Attribute of the log record: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/container.md#add-metadata-from-file-path.

Probably we need to name these 2 differently to avoid confusion. I don't know if we have hit something similar in SemConv so far.

For the rest of the list I think we should be fine taking also into account #1032 (comment).

ChrsMark · 2024-10-22T14:38:15Z

> Reading through the k8scluster metrics section I realize that most of them would probably fit better as Resource Attributes?

> A Resource attribute is basically the same thing as running kubectl describe pod and looking at the Restart Count: X field.

> OTel defines three signals - metrics, logs, traces. Resource attributes are metadata attached to those signals. I don't understand the discussion above regarding "use resource attributes instead of metrics". Resource attributes are not first-class things that exist independent of the core signals.

In general, I see that the k8scluster receiver will start emitting Entity signals as well. These are already emitted as Logs today: open-telemetry/opentelemetry-collector-contrib#24419. But I believe this will not affect the current list of emitted metrics.

ChrsMark · 2024-11-04T16:27:51Z

> Hi! 👋

> This issue was mentioned today during the Java SIG meeting as we now have the ability to capture "state metrics" which have a similar structure as defined in the Hardware semconv with the hw.status attribute: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

> When browsing the definitions of k8sclusterreceiver I found at least one metric k8s.pod.phase where the value encodes the state (link).

> Maybe using a similar modeling to what we use in Java and HW Semconv could be relevant here.

That was discussed today in SemConv SIG meeting (Nov 4, 2024). It seems that this modeling could fit well here and in #1212. @braydonk will prepare a proposal to put this as generic guidance in Semantic Conventions (thank you Braydon :)). Based on the outcome of this we can unblock the related PRs.

dmitryax · 2024-11-17T00:30:18Z

> My confusion was mainly because of the k8s.container.restarts being a Resource Attribute in SemConv while also a metric in the Collector. (maybe I generalized wrongly here, apologies for that)

That resource attribute is used to identify a particular container instance in a pod when we scrape logs from it. Container logs are written to files with the following pattern: <namespace>_<pod_name>_<pod_uid>/<container_name>/<restart_count>.log. The last part, restart_count, identifies a particular container instance. It could be called a container run counter or something like that. This is a valid attribute.
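For illustration, the path pattern above can be parsed with a small regular expression. This is only a sketch: the `parse_log_path` helper, the example path (including its `/var/log/pods` prefix) and the pod UID are made up, and it is not the Collector's container parser.

```python
import re

# <namespace>_<pod_name>_<pod_uid>/<container_name>/<restart_count>.log
LOG_PATH = re.compile(
    r"(?P<namespace>[^_/]+)_(?P<pod_name>[^_/]+)_(?P<pod_uid>[^_/]+)"
    r"/(?P<container_name>[^/]+)/(?P<restart_count>\d+)\.log$"
)


def parse_log_path(path):
    """Extract the identifying fields from a container log file path.

    Returns a dict of fields, or None if the path does not match the pattern.
    """
    match = LOG_PATH.search(path)
    if match is None:
        return None
    fields = match.groupdict()
    fields["restart_count"] = int(fields["restart_count"])
    return fields


# Hypothetical path; the directory prefix and UID are assumptions.
info = parse_log_path(
    "/var/log/pods/default_web-0_49cc7c1f-0c1b-4f0a-9d8e-2f1a3b4c5d6e/nginx/2.log"
)
```

Here the restart count acts as an identifier for the container instance that wrote the file, not as a measurement.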

ChrsMark added the enhancement, experts needed and triage:needs-triage labels May 13, 2024

github-actions bot assigned reyang May 13, 2024

github-actions bot added the area:k8s label May 13, 2024

TylerHelmuth mentioned this issue May 16, 2024

Record pod Ready status, or all pod status conditions open-telemetry/opentelemetry-collector-contrib#32941

Open

ChrsMark mentioned this issue May 29, 2024

[receiver/kubeletstats] Add k8s.container.cpu.node.utilization metric open-telemetry/opentelemetry-collector-contrib#32295

Merged

This was referenced Jun 6, 2024

Add container.cpu.usage metric #1128

Merged

k8s: new attributes: CSI driver and volume handle #1119

Closed

ChrsMark mentioned this issue Jun 17, 2024

[chore][receiver/kubeletstats] Add ChrsMark to kubeletstats codeowners open-telemetry/opentelemetry-collector-contrib#33598

Merged

lmolkova mentioned this issue Jun 17, 2024

Guidance needed: process vs system vs container vs k8s vs runtime metrics #1161

Open

joaopgrassi added this to System Semantic Convention Working Group Jul 9, 2024

joaopgrassi removed the triage:needs-triage label Jul 9, 2024

mx-psi removed this from System Semantic Convention Working Group Jul 25, 2024

ChrsMark mentioned this issue Aug 6, 2024

Add k8s.{pod,node}.cpu.{time,usage} metrics #1320

Merged

3 tasks

ChrsMark mentioned this issue Sep 16, 2024

Add k8s.{pod,node}.memory.usage metrics #1406

Merged

3 tasks

ChrsMark mentioned this issue Sep 24, 2024

Add k8s.{pod,node}.network.{io,errors} metrics #1427

Merged

3 tasks

ChrsMark moved this to Todo in K8s SemConv SIG Oct 2, 2024

ChrsMark added this to K8s SemConv SIG Oct 2, 2024

TylerHelmuth mentioned this issue Oct 8, 2024

feat: [receiver/k8scluster] Add optional k8s.container.status.waiting metric open-telemetry/opentelemetry-collector-contrib#35668

Closed

ChrsMark mentioned this issue Oct 17, 2024

[k8s] Define semantic conventions for k8s volume metrics #1485

Open

trask mentioned this issue Oct 17, 2024

jmx state metrics open-telemetry/opentelemetry-java-instrumentation#12369

Merged

This was referenced Oct 31, 2024

Add container.health and container.status to container attributes #1515

Open

process: add process.status attribute #1212

Closed

ChrsMark mentioned this issue Nov 7, 2024

Create guideline for modeling state and phase as metrics #1554

Open
