From 2af78dbb08bdfa77c44c54b1b86682f933f2ab82 Mon Sep 17 00:00:00 2001
From: braydonk
Date: Tue, 26 Nov 2024 14:58:39 +0000
Subject: [PATCH 1/7] System Semantic Conventions Non-Normative Guidance

This PR adds non-normative guidance from the System Semantic Conventions Working Group. This is added in a new `groups` folder in `non-normative`, and a `system` subfolder in `groups`. The docs written here were already discussed in a Google doc where we were originally collaborating on this, a link to which can be shared directly if needed.
---
 .../groups/system/design-philosophy.md        | 94 +++++++++++++++++++
 docs/non-normative/groups/system/use-cases.md | 77 +++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 docs/non-normative/groups/system/design-philosophy.md
 create mode 100644 docs/non-normative/groups/system/use-cases.md

diff --git a/docs/non-normative/groups/system/design-philosophy.md b/docs/non-normative/groups/system/design-philosophy.md
new file mode 100644
index 0000000000..ec73bfe5eb
--- /dev/null
+++ b/docs/non-normative/groups/system/design-philosophy.md

# **System Semantic Conventions: Instrumentation Design Philosophy**

The System Semantic Conventions are caught in a strange dichotomy that is unique among other semconv groups. While we want to make sure we cover obvious generic use cases, monitoring system health is a very old practice with lots of different existing strategies. While we can cover the basic use cases in cross platform ways, we want to make sure that users who specialize in certain platforms aren't left in the lurch; if users aren't given recommendations for particular types of data that isn't cross-platform and universal,they may come up with their own disparate ideas for how that instrumentation should look, leading to the kind of fracturing that the semantic conventions should be in place to avoid.

The following sections address some of the most common instrumentation design questions, and how we as a working group have opted to address them. In some cases they are unique to the common semantic conventions guidance due to our unique circumstance, and those cases will be called out specifically.

## Namespaces

Relevant discussions: [\#1161](https://github.com/open-telemetry/semantic-conventions/issues/1161)

The System Semantic Conventions generally cover the following namespaces:

* `system`
* `process`
* `host`
* `memory`
* `network`
* `disk`
* `os`

Deciding on the namespace of a metric/attribute is generally informed by the following belief:

**The namespace of a metric/attribute should logically map to the Operating System concept being considered the instrumentation source.**

The most obvious example of this is with language runtime metrics and `process` namespace metrics. Many of these metrics are very similar; most language runtimes provide some manner of `cpu.time`, `memory.usage` and similar metrics. If we were considering de-duplication as the top value in our design, it would follow that `process.cpu.time` and `process.memory.usage` should simply be referenced by any language runtime that might produce those metrics. However, as a working group we believe it is important that `process` namespace and runtime namespace metrics remain separate, because `process` metrics are meant to represent an **OS-level process as the instrumentation source**, whereas runtime metrics represent **the language runtime as the instrumentation source**.
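To make the distinction concrete, here is a minimal sketch of an OS-level process observed as its own instrumentation source. It is illustrative only, assuming the `psutil` library and the OpenTelemetry Python metrics API; a runtime would instead report under its own namespace (for example, a JVM reporting `jvm.cpu.time`) even though the underlying accounting is similar.

```python
# Sketch only: an OS-level process as the instrumentation source.
# Assumes `pip install psutil opentelemetry-api`.
import os

import psutil
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("system-semconv-sketch")
proc = psutil.Process(os.getpid())


def observe_process_cpu(options: CallbackOptions):
    # The values come from OS accounting data, which is why this metric
    # belongs in the `process.*` namespace rather than a runtime namespace.
    times = proc.cpu_times()
    yield Observation(times.user, {"cpu.mode": "user"})
    yield Observation(times.system, {"cpu.mode": "system"})


meter.create_observable_counter(
    "process.cpu.time",
    callbacks=[observe_process_cpu],
    unit="s",
    description="CPU time consumed by the OS-level process",
)
```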
In some cases this is simply a matter of making the instrumentation's purpose as clear as possible, but there are cases where attempts to share definitions across distinct instrumentation sources pose the potential for a clash. The concrete example of a time we accepted this consequence is with `cpu.mode`; the decision was to [unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attributed](https://github.com/open-telemetry/semantic-conventions/issues/1139). The consequence of this is that `cpu.mode` needs to have a broad enum in its root definition, with special exemptions in each different `ref` of `cpu.mode`, since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs `system.cpu.time` etc. has different subsets of the overall enum values. We decided as a group to accept the consequence in this case, however it isn't something we're keen on dealing with all over system semconv, as the instrumentation ends up polluted with so many edge cases in each namespace that it defeats the purpose of sharing the attribute in the first place.

## Two Class Design Strategy

Relevant discussions: [\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634)

We are considering two personas for system semconv instrumentation. If we have a piece of instrumentation, we decide which persona it is meant for and use that to make the decision for how we should name/treat that piece of instrumentation.

### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access

When instrumentation is meant for the General Class, we will strive to make the names and examples as prescriptive as possible. This instrumentation is what will drive the most important use cases we really want to cover with the system semantic conventions. Things like dashboards, alerts, and broader o11y setup tutorials will largely feature General Class instrumentation covering the [basic use cases](./use-cases.md) we have laid out as a group. We want this instrumentation to be very clear exactly how and when they should be used. Bucket 1 instrumentation will be recommended as **on by default**.

### Specialist Class: A more specific use case that specialists could enable to get more in-depth information that they already understand how to use

When instrumentation falls into the Specialist Class, we are assuming the target audience is already familiar with the concept and knows exactly what they are looking for and why. The goal for Specialist Class instrumentation is to ensure that users who have very specific and detailed needs are still covered by our semantic conventions so they don't need to go out of their way coming up with their own, risking the same kind of disparate instrumentation problem that semantic conventions are intended to solve.
The main differences in how we handle Specialist Class instrumentation are:

1. The names and resulting values will map directly to what a user would expect hunting down the information themselves. We will rarely be prescriptive in how the information should be used or how it should be broken down. For example, a metric to represent a process's cgroup would have the resulting value match exactly to what the result would be if the user called `cat /proc/PID/cgroup`.
2. If a metric is specific to a particular operating system, the operating system will be in the name.
See Operating System in names (TODO: Link section) for more information. For example, a metric for a process's cgroup would be `process.linux.cgroup`, given that cgroups are a specific Linux kernel feature.

### Examples

Some General Class examples:

* Memory/CPU usage and utilization metrics
* General disk and network metrics
* Universal system/process information (names, identifiers, basic specs)

Some Specialist Class examples:

* Particular Linux features like special process/system information in procfs (see things like [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html))
* Particular Windows features like special process information (see things like [Windows Handles](https://learn.microsoft.com/en-us/windows/win32/sysinfo/about-handles-and-objects), [Process Working Set](https://learn.microsoft.com/en-us/windows/win32/procthread/process-working-set))
* Niche process information like open file descriptors, page faults, etc.

## Operating System in names

Relevant discussions: [\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255)

Monitoring operating systems is an old practice, and there are lots of ways to skin the cat within different platforms. There are lots of metrics, even in basic stuff like memory usage, where there are platform specific pieces of information that are valuable to those who really specialize in that platform.

Thus we have decided that any instrumentation that is:

1. Specific to a particular operating system
2. Not meant to be part of what we consider our most important general use cases

Will have the Operating System name as part of the namespace. For example, there may be `process.linux`, `process.windows`, or `process.posix` names for metrics and attributes. We will not have root `linux.*`, `windows.*`, or `posix.*` namespaces. This is because of the principle we’re trying to uphold from the [Namespaces section](https://docs.google.com/document/d/1fCHZQemLun7qh5y--seBagPQZRuy_SuQJ2SOb2n2ZzU/edit?resourcekey=0-AZdnzcIOietd-cq6sGy-IA&tab=t.uq19eerhwz7#bookmark=id.2rkukcwjxprh); we still want the instrumentation source to be represented by the root namespace of the attribute/metric. If we had OS root namespaces, different sources like `system`, `process`, etc. could get very tangled within each OS namespace, defeating the intended design philosophy.

## Case study: `process.cgroup`

Relevant discussions: [\#1357](https://github.com/open-telemetry/semantic-conventions/issues/1357), [\#1364 (particular thread)](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1730743509)

In the `hostmetricsreceiver`, there is a Resource Attribute called `process.cgroup`. How should this metric be adopted in System Semantic Conventions?

### Which class does this fall under?

Based on our definitions, this attribute would fall under Bucket 2:

* `cgroups` are a Linux-specific feature
* It is not directly part of any of the default out-of-the-box use cases we want to cover

### What should it be named?

Since it is in Bucket 2, and is a Linux-specific feature, it follows that this attribute should be named `process.linux.cgroup`.

### What should the value be?

Since it is Bucket 2, we don't want to be too prescriptive.
A user who needs to know the `cgroup` of a process likely already has a pretty good idea of how to interpret it and use it further, and it would not be worth it for this Working Group to try and come up with every possible edge case for how it might be used. It is much simpler for this attribute as it falls under our purview to simply reflect the value from the OS, i.e. the same thing as `cat /proc/PID/cgroup`. With cgroups in particular, there is high likelihood that more specialized semconv instrumentation could be developed, particularly in support of more specialized container runtime or systemd instrumentation. It's more useful for a working group that would develop special instrumentation that leverages cgroups would be more prescriptive about how the cgroup information should be interpreted and broken down with more specificity.

diff --git a/docs/non-normative/groups/system/use-cases.md b/docs/non-normative/groups/system/use-cases.md
new file mode 100644
index 0000000000..ff38987e23
--- /dev/null
+++ b/docs/non-normative/groups/system/use-cases.md

# **System Semantic Conventions: General Use Cases**

This document is a collection of the use cases that we want to cover with the System Semantic Conventions. The use cases outlined here inform the working group’s decisions around what instrumentation is considered **required**.
Use cases in this document will be stated in a generic way that does not refer to any potentially existing instrumentation in semconv as of writing, such that when we do dig into specific instrumentation, we understand their importance based on our holistic view of expected use cases.

## *Legend*

`General Information` = The information that should be discoverable either through the entity, metrics, or metric attributes.

`Dashboard` = The information that should be attainable through metrics to create a comprehensive dashboard.

`Alerts` = Some examples of common alerts that should be creatable with the available information.

## **Host**

A user should be able to monitor the health of a host, including monitoring resource consumption, unexpected errors due to resource exhaustion or malfunction of core components of a host or fleet of hosts (network stack, memory, CPU…).

### General Information

* Machine name
* ID (relevant to its context, could be a cloud provider ID or just base machine ID)
* OS information (platform, version, architecture, etc.)
* Number of CPU cores
* Memory Capacity

### Dashboard

* Memory utilization
* CPU utilization
* Disk utilization
* Disk throughput
* Network traffic

### Alerts

* VM is up
* Memory/CPU/Disk utilization goes above a % threshold
* Network activity spikes unexpectedly

## Notes

* The alerts in particular should be capable of being uniformly applied to an entire fleet of hosts
* The user may be monitoring a virtualization host i.e. VMWare or Proxmox, and the instrumentation to monitor the health of the root host and the virtual machines it's spawned can be largely the same

## **Process**

A user should be able to monitor the health of an arbitrary process using data provided by the OS.
Reasons a user may want this:

1. The process they want to monitor isn't covered by more specific semconv instrumentation such as language runtime metrics, db, http, etc.
2. They are monitoring lots of processes and want to have a set of uniform instrumentation for all of them.
3.
Personal preference/legacy reasons; they might already be using OS signals to monitor stuff and it's an easier lift for them to move to basic process instrumentation, then move to other specific semconv over time.

### General Information

* Process name
* PID
* User/owner

### Dashboard

* Physical Memory usage and/or utilization
* Virtual Memory usage
* CPU usage and/or utilization
* Disk throughput
* Network throughput

### Alert

* Process stops
* Memory/CPU usage/utilization goes above a threshold
* Memory rises monotonically over a period of time (memory leak detection)

### Notes

* Unless the OS provides the utilization data directly, the utilization requires calculation. Process instrumentation would need to be associated with a host entity that contains data about its memory capacity for utilization metrics to be calculated.
* Process instrumentation can also be used as data for benchmark evaluations, collecting the data for a period of time and evaluating the timeseries to get benchmarking/overhead insights about the process.

From 6900c91cd3e146273f8d51d469d382966d149ef1 Mon Sep 17 00:00:00 2001
From: braydonk
Date: Tue, 26 Nov 2024 15:03:00 +0000
Subject: [PATCH 2/7] add new docs folder to CODEOWNERS
---
 .github/CODEOWNERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 6f7ec8d555..3e4d55af66 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS

/model/os/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers
/model/process/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers @open-telemetry/semconv-security-approvers
/model/system/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers
/docs/non-normative/groups/system @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers

# Mobile semantic conventions
/docs/mobile/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-mobile-approvers

From 69a97da91a82c7f3f7b0d872ab8ea4509cd4a82b Mon Sep 17 00:00:00 2001
From: braydonk
Date: Tue, 26 Nov 2024 15:07:19 +0000
Subject: [PATCH 3/7] change old Bucket verbiage to Class
---
 docs/non-normative/groups/system/design-philosophy.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/non-normative/groups/system/design-philosophy.md b/docs/non-normative/groups/system/design-philosophy.md
index ec73bfe5eb..bb9d352c4e 100644
--- a/docs/non-normative/groups/system/design-philosophy.md
+++ b/docs/non-normative/groups/system/design-philosophy.md

We are considering two personas for system semconv instrumentation. If we have a piece of instrumentation, we decide which persona it is meant for and use that to make the decision for how we should name/treat that piece of instrumentation.

### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access

When instrumentation is meant for the General Class, we will strive to make the names and examples as prescriptive as possible. This instrumentation is what will drive the most important use cases we really want to cover with the system semantic conventions. Things like dashboards, alerts, and broader o11y setup tutorials will largely feature General Class instrumentation covering the [basic use cases](./use-cases.md) we have laid out as a group. We want this instrumentation to be very clear exactly how and when they should be used. Bucket 1 instrumentation will be recommended as **on by default**.
+When instrumentation is meant for the General Class, we will strive to make the names and examples as prescriptive as possible. This instrumentation is what will drive the most important use cases we really want to cover with the system semantic conventions. Things like dashboards, alerts, and broader o11y setup tutorials will largely feature General Class instrumentation covering the [basic use cases](./use-cases.md) we have laid out as a group. We want this instrumentation to be very clear exactly how and when they should be used. General Class instrumentation will be recommended as **on by default**. ### Specialist Class: A more specific use case that specialists could enable to get more in-depth information that they already understand how to use @@ -80,15 +80,15 @@ In the `hostmetricsreceiver`, there is a Resource Attribute called `process.cgro ### Which class does this fall under? -Based on our definitions, this attribute would fall under Bucket 2: +Based on our definitions, this attribute would fall under Specialist Class: * `cgroups` are a Linux-specific feature * It is not directly part of any of the default out-of-the-box usecases we want to cover ### What should it be named? -Since it is in Bucket 2, and is a Linux-specific feature, it follows that this attribute should be named `process.linux.cgroup`. +Since this metric falls under Specialist Class, and is a Linux-specific feature, it follows that this attribute should be named `process.linux.cgroup`. ### What should the value be? -Since it is Bucket 2, we don't want to be too prescriptive. A user who needs to know the `cgroup` of a process likely already has a pretty good idea of how to interpret it and use it further, and it would not be worth it for this Working Group to try and come up with every possible edge case for how it might be used. It is much simpler for this attribute as it falls under our purview to simply reflect the value from the OS, i.e. the same thing as `cat /proc/PID/cgroup`. With cgroups in particular, there is high likelihood that more specialized semconv instrumentation could be developed, particularly in support of more specialized container runtime or systemd instrumentation. It's more useful for a working group that would develop special instrumentation that leverages cgroups would be more prescriptive about how the cgroup information should be interpreted and broken down with more specificity. +Since this metric falls under Specialist Class, we don't want to be too prescriptive. A user who needs to know the `cgroup` of a process likely already has a pretty good idea of how to interpret it and use it further, and it would not be worth it for this Working Group to try and come up with every possible edge case for how it might be used. It is much simpler for this attribute as it falls under our purview to simply reflect the value from the OS, i.e. the same thing as `cat /proc/PID/cgroup`. With cgroups in particular, there is high likelihood that more specialized semconv instrumentation could be developed, particularly in support of more specialized container runtime or systemd instrumentation. It's more useful for a working group that would develop special instrumentation that leverages cgroups would be more prescriptive about how the cgroup information should be interpreted and broken down with more specificity. 
From 2183af10d237537a263f2e125ab119ca55e58bf4 Mon Sep 17 00:00:00 2001 From: braydonk Date: Wed, 27 Nov 2024 14:17:35 +0000 Subject: [PATCH 4/7] address typo and nit comments --- docs/non-normative/groups/system/design-philosophy.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/non-normative/groups/system/design-philosophy.md b/docs/non-normative/groups/system/design-philosophy.md index bb9d352c4e..636c777ca9 100644 --- a/docs/non-normative/groups/system/design-philosophy.md +++ b/docs/non-normative/groups/system/design-philosophy.md @@ -1,6 +1,6 @@ -# **System Semantic Conventions: Instrumentation Design Philosophy** +# System Semantic Conventions: Instrumentation Design Philosophy -The System Semantic Conventions are caught in a strange dichotomy that is unique among other semconv groups. While we want to make sure we cover obvious generic use cases, monitoring system health is a very old practice with lots of different existing strategies. While we can cover the basic use cases in cross platform ways, we want to make sure that users who specialize in certain platforms aren't left in the lurch; if users aren't given recommendations for particular types of data that isn't cross-platform and universal,they may come up with their own disparate ideas for how that instrumentation should look, leading to the kind of fracturing that the semantic conventions should be in place to avoid. +The System Semantic Conventions are caught in a strange dichotomy that is unique among other semconv groups. While we want to make sure we cover obvious generic use cases, monitoring system health is a very old practice with lots of different existing strategies. While we can cover the basic use cases in cross platform ways, we want to make sure that users who specialize in certain platforms aren't left in the lurch; if users aren't given recommendations for particular types of data that isn't cross-platform and universal, they may come up with their own disparate ideas for how that instrumentation should look, leading to the kind of fracturing that the semantic conventions should be in place to avoid. The following sections address some of the most common instrumentation design questions, and how we as a working group have opted to address them. In some cases they are unique to the common semantic conventions guidance due to our unique circumstance, and those cases will be called out specifically. @@ -43,7 +43,7 @@ When instrumentation falls into the Specialist Class, we are assuming the target The main differences in how we handle Speciialist Class instrumentation are: 1. The names and resulting values will map directly to what a user would expect hunting down the information themselves. We will rarely be prescriptive in how the information should be used or how it should be broken down. For example, a metric to represent a process's cgroup would have the resulting value match exactly to what the result would be if the user called `cat /proc/PID/cgroup`. -2. If a metric is specific to a particular operating system, the operating system will be in the name. See Operating System in names (TODO: Link section) for more information. For example, a metric for a process's cgroup would be `process.linux.cgroup`, given that cgroups are a specific Linux kernel feature. +2. If a piece of instrumentation is specific to a particular operating system, the name of the operating system will be in the instrumentation name. 
See [Operating System in names](#operating-system-in-names) for more information. For example, a metric for a process's cgroup would be `process.linux.cgroup`, given that cgroups are a specific Linux kernel feature. ### Examples @@ -56,7 +56,7 @@ Some General Class examples: Some Specialist Class examples: * Particular Linux features like special process/system information in procfs (see things like [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html)) -* Particular Windows features like special process information (see things like [Windows Handles](https://learn.microsoft.com/en-us/windows/win32/sysinfo/about-handles-and-objects), [Process Working Set](https://learn.microsoft.com/en-us/windows/win32/procthread/process-working-set)) +* Particular Windows features like special process information (see things like [Windows Handles](https://learn.microsoft.com/windows/win32/sysinfo/about-handles-and-objects), [Process Working Set](https://learn.microsoft.com/windows/win32/procthread/process-working-set)) * Niche process information like open file descriptors, page faults, etc. ## Operating System in names @@ -70,7 +70,7 @@ Thus we have decided that any instrumentation that is: 1. Specific to a particular operating system 2. Not meant to be part of what we consider our most important general use cases -Will have the Operating System name as part of the namespace. For example, there may be `process.linux`, `process.windows`, or `process.posix` names for metrics and attributes. We will not have root `linux.*`, `windows.*`, or `posix.*` namespaces. This is because of the principle we’re trying to uphold from the [Namespaces section](https://docs.google.com/document/d/1fCHZQemLun7qh5y--seBagPQZRuy_SuQJ2SOb2n2ZzU/edit?resourcekey=0-AZdnzcIOietd-cq6sGy-IA&tab=t.uq19eerhwz7#bookmark=id.2rkukcwjxprh); we still want the instrumentation source to be represented by the root namespace of the attribute/metric. If we had OS root namespaces, different sources like `system`, `process`, etc. could get very tangled within each OS namespace, defeating the intended design philosophy. +Will have the Operating System name as part of the namespace. For example, there may be `process.linux`, `process.windows`, or `process.posix` names for metrics and attributes. We will not have root `linux.*`, `windows.*`, or `posix.*` namespaces. This is because of the principle we’re trying to uphold from the [Namespaces section](#namespaces); we still want the instrumentation source to be represented by the root namespace of the attribute/metric. If we had OS root namespaces, different sources like `system`, `process`, etc. could get very tangled within each OS namespace, defeating the intended design philosophy. 
## Case study: `process.cgroup`

From 3c3c38500c0ac9a27a691ab0d90fb2706deed068 Mon Sep 17 00:00:00 2001
From: braydonk
Date: Wed, 27 Nov 2024 14:23:49 +0000
Subject: [PATCH 5/7] add additional relevant discussion link
---
 docs/non-normative/groups/system/design-philosophy.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/non-normative/groups/system/design-philosophy.md b/docs/non-normative/groups/system/design-philosophy.md
index 636c777ca9..9c49d37055 100644
--- a/docs/non-normative/groups/system/design-philosophy.md
+++ b/docs/non-normative/groups/system/design-philosophy.md

## Operating System in names

Relevant discussions: [\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255), [\#1364](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1852465994)

Monitoring operating systems is an old practice, and there are lots of ways to skin the cat within different platforms.

From 487af83f99a9ab8eeeb937e6cfac28638114cd8b Mon Sep 17 00:00:00 2001
From: braydonk
Date: Thu, 19 Dec 2024 14:51:05 +0000
Subject: [PATCH 6/7] address review comments
---
 .../groups/system/design-philosophy.md        | 65 ++++++++++++++-----
 docs/non-normative/groups/system/use-cases.md | 25 ++++---
 2 files changed, 63 insertions(+), 27 deletions(-)

diff --git a/docs/non-normative/groups/system/design-philosophy.md b/docs/non-normative/groups/system/design-philosophy.md
index 9c49d37055..aabcf8e5de 100644
--- a/docs/non-normative/groups/system/design-philosophy.md
+++ b/docs/non-normative/groups/system/design-philosophy.md

The System Semantic Conventions generally cover the following namespaces:

* `system`
* `process`
* `host`
* `memory`
* `network`
* `disk`
* `os`

Deciding on the namespace of a metric/attribute is generally informed by the following belief:

**The namespace of a metric/attribute should logically map to the Operating System concept being considered as the instrumentation source.**

The most obvious example of this is with language runtime metrics and `process` namespace metrics. Many of these metrics are very similar; most language runtimes provide some manner of `cpu.time`, `memory.usage` and similar metrics. If we were considering de-duplication as the top value in our design, it would follow that `process.cpu.time` and `process.memory.usage` should simply be referenced by any language runtime that might produce those metrics. However, as a working group we believe it is important that `process` namespace and runtime namespace metrics remain separate, because `process` metrics are meant to represent an **OS-level process as the instrumentation source**, whereas runtime metrics represent **the language runtime as the instrumentation source**.

In some cases this is simply a matter of making the instrumentation's purpose as clear as possible, but there are cases where attempts to share definitions across distinct instrumentation sources pose the potential for a clash.
The concrete example of a time we accepted this consequence is with `cpu.mode`; the decision was to [unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attributed](https://github.com/open-telemetry/semantic-conventions/issues/1139). The consequence of this is that `cpu.mode` needs to have a broad enum in its root definition, with special exemptions in each different `ref` of `cpu.mode`, since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs `system.cpu.time` etc. has different subsets of the overall enum values. We decided as a group to accept the consequence in this case, however it isn't something we're keen on dealing with all over system semconv, as the instrumentation ends up polluted with so many edge cases in each namespace that it defeats the purpose of sharing the attribute in the first place. +In some cases this is simply a matter of making the instrumentation's purpose as clear as possible, but there are cases where attempts to share definitions across distinct instrumentation sources poses the potential for a clash. The concrete example of a time we accepted this consequence is with `cpu.mode`; the decision was to [unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attribute](https://github.com/open-telemetry/semantic-conventions/issues/1139). The consequence of this is that `cpu.mode` needs to have a broad enum in its root definition, with special exemptions in each different `ref` of `cpu.mode`, since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs `system.cpu.time` etc. has different subsets of the overall enum values. We decided as a group to accept the consequence in this case, however it isn't something we're keen on dealing with all over system semconv, as the instrumentation ends up polluted with so many edge cases in each namespace that it defeats the purpose of sharing the attribute in the first place. ## Two Class Design Strategy @@ -35,7 +35,7 @@ We are considering two personas for system semconv instrumentation. If we have a ### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access -When instrumentation is meant for the General Class, we will strive to make the names and examples as prescriptive as possible. This instrumentation is what will drive the most important use cases we really want to cover with the system semantic conventions. Things like dashboards, alerts, and broader o11y setup tutorials will largely feature General Class instrumentation covering the [basic use cases](./use-cases.md) we have laid out as a group. We want this instrumentation to be very clear exactly how and when they should be used. General Class instrumentation will be recommended as **on by default**. +When instrumentation is meant for the General Class, we will strive to make the names and examples as prescriptive as possible. This instrumentation is what will drive the most important use cases we really want to cover with the system semantic conventions. Things like dashboards, alerts, and broader o11y setup tutorials will largely feature General Class instrumentation covering the [basic use cases][use cases doc] we have laid out as a group. We want this instrumentation to be very clear exactly how and when they should be used. General Class instrumentation will be recommended as **on by default**. 
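As an illustration of the kind of turnkey usage General Class instrumentation should support, a fleet-wide alert should be expressible directly against a metric like `system.memory.utilization` with no platform-specific interpretation. The following is a sketch only, using the `psutil` library locally to stand in for the collected metric; the threshold is illustrative, not a recommendation.

```python
# Sketch: a General Class metric feeding a simple threshold alert.
import psutil

MEMORY_UTILIZATION_THRESHOLD = 0.90  # illustrative value


def memory_alert_firing() -> bool:
    vm = psutil.virtual_memory()
    # Equivalent in spirit to the `system.memory.utilization` metric: a
    # 0.0-1.0 ratio that needs no platform-specific knowledge to act on.
    utilization = (vm.total - vm.available) / vm.total
    return utilization > MEMORY_UTILIZATION_THRESHOLD


if memory_alert_firing():
    print("alert: system.memory.utilization above threshold")
```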
### Specialist Class: A more specific use case that specialists could enable to get more in-depth information that they already understand how to use

When instrumentation falls into the Specialist Class, we are assuming the target audience is already familiar with the concept and knows exactly what they are looking for and why. The goal for Specialist Class instrumentation is to ensure that users who have very specific and detailed needs are still covered by our semantic conventions so they don't need to go out of their way coming up with their own, risking the same kind of disparate instrumentation problem that semantic conventions are intended to solve. The main differences in how we handle Specialist Class instrumentation are:

1. The names and resulting values will map directly to what a user would expect hunting down the information themselves. We will rarely be prescriptive in how the information should be used or how it should be broken down. For example, a metric to represent a process's cgroup would have the resulting value match exactly to what the result would be if the user called `cat /proc/PID/cgroup`.
2. If a piece of instrumentation is specific to a particular operating system, the name of the operating system will be in the instrumentation name. See [Operating System in names](#operating-system-in-names) for more information. For example, a metric for a process's cgroup would be `process.linux.cgroup`, given that cgroups are a specific Linux kernel feature.

### Examples

Some General Class examples:

* Memory/CPU usage and utilization metrics
* General disk and network metrics
* Universal system/process information (names, identifiers, basic specs)

Some Specialist Class examples:

* Particular Linux features like special process/system information in procfs (see things like [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html))
* Particular Windows features like special process information (see things like [Windows Handles](https://learn.microsoft.com/windows/win32/sysinfo/about-handles-and-objects), [Process Working Set](https://learn.microsoft.com/windows/win32/procthread/process-working-set))
* Niche process information like open file descriptors, page faults, etc.

## Instrumentation Design Guide

When designing new instrumentation we will follow these steps as closely as possible:

### Choosing Instrumentation Class

In System Semantic Conventions, the most important questions when deciding whether a piece of instrumentation is General or Specialist would be:

* Is it cross-platform?
* Does it support our [most important use cases][use cases doc]?

The answer to both these questions will likely need to be "Yes" for the instrumentation to be considered General Class. Since the General Class instrumentation is what we expect the widest audience to use, we will need to scrutinize it more closely to ensure all of it is as necessary and useful as possible.

If the answer to either one of these is "No", then we will likely consider it Specialist Class.

### Naming

For General Class, choose a name that most accurately describes the general concept without biasing to a platform. Lean towards simplicity where possible, as this is the instrumentation that will be used by the widest audience; we want it to be as clear to understand and ergonomic to use as possible.

For Specialist Class, choose a name that most directly matches the words generally used to describe the concept in context. Since this instrumentation will be optional, and likely sought out by the people who already know exactly what they want out of it, we can prioritize matching the names as closely to their definition as possible.
For specialist class metrics that are platform exclusive, we will include the OS in the namespace as a sub-namespace (not the root namespace) if it is unlikely that the same metric name could ever be applied in a cross-platform manner. See [this section](#operating-system-in-names) for more details.

### Value

For General Class, we can be prescriptive with the value of the instrumentation. We want to ensure General Class instrumentation most closely matches our vision for our general use cases, and we want to ensure that users who are not specialists and just want the most important basic information can acquire it as easily as possible using out-of-the-box semconv instrumentation. This means we are more likely within General Class instrumentation to make judgements about exactly what the value should be, and whether the value should be reshaped by instrumentation in any case when pulling the values from sources if it serves general purpose use cases.

For Specialist Class, we should strive not to be prescriptive and instead match the concept being modeled as closely as possible. We expect specialist class instrumentation to be enabled by the people who already understand it. In a System Semconv context, these may be things a user previously gathered manually or through existing OS tools that they want to model as OTLP.

### Case study: `process.cgroup`

Relevant discussions: [\#1357](https://github.com/open-telemetry/semantic-conventions/issues/1357), [\#1364 (particular thread)](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1730743509)

In the `hostmetricsreceiver`, there is a Resource Attribute called `process.cgroup`. How should this metric be adopted in System Semantic Conventions?

Based on our definitions, this attribute would fall under Specialist Class:

* `cgroups` are a Linux-specific feature
* It is not directly part of any of the default out-of-the-box use cases we want to cover

In this attribute's case, there are two important considerations when deciding on the name:

* The attribute is specialist class
* It is Linux exclusive, and is unlikely to ever be introduced in other operating systems since the other major platforms have their own versions of it (Windows Job Objects, BSD Jails, etc)

This means we should pick a name that matches the verbiage used by specialists in context when referring to this concept. The way you would refer to this would be "a process's cgroup, collected from `/proc/PID/cgroup`". So we would start with the name `process.cgroup`. We also determined that this attribute is Linux-exclusive and are confident it will remain as such, so we land on the name `process.linux.cgroup`.

Since this metric falls under Specialist Class, we don't want to be too prescriptive about the value. A user who needs to know the `cgroup` of a process likely already has a pretty good idea of how to interpret it and use it further, and it would not be worth it for this Working Group to try and come up with every possible edge case for how it might be used. It is much simpler for this attribute, insofar as it falls under our purview, to simply reflect the value from the OS, i.e. the direct value from `cat /proc/PID/cgroup`. With cgroups in particular, there is high likelihood that more specialized semconv instrumentation could be developed, particularly in support of more specialized container runtime or systemd instrumentation. It's more useful for a working group developing special instrumentation that leverages cgroups to be more prescriptive about how the cgroup information should be interpreted and broken down with more specificity. The sketch below shows the kind of raw, unshaped value we have in mind.
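As a minimal illustration (Linux only, and assuming the procfs layout described above), "reflect the value from the OS" means the attribute carries exactly the bytes a specialist would read themselves:

```python
# Sketch: populate the `process.linux.cgroup` resource attribute with the
# raw OS value, with no reshaping by the instrumentation.
def read_process_cgroup(pid: int) -> str:
    # Same content a user would see from `cat /proc/PID/cgroup`,
    # e.g. "0::/user.slice/user-1000.slice/session-3.scope"
    with open(f"/proc/{pid}/cgroup", "r", encoding="utf-8") as f:
        return f.read().strip()


resource_attributes = {"process.linux.cgroup": read_process_cgroup(1)}
```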
## Operating System in names

Relevant discussions: [\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255), [\#1364](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1852465994)

Monitoring operating systems is an old practice, and there are lots of ways to skin the cat within different platforms. There are lots of metrics, even in basic stuff like memory usage, where there are platform specific pieces of information that are valuable to those who really specialize in that platform.

Thus we have decided that any instrumentation that is:

1. Specific to a particular operating system
2. Not meant to be part of what we consider our most important general use cases

will have the Operating System name as part of the namespace.

For example, there may be `process.linux`, `process.windows`, or `process.posix` names for metrics and attributes. We will not have root `linux.*`, `windows.*`, or `posix.*` namespaces. This is because of the principle we’re trying to uphold from the [Namespaces section](#namespaces); we still want the instrumentation source to be represented by the root namespace of the attribute/metric. If we had OS root namespaces, different sources like `system`, `process`, etc. could get very tangled within each OS namespace, defeating the intended design philosophy.

[use cases doc]: ./use-cases.md

diff --git a/docs/non-normative/groups/system/use-cases.md b/docs/non-normative/groups/system/use-cases.md
index ff38987e23..32183e6298 100644
--- a/docs/non-normative/groups/system/use-cases.md
+++ b/docs/non-normative/groups/system/use-cases.md

## **Host**

A user should be able to monitor the health of a host, including monitoring resource consumption, unexpected errors due to resource exhaustion or malfunction of core components of a host or fleet of hosts (network stack, memory, CPU…).
A user should be able to monitor the health of a host, including monitoring resource consumption, unexpected errors due to resource exhaustion or malfunction of core components of a host or fleet of hosts (network stack, memory, CPU, etc.).

### General Information

* Machine name
* ID (relevant to its context, could be a cloud provider ID or just base machine ID)
* OS information (platform, version, architecture, etc.)
* CPU Information
* Memory Capacity

### Dashboard

* Memory utilization
* CPU utilization
* Disk utilization
* Disk throughput
* Network traffic

### Alerts

* VM is down unexpectedly
* Network activity spikes unexpectedly
* Memory/CPU/Disk utilization goes above a % threshold

## Notes

The alerts in particular should be capable of being uniformly applied to a heterogeneous fleet of hosts. We will value the nature of cross-platform instrumentation to allow for effective alerting across a fleet regardless of the potential mixture of operating system platforms within it.

The term `host` can mean different things in other contexts:

* The term `host` in a network context: a central machine that many others are networked to
* The term `host` in a virtualization context: something that is hosting virtual guests such as VMs or containers

In this context, a host is generally considered to be some individual machine, physical or virtual. This can be extra confusing, because a unique machine `host` can also be a network `host` or virtualization `host` at the same time. This is a complexity we will have to accept due to the fact that the `host` namespace is deeply embedded in existing OpenTelemetry instrumentation and general verbiage. To the best of our ability, network and virtualization `host` instrumentation will be kept distinct by being within other namespaces that clearly denote which version of the term `host` is being referred to, while the root `host` namespace will refer to an individual machine.

## **Process**

A user should be able to monitor the health of an arbitrary process using data provided by the OS.
Reasons a user may want this:

1. The process they want to monitor doesn't have in-process runtime-specific instrumentation enabled or is not instrumentable at all, such as an antivirus or another background process.
2. They are monitoring lots of processes and want to have a set of uniform instrumentation for all of them.
3. Personal preference/legacy reasons; they might already be using OS signals to monitor stuff and it's an easier lift for them to move to basic process instrumentation, then move to other specific semconv over time.

### Alert

* Process stops unexpectedly
* Memory/CPU usage/utilization goes above a threshold
* Memory rises monotonically over a period of time (memory leak detection)

### Notes

* Unless the OS provides the utilization data directly, the utilization requires calculation. Process instrumentation would need to be associated with a host entity that contains data about its memory capacity for utilization metrics to be calculated. A minimal sketch of that calculation follows this list.
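The following sketch is illustrative only, using the `psutil` library for both the process and host sides of the calculation; in practice the host capacity would come from the associated host entity rather than a local call.

```python
# Sketch: process memory utilization needs host capacity that the
# process's own metrics do not carry.
import os

import psutil


def process_memory_utilization(pid: int) -> float:
    usage = psutil.Process(pid).memory_info().rss  # process memory usage
    capacity = psutil.virtual_memory().total       # host memory capacity
    return usage / capacity                        # 0.0 - 1.0


print(f"{process_memory_utilization(os.getpid()):.4f}")
```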
On top of alerts and dashboards, we will also consider the basic benchmarking of a process to be a general use case. The basic cross-platform stats we provide can also be used effectively for this, and we will consider that when making decisions about process instrumentation.

From 01f43e943767259ebb72f637b8fe5b428c8a16b0 Mon Sep 17 00:00:00 2001
From: braydonk
Date: Thu, 19 Dec 2024 18:24:40 +0000
Subject: [PATCH 7/7] wrap lines to 80 characters in system non-normative
---
 .prettierignore                               |   4 +
 .../groups/system/design-philosophy.md        | 266 +++++++++++++-----
 docs/non-normative/groups/system/use-cases.md | 117 +++++---
 package.json                                  |  13 +-
 4 files changed, 291 insertions(+), 109 deletions(-)

diff --git a/.prettierignore b/.prettierignore
index c7137972a9..d71d790b75 100644
--- a/.prettierignore
+++ b/.prettierignore

!/docs/cloud*/**
!/docs/attributes-registry*
!/docs/attributes-registry*/**
!/docs/non-normative*
!/docs/non-normative/groups*
!/docs/non-normative/groups/system*
!/docs/non-normative/groups/system*/**
/model
/schemas
CHANGELOG.md

diff --git a/docs/non-normative/groups/system/design-philosophy.md b/docs/non-normative/groups/system/design-philosophy.md
index aabcf8e5de..a979a43ba9 100644
--- a/docs/non-normative/groups/system/design-philosophy.md
+++ b/docs/non-normative/groups/system/design-philosophy.md

# System Semantic Conventions: Instrumentation Design Philosophy

The System Semantic Conventions are caught in a strange dichotomy that is unique
among other semconv groups. While we want to make sure we cover obvious generic
use cases, monitoring system health is a very old practice with lots of
different existing strategies.
While we can cover the basic use cases in cross
platform ways, we want to make sure that users who specialize in certain
platforms aren't left in the lurch; if users aren't given recommendations for
particular types of data that isn't cross-platform and universal, they may come
up with their own disparate ideas for how that instrumentation should look,
leading to the kind of fracturing that the semantic conventions should be in
place to avoid.

The following sections address some of the most common instrumentation design
questions, and how we as a working group have opted to address them. In some
cases they are unique to the common semantic conventions guidance due to our
unique circumstance, and those cases will be called out specifically.

## Namespaces

Relevant discussions:
[\#1161](https://github.com/open-telemetry/semantic-conventions/issues/1161)

The System Semantic Conventions generally cover the following namespaces:

- `system`
- `process`
- `host`
- `memory`
- `network`
- `disk`
- `os`

Deciding on the namespace of a metric/attribute is generally informed by the
following belief:

**The namespace of a metric/attribute should logically map to the Operating
System concept being considered as the instrumentation source.**

The most obvious example of this is with language runtime metrics and `process`
namespace metrics. Many of these metrics are very similar; most language
runtimes provide some manner of `cpu.time`, `memory.usage` and similar metrics.
If we were considering de-duplication as the top value in our design, it would
follow that `process.cpu.time` and `process.memory.usage` should simply be
referenced by any language runtime that might produce those metrics. However, as
a working group we believe it is important that `process` namespace and runtime
namespace metrics remain separate, because `process` metrics are meant to
represent an **OS-level process as the instrumentation source**, whereas runtime
metrics represent **the language runtime as the instrumentation source**.

In some cases this is simply a matter of making the instrumentation's purpose as
clear as possible, but there are cases where attempts to share definitions
across distinct instrumentation sources pose the potential for a clash. The
concrete example of a time we accepted this consequence is with `cpu.mode`; the
decision was to
[unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attribute](https://github.com/open-telemetry/semantic-conventions/issues/1139).
The consequence of this is that `cpu.mode` needs to have a broad enum in its
root definition, with special exemptions in each different `ref` of `cpu.mode`,
since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs
`system.cpu.time` etc. has different subsets of the overall enum values. We
decided as a group to accept the consequence in this case, however it isn't
something we're keen on dealing with all over system semconv, as the
instrumentation ends up polluted with so many edge cases in each namespace that
it defeats the purpose of sharing the attribute in the first place. The sketch
below illustrates what those diverging subsets look like in practice.
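The member lists here are indicative only; the authoritative enum and its
per-`ref` exemptions live in the semconv YAML definitions.

```python
# Illustration: subsets of the shared `cpu.mode` enum that different
# referencing metrics accept. Values are indicative, not a normative list.
CPU_MODE_BY_METRIC = {
    # The system-level metric accepts most of the broad root enum.
    "system.cpu.time": {
        "user", "system", "nice", "idle", "iowait", "interrupt", "steal",
    },
    # The process-level metric accepts only a small subset.
    "process.cpu.time": {"user", "system", "wait"},
    # The container-level metric accepts yet another subset.
    "container.cpu.time": {"user", "system", "kernel"},
}

# The root definition must be broad enough to cover every referencing metric,
# which is the pollution the paragraph above describes.
CPU_MODE_ROOT_ENUM = set().union(*CPU_MODE_BY_METRIC.values())
```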
+- `system` +- `process` +- `host` +- `memory` +- `network` +- `disk` +- `memory` +- `os` + +Deciding on the namespace of a metric/attribute is generally informed by the +following belief: + +**The namespace of a metric/attribute should logically map to the Operating +System concept being considered as the instrumentation source.** + +The most obvious example of this is with language runtime metrics and `process` +namespace metrics. Many of these metrics are very similar; most language +runtimes provide some manner of `cpu.time`, `memory.usage` and similar metrics. +If we were considering de-duplication as the top value in our design, it would +follow that `process.cpu.time` and `process.memory.usage` should simply be +referenced by any language runtime that might produce those metrics. However, as +a working group we believe it is important that `process` namespace and runtime +namespace metrics remain separate, because `process` metrics are meant to +represent an **OS-level process as the instrumentation source**, whereas runtime +metrics represent **the language runtime as the instrumentation source**. + +In some cases this is simply a matter of making the instrumentation's purpose as +clear as possible, but there are cases where attempts to share definitions +across distinct instrumentation sources poses the potential for a clash. The +concrete example of a time we accepted this consequence is with `cpu.mode`; the +decision was to +[unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attribute](https://github.com/open-telemetry/semantic-conventions/issues/1139). +The consequence of this is that `cpu.mode` needs to have a broad enum in its +root definition, with special exemptions in each different `ref` of `cpu.mode`, +since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs +`system.cpu.time` etc. has different subsets of the overall enum values. We +decided as a group to accept the consequence in this case, however it isn't +something we're keen on dealing with all over system semconv, as the +instrumentation ends up polluted with so many edge cases in each namespace that +it defeats the purpose of sharing the attribute in the first place. ## Two Class Design Strategy -Relevant discussions: [\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634) +Relevant discussions: +[\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634) -We are considering two personas for system semconv instrumentation. If we have a piece of instrumentation, we decide which persona it is meant for and use that to make the decision for how we should name/treat that piece of instrumentation. +We are considering two personas for system semconv instrumentation. If we have a +piece of instrumentation, we decide which persona it is meant for and use that +to make the decision for how we should name/treat that piece of instrumentation. ### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access -When instrumentation is meant for the General Class, we will strive to make the names and examples as prescriptive as possible. This instrumentation is what will drive the most important use cases we really want to cover with the system semantic conventions. 
 
 ## Two Class Design Strategy
 
-Relevant discussions: [\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634)
+Relevant discussions:
+[\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634)
 
-We are considering two personas for system semconv instrumentation. If we have a piece of instrumentation, we decide which persona it is meant for and use that to make the decision for how we should name/treat that piece of instrumentation.
+We are considering two personas for system semconv instrumentation. If we have a
+piece of instrumentation, we decide which persona it is meant for and use that
+to decide how we should name and treat that piece of instrumentation.
 
 ### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access
 
-When instrumentation is meant for the General Class, we will strive to make the names and examples as prescriptive as possible. This instrumentation is what will drive the most important use cases we really want to cover with the system semantic conventions. Things like dashboards, alerts, and broader o11y setup tutorials will largely feature General Class instrumentation covering the [basic use cases][use cases doc] we have laid out as a group. We want this instrumentation to be very clear exactly how and when they should be used. General Class instrumentation will be recommended as **on by default**.
+When instrumentation is meant for the General Class, we will strive to make the
+names and examples as prescriptive as possible. This instrumentation is what
+will drive the most important use cases we really want to cover with the system
+semantic conventions. Things like dashboards, alerts, and broader o11y setup
+tutorials will largely feature General Class instrumentation covering the
+[basic use cases][use cases doc] we have laid out as a group. We want it to be
+very clear exactly how and when this instrumentation should be used. General
+Class instrumentation will be recommended as **on by default**.
 
 ### Specialist Class: A more specific use case that specialists could enable to get more in-depth information that they already understand how to use
 
-When instrumentation falls into the Specialist Class, we are assuming the target audience is already familiar with the concept and knows exactly what they are looking for and why. The goal for Specialist Class instrumentation is to ensure that users who have very specific and detailed needs are still covered by our semantic conventions so they don't need to go out of their way coming up with their own, risking the same kind of disparate instrumentation problem that semantic conventions are intended to solve.
-The main differences in how we handle Speciialist Class instrumentation are:
-
-1. The names and resulting values will map directly to what a user would expect hunting down the information themselves. We will rarely be prescriptive in how the information should be used or how it should be broken down. For example, a metric to represent a process's cgroup would have the resulting value match exactly to what the result would be if the user called `cat /proc/PID/cgroup`.
-2. If a piece of instrumentation is specific to a particular operating system, the name of the operating system will be in the instrumentation name. See [Operating System in names](#operating-system-in-names) for more information. For example, a metric for a process's cgroup would be `process.linux.cgroup`, given that cgroups are a specific Linux kernel feature.
+When instrumentation falls into the Specialist Class, we are assuming the target
+audience is already familiar with the concept and knows exactly what they are
+looking for and why. The goal for Specialist Class instrumentation is to ensure
+that users who have very specific and detailed needs are still covered by our
+semantic conventions so they don't need to go out of their way to come up with
+their own, risking the same kind of disparate instrumentation problem that
+semantic conventions are intended to solve. The main differences in how we
+handle Specialist Class instrumentation are:
+
+1. The names and resulting values will map directly to what a user would expect
+   hunting down the information themselves. We will rarely be prescriptive in
+   how the information should be used or how it should be broken down. For
+   example, a metric to represent a process's cgroup would have the resulting
+   value match exactly to what the result would be if the user called
+   `cat /proc/PID/cgroup`.
+2. If a piece of instrumentation is specific to a particular operating system,
+   the name of the operating system will be in the instrumentation name. See
+   [Operating System in names](#operating-system-in-names) for more information.
+   For example, a metric for a process's cgroup would be `process.linux.cgroup`,
+   given that cgroups are a specific Linux kernel feature.
 
 ### Examples
 
 Some General Class examples:
 
-* Memory/CPU usage and utilization metrics
-* General disk and network metrics
-* Universal system/process information (names, identifiers, basic specs)
+- Memory/CPU usage and utilization metrics
+- General disk and network metrics
+- Universal system/process information (names, identifiers, basic specs)
 
 Some Specialist Class examples:
 
-* Particular Linux features like special process/system information in procfs (see things like [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html))
-* Particular Windows features like special process information (see things like [Windows Handles](https://learn.microsoft.com/windows/win32/sysinfo/about-handles-and-objects), [Process Working Set](https://learn.microsoft.com/windows/win32/procthread/process-working-set))
-* Niche process information like open file descriptors, page faults, etc.
+- Particular Linux features like special process/system information in procfs
+  (see things like
+  [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or
+  [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html))
+- Particular Windows features like special process information (see things like
+  [Windows Handles](https://learn.microsoft.com/windows/win32/sysinfo/about-handles-and-objects),
+  [Process Working Set](https://learn.microsoft.com/windows/win32/procthread/process-working-set))
+- Niche process information like open file descriptors, page faults, etc.
 
 ## Instrumentation Design Guide
 
-When designing new instrumentation we will follow these steps as closely as possible:
+When designing new instrumentation we will follow these steps as closely as
+possible:
 
 ### Choosing Instrumentation Class
 
-In System Semantic Conventions, the most important questions when deciding whether a piece of instrumentation is General or Specialist would be:
+In System Semantic Conventions, the most important questions when deciding
+whether a piece of instrumentation is General or Specialist would be:
 
-* Is it cross-platform?
-* Does it support our [most important use cases][use cases doc] then we will make it general class
+- Is it cross-platform?
+- Does it support our [most important use cases][use cases doc]?
 
-The answer to both these questions will likely need to be "Yes" for the instrumentation to be considered General Class. Since the General Class instrumentation is what we expect the widest audience to use, we will need to scrutinize it more closely to ensure all of it is as necessary and useful as possible.
+The answer to both these questions will likely need to be "Yes" for the
+instrumentation to be considered General Class. Since the General Class
+instrumentation is what we expect the widest audience to use, we will need to
+scrutinize it more closely to ensure all of it is as necessary and useful as
+possible.
 
-If the answer to either one of these is "No", then we will likely consider it Specialist Class.
+If the answer to either one of these is "No", then we will likely consider it
+Specialist Class.
 
-### Naming
+### Naming
 
-For General Class, choose a name that most accurately descibes the general concept without biasing to a platform. Lean towards simplicity where possible, as this is the instrumentation that will be used by the widest audience; we want it to be as clear to understand and ergonomic to use as possible.
+For General Class, choose a name that most accurately describes the general
+concept without biasing to a platform. Lean towards simplicity where possible,
+as this is the instrumentation that will be used by the widest audience; we want
+it to be as easy to understand and ergonomic to use as possible.
 
-For Specialist Class, choose a name that most directly matches the words generally used to describe the concept in context. Since this instrumentation will be optional, and likely sought out by the people who already know exactly what they want out of it, we can prioritize matching the names as closely to their definition as possible.
-For specialist class metrics that are platform exclusive, we will include the OS in the namespace as a sub-namespace (not the root namespace) if it is unlikely that the same metric name could ever be applied in a cross-platform manner. See [this section](#operating-system-in-names) for more details.
+For Specialist Class, choose a name that most directly matches the words
+generally used to describe the concept in context. Since this instrumentation
+will be optional, and likely sought out by the people who already know exactly
+what they want out of it, we can prioritize matching the names as closely to
+their definition as possible. For Specialist Class metrics that are platform
+exclusive, we will include the OS in the namespace as a sub-namespace (not the
+root namespace) if it is unlikely that the same metric name could ever be
+applied in a cross-platform manner. See
+[this section](#operating-system-in-names) for more details.
 
 ### Value
 
-For General Class, the value we can be prescriptive with the value of the instrumentation. We want to ensure General Class instrumentation most closely matches our vision for our general use cases, and we want to ensure that users who are not specialists and just want the most important basic information can acquire it as easily as possible using out-of-the-box semconv instrumentation. 
+For General Class, we can be prescriptive with the value of the
+instrumentation. We want to ensure General Class instrumentation most closely
+matches our vision for our general use cases, and we want to ensure that users
+who are not specialists and just want the most important basic information can
+acquire it as easily as possible using out-of-the-box semconv instrumentation.
+This means that, within General Class instrumentation, we are more likely to
+make judgements about exactly what the value should be, and whether
+instrumentation should reshape the values it pulls from sources when doing so
+serves general purpose use cases.
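+
+As a sketch of what being prescriptive with the value can mean in practice,
+consider a utilization metric: rather than echoing whatever raw counters a
+given OS exposes, the convention prescribes a single reshaped value. The
+definition below is illustrative and simplified, not the actual model file:
+
+```yaml
+# Hypothetical, simplified sketch of a General Class metric definition.
+- id: metric.system.memory.utilization
+  type: metric
+  metric_name: system.memory.utilization
+  instrument: gauge
+  unit: "1"
+  brief: >
+    Fraction of memory in use, computed from raw usage and total memory
+    so that the value reads the same on every platform.
+```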
+
+For Specialist Class, we should strive not to be prescriptive and instead match
+the concept being modeled as closely as possible. We expect Specialist Class
+instrumentation to be enabled by the people who already understand it. In a
+System Semconv context, these may be things a user previously gathered manually
+or through existing OS tools that they want to model as OTLP.
 
 ### Case study: `process.cgroup`
 
-Relevant discussions: [\#1357](https://github.com/open-telemetry/semantic-conventions/issues/1357), [\#1364 (particular thread)](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1730743509)
+Relevant discussions:
+[\#1357](https://github.com/open-telemetry/semantic-conventions/issues/1357),
+[\#1364 (particular thread)](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1730743509)
 
-In the `hostmetricsreceiver`, there is a Resource Attribute called `process.cgroup`. How should this metric be adopted in System Semantic Conventions?
+In the `hostmetricsreceiver`, there is a Resource Attribute called
+`process.cgroup`. How should this attribute be adopted in System Semantic
+Conventions?
 
 Based on our definitions, this attribute would fall under Specialist Class:
 
-* `cgroups` are a Linux-specific feature
-* It is not directly part of any of the default out-of-the-box usecases we want to cover
-
-In this attribute's case, there are two important considerations when deciding on the name:
-
-* The attribute is specialist class
-* It is Linux exclusive, and is unlikely to ever be introduced in other operating systems since the other major platforms have their own versions of it (Windows Job Objects, BSD Jails, etc)
-
-This means we should pick a name that matches the verbiage used by specialists in context when referring to this concept. The way you would refer to this would be "a process's cgroup, collected from `/proc//cgroup`". So we would start with the name `process.cgroup`. We also determined that this attribute is Linux-exclusive and are confident it will remain as such, so we land on the name `process.linux.cgroup`.
-
-Since this metric falls under Specialist Class, we don't want to be too prescriptive about the value. A user who needs to know the `cgroup` of a process likely already has a pretty good idea of how to interpret it and use it further, and it would not be worth it for this Working Group to try and come up with every possible edge case for how it might be used. It is much simpler for this attribute, insofar as it falls under our purview, to simply reflect the value from the OS, i.e. the direct value from `cat /proc//cgroup`. With cgroups in particular, there is high likelihood that more specialized semconv instrumentation could be developed, particularly in support of more specialized container runtime or systemd instrumentation. It's more useful for a working group developing special instrumentation that leverages cgroups to be more prescriptive about how the cgroup information should be interpreted and broken down with more specificity.
+- `cgroups` are a Linux-specific feature
+- It is not directly part of any of the default out-of-the-box use cases we
+  want to cover
+
+In this attribute's case, there are two important considerations when deciding
+on the name:
+
+- The attribute is Specialist Class
+- It is Linux exclusive, and is unlikely to ever be introduced in other
+  operating systems since the other major platforms have their own versions of
+  it (Windows Job Objects, BSD Jails, etc.)
+
+This means we should pick a name that matches the verbiage used by specialists
+in context when referring to this concept. The way you would refer to this would
+be "a process's cgroup, collected from `/proc/PID/cgroup`". So we would start
+with the name `process.cgroup`. We also determined that this attribute is
+Linux-exclusive and are confident it will remain as such, so we land on the name
+`process.linux.cgroup`.
+
+Since this attribute falls under Specialist Class, we don't want to be too
+prescriptive about the value. A user who needs to know the `cgroup` of a process
+likely already has a pretty good idea of how to interpret it and use it further,
+and it would not be worth it for this Working Group to try to come up with
+every possible edge case for how it might be used. It is much simpler for this
+attribute, insofar as it falls under our purview, to simply reflect the value
+from the OS, i.e. the direct value from `cat /proc/PID/cgroup`. With cgroups in
+particular, there is a high likelihood that more specialized semconv
+instrumentation could be developed, particularly in support of more specialized
+container runtime or systemd instrumentation. It's more useful for a working
+group developing specialized instrumentation that leverages cgroups to be
+prescriptive about how the cgroup information should be interpreted and broken
+down with more specificity.
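+
+For illustration, the pass-through shape could look like the following, where
+the PID and cgroup path are made up for the example:
+
+```yaml
+# Raw output of `cat /proc/1234/cgroup` on a hypothetical cgroup v2 host:
+#   0::/system.slice/docker.service
+# The attribute carries that value through unparsed:
+process.linux.cgroup: "0::/system.slice/docker.service"
+```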
 
 ## Operating System in names
 
-Relevant discussions: [\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255), [\#1364](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1852465994)
+Relevant discussions:
+[\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255),
+[\#1364](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1852465994)
 
-Monitoring operating systems is an old practice, and there are lots of ways to skin the cat within different platforms. There are lots of metrics, even in basic stuff like memory usage, where there are platform specific pieces of information that are valuable to those who really specialize in that platform.
+Monitoring operating systems is an old practice, and there are lots of ways to
+skin the cat within different platforms. There are lots of metrics, even in
+basic stuff like memory usage, where there are platform-specific pieces of
+information that are valuable to those who really specialize in that platform.
 
 Thus we have decided that any instrumentation that is:
 
-1. Specific to a particular operating system 
+1. Specific to a particular operating system
 2. Not meant to be part of what we consider our most important general use cases
 
-will have the Operating System name as part of the namespace. 
+will have the Operating System name as part of the namespace.
 
-For example, there may be `process.linux`, `process.windows`, or `process.posix` names for metrics and attributes. We will not have root `linux.*`, `windows.*`, or `posix.*` namespaces. This is because of the principle we’re trying to uphold from the [Namespaces section](#namespaces); we still want the instrumentation source to be represented by the root namespace of the attribute/metric. If we had OS root namespaces, different sources like `system`, `process`, etc. could get very tangled within each OS namespace, defeating the intended design philosophy.
+For example, there may be `process.linux`, `process.windows`, or `process.posix`
+names for metrics and attributes. We will not have root `linux.*`, `windows.*`,
+or `posix.*` namespaces. This is because of the principle we’re trying to uphold
+from the [Namespaces section](#namespaces); we still want the instrumentation
+source to be represented by the root namespace of the attribute/metric. If we
+had OS root namespaces, different sources like `system`, `process`, etc. could
+get very tangled within each OS namespace, defeating the intended design
+philosophy.
 
-[use cases doc](./use-cases.md)
+[use cases doc]: ./use-cases.md
diff --git a/docs/non-normative/groups/system/use-cases.md b/docs/non-normative/groups/system/use-cases.md
index 32183e6298..99d8fabf1b 100644
--- a/docs/non-normative/groups/system/use-cases.md
+++ b/docs/non-normative/groups/system/use-cases.md
@@ -1,82 +1,115 @@
 # **System Semantic Conventions: General Use Cases**
 
-This document is a collection of the use cases that we want to cover with the System Semantic Conventions. The use cases outlined here inform the working group’s decisions around what instrumentation is considered **required**.
-Use cases in this document will be stated in a generic way that does not refer to any potentially existing instrumentation in semconv as of writing, such that when we do dig into specific instrumentation, we understand their importance based on our holistic view of expected use cases.
+This document is a collection of the use cases that we want to cover with the
+System Semantic Conventions. The use cases outlined here inform the working
+group’s decisions around what instrumentation is considered **required**. Use
+cases in this document will be stated in a generic way that does not refer to
+any potentially existing instrumentation in semconv as of writing, such that
+when we do dig into specific instrumentation, we understand their importance
+based on our holistic view of expected use cases.
 
-## *Legend*
+## _Legend_
 
-`General Information` \= The information that should be discoverable either through the entity, metrics, or metric attributes.
+`General Information` = The information that should be discoverable either
+through the entity, metrics, or metric attributes.
 
-`Dashboard` \= The information that should be attainable through metrics to create a comprehensive dashboard.
+`Dashboard` = The information that should be attainable through metrics to
+create a comprehensive dashboard.
 
-`Alerts` \= Some examples of common alerts that should be creatable with the available information.
+`Alerts` = Some examples of common alerts that should be creatable with the
+available information.
 
 ## **Host**
 
-A user should be able to monitor the health of a host, including monitoring resource consumption, unexpected errors due to resource exhaustion or malfunction of core components of a host or fleet of hosts (network stack, memory, CPU, etc.).
+A user should be able to monitor the health of a host, including monitoring
+resource consumption and unexpected errors due to resource exhaustion or
+malfunction of core components of a host or fleet of hosts (network stack,
+memory, CPU, etc.).
 
 ### General Information
 
-* Machine name
-* ID (relevant to its context, could be a cloud provider ID or just base machine ID)
-* OS information (platform, version, architecture, etc)
-* CPU Information
-* Memory Capacity
+- Machine name
+- ID (relevant to its context, could be a cloud provider ID or just base machine
+  ID)
+- OS information (platform, version, architecture, etc.)
+- CPU information
+- Memory capacity
 
 ### Dashboard
 
-* Memory utilization
-* CPU utilization
-* Disk utilization
-* Disk throughput
-* Network traffic
+- Memory utilization
+- CPU utilization
+- Disk utilization
+- Disk throughput
+- Network traffic
 
 ### Alerts
 
-* VM is down unexpectedly
-* Network activity spikes unexpectedly
-* Memory/CPU/Disk utilization goes above a % threshold
+- VM is down unexpectedly
+- Network activity spikes unexpectedly
+- Memory/CPU/Disk utilization goes above a % threshold
 
 ### Notes
 
-The alerts in particular should be capable of being uniformly applied to a heterogenous fleet of hosts. We will value the nature of cross-platform instrumentation to allow for effective alerting across a fleet regardless of the potential mixture of operating system platforms within it.
+The alerts in particular should be capable of being uniformly applied to a
+heterogeneous fleet of hosts. We value cross-platform instrumentation
+precisely because it allows for effective alerting across a fleet regardless
+of the potential mixture of operating system platforms within it.
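+
+For example, a single fleet-wide alert could key off one cross-platform metric.
+The sketch below uses a Prometheus-style rule format, with a made-up threshold,
+assuming the metric and label names are rendered the way a typical Prometheus
+exporter would render them:
+
+```yaml
+groups:
+  - name: host-health
+    rules:
+      - alert: HighCpuUtilization
+        # One rule covers Linux and Windows hosts alike, because
+        # system.cpu.utilization is defined cross-platform.
+        expr: avg by (host_name) (system_cpu_utilization) > 0.9
+        for: 10m
+```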
 
 The term `host` can mean different things in other contexts:
 
-* The term `host` in a network context, a central machine that many others are networked to, or the term `host` in a virtualization context
-* The term `host` in a virtualization context, something that is hosting virtual guests such as VMs or containers
+- The term `host` in a network context, a central machine that many others are
+  networked to
+- The term `host` in a virtualization context, something that is hosting virtual
+  guests such as VMs or containers
 
-In this context, a host is generally considered to be some individual machine, physical or virtual. This can be extra confusing, because a unique machine `host` can also be a network `host` or virtualization `host` at the same time. This is a complexity we will have to accept due to the fact that the `host` namespace is deeply embedded in existing OpenTelemetry instrumentation and general verbiage. To the best of our ability, network and virtualization `host` instrumentation will be kept distinct by being within other namespaces that clearly denote which version of the term `host` is being referred to, while the root `host` namespace will refer to an individual machine.
+In this context, a host is generally considered to be some individual machine,
+physical or virtual. This can be extra confusing, because a unique machine
+`host` can also be a network `host` or virtualization `host` at the same time.
+This is a complexity we will have to accept due to the fact that the `host`
+namespace is deeply embedded in existing OpenTelemetry instrumentation and
+general verbiage. To the best of our ability, network and virtualization `host`
+instrumentation will be kept distinct by being within other namespaces that
+clearly denote which version of the term `host` is being referred to, while the
+root `host` namespace will refer to an individual machine.
 
 ## **Process**
 
-A user should be able to monitor the health of an arbitrary process using data provided by the OS.
-Reasons a user may want this:
+A user should be able to monitor the health of an arbitrary process using data
+provided by the OS. Reasons a user may want this:
 
-1. The process they want to monitor doesn't have in-process runtime-specific instrumentation enabled or is not instrumentable at all, such as an antivirus or another background process.
-2. They are monitoring lots of processes and want to have a set of uniform instrumentation for all of them.
-3. Personal preference/legacy reasons; they might already be using OS signals to monitor stuff and it's an easier lift for them to move to basic process instrumentation, then move to other specific semconv over time.
+1. The process they want to monitor doesn't have in-process runtime-specific
+   instrumentation enabled or is not instrumentable at all, such as an antivirus
+   or another background process.
+2. They are monitoring lots of processes and want to have a set of uniform
+   instrumentation for all of them.
+3. Personal preference/legacy reasons; they might already be using OS signals to
+   monitor stuff and it's an easier lift for them to move to basic process
+   instrumentation, then move to other specific semconv over time.
 
 ### General Information
 
-* Process name
-* Pid
-* User/owner
+- Process name
+- PID
+- User/owner
 
 ### Dashboard
 
-* Physical Memory usage and/or utilization
-* Virtual Memory usage
-* CPU usage and/or utilization
-* Disk throughput
-* Network throughput
+- Physical memory usage and/or utilization
+- Virtual memory usage
+- CPU usage and/or utilization
+- Disk throughput
+- Network throughput
 
 ### Alerts
 
-* Process stops unexpectedly
-* Memory/CPU usage/utilization goes above a threshold
-* Memory exclusively rises over a period of time (memory leak detection)
+- Process stops unexpectedly
+- Memory/CPU usage/utilization goes above a threshold
+- Memory usage only ever rises over a period of time (memory leak detection)
 
 ### Notes
 
-On top of alerts and dashboards, we will also consider the basic benchmarking of a process to be a general usecase. The basic cross platform stats that can be provided in a cross-platform manner can also be effectively used for this, and we will consider that when making decisions about process instrumentation.
+On top of alerts and dashboards, we will also consider the basic benchmarking of
+a process to be a general use case. The basic stats that can be provided in a
+cross-platform manner can also be used effectively for this, and we will
+consider that when making decisions about process instrumentation.
diff --git a/package.json b/package.json
index ef069a000a..4177d0e07d 100644
--- a/package.json
+++ b/package.json
@@ -19,6 +19,17 @@
     "through2": "^4.0.2"
   },
   "prettier": {
-    "proseWrap": "preserve"
+    "proseWrap": "preserve",
+    "overrides": [
+      {
+        "files": [
+          "**/non-normative/groups/system/**/*.md"
+        ],
+        "options": {
+          "printWidth": 80,
+          "proseWrap": "always"
+        }
+      }
+    ]
   }
 }