
[chore] System Semantic Conventions Non-Normative Guidance #1618

Open · wants to merge 7 commits into `main`
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -53,6 +53,7 @@
/model/os/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers
/model/process/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers @open-telemetry/semconv-security-approvers
/model/system/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers
/docs/non-normative/groups/system @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers

# Mobile semantic conventions
/docs/mobile/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-mobile-approvers
4 changes: 4 additions & 0 deletions .prettierignore
@@ -7,6 +7,10 @@
!/docs/cloud*/**
!/docs/attributes-registry*
!/docs/attributes-registry*/**
!/docs/non-normative*
!/docs/non-normative/groups*
!/docs/non-normative/groups/system*
!/docs/non-normative/groups/system*/**
/model
/schemas
CHANGELOG.md
259 changes: 259 additions & 0 deletions docs/non-normative/groups/system/design-philosophy.md
@@ -0,0 +1,259 @@
# System Semantic Conventions: Instrumentation Design Philosophy

The System Semantic Conventions are caught in a tension that is unique among
semconv groups. While we want to make sure we cover the obvious generic use
cases, monitoring system health is a very old practice with lots of different
existing strategies. While we can cover the basic use cases in cross-platform
ways, we also want to make sure that users who specialize in certain platforms
aren't left in the lurch; if users aren't given recommendations for particular
types of data that aren't cross-platform and universal, they may come up with
their own disparate ideas for how that instrumentation should look, leading to
the kind of fracturing that the semantic conventions exist to avoid.

The following sections address some of the most common instrumentation design
questions and how we as a working group have opted to address them. In some
cases our answers differ from the common semantic conventions guidance because
of these unique circumstances; those cases are called out specifically.

## Namespaces

Relevant discussions:
[\#1161](https://github.com/open-telemetry/semantic-conventions/issues/1161)

The System Semantic Conventions generally cover the following namespaces:

- `system`
- `process`
- `host`
- `memory`
- `network`
- `disk`
- cpu
- `os`

Deciding on the namespace of a metric/attribute is generally informed by the
following belief:

**The namespace of a metric/attribute should logically map to the Operating
System concept being considered as the instrumentation source.**

The most obvious example of this is with language runtime metrics and `process`
namespace metrics. Many of these metrics are very similar; most language
runtimes provide some manner of `cpu.time`, `memory.usage` and similar metrics.
If we were considering de-duplication as the top value in our design, it would
follow that `process.cpu.time` and `process.memory.usage` should simply be
referenced by any language runtime that might produce those metrics. However, as
a working group we believe it is important that `process` namespace and runtime
namespace metrics remain separate, because `process` metrics are meant to
represent an **OS-level process as the instrumentation source**, whereas runtime
metrics represent **the language runtime as the instrumentation source**.

In some cases this is simply a matter of making the instrumentation's purpose as
clear as possible, but there are cases where attempts to share definitions
across distinct instrumentation sources pose the potential for a clash. A
concrete example of a case where we accepted this consequence is `cpu.mode`; the
decision was to
[unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attribute](https://github.com/open-telemetry/semantic-conventions/issues/1139).
The consequence of this is that `cpu.mode` needs to have a broad enum in its
root definition, with special exemptions in each different `ref` of `cpu.mode`,
since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs
`system.cpu.time` etc. has different subsets of the overall enum values. We
decided as a group to accept the consequence in this case; however, it isn't
something we're keen on dealing with all over system semconv, as the
instrumentation ends up polluted with so many edge cases in each namespace that
it defeats the purpose of sharing the attribute in the first place.
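
As a rough sketch of what this pattern looks like in the semconv model YAML
(simplified and abridged; the enum members, notes, and group definitions below
are illustrative rather than the authoritative registry entries):

```yaml
groups:
  # Root definition of the shared attribute: the enum must be broad enough to
  # cover every instrumentation source that references it.
  - id: registry.cpu
    type: attribute_group
    brief: Attributes describing a CPU.
    attributes:
      - id: cpu.mode
        type:
          members:
            - id: user
              value: "user"
            - id: system
              value: "system"
            - id: kernel
              value: "kernel"
            - id: idle
              value: "idle"
        brief: The mode of the CPU.

  # Each metric that refs the attribute carves out its own applicable subset,
  # e.g. via an overridden note.
  - id: metric.process.cpu.time
    type: metric
    metric_name: process.cpu.time
    brief: Total CPU seconds broken down by mode.
    instrument: counter
    unit: "s"
    attributes:
      - ref: cpu.mode
        note: >
          Following modes SHOULD be used: `user`, `system`, `kernel`.
```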

## Two Class Design Strategy

Relevant discussions:
[\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634)

We consider two personas for system semconv instrumentation. For each piece of
instrumentation, we decide which persona it is meant for and use that decision
to determine how we should name and treat that piece of instrumentation.

### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access

When instrumentation is meant for the General Class, we will strive to make the
names and examples as prescriptive as possible. This instrumentation is what
will drive the most important use cases we really want to cover with the system
semantic conventions. Things like dashboards, alerts, and broader o11y setup
tutorials will largely feature General Class instrumentation covering the
[basic use cases][use cases doc] we have laid out as a group. We want it to be
very clear exactly how and when this instrumentation should be used. General
Class instrumentation will be recommended as **on by default**.

### Specialist Class: A more specific use case that specialists could enable to get more in-depth information that they already understand how to use

When instrumentation falls into the Specialist Class, we are assuming the target
audience is already familiar with the concept and knows exactly what they are
looking for and why. The goal for Specialist Class instrumentation is to ensure
that users who have very specific and detailed needs are still covered by our
semantic conventions so they don't need to go out of their way coming up with
their own, risking the same kind of disparate instrumentation problem that
semantic conventions are intended to solve. The main differences in how we
handle Specialist Class instrumentation are:

1. The names and resulting values will map directly to what a user would expect
when hunting down the information themselves. We will rarely be prescriptive in
how the information should be used or how it should be broken down. For
example, a metric to represent a process's cgroup would have the resulting
value match exactly to what the result would be if the user called
`cat /proc/PID/cgroup`.
2. If a piece of instrumentation is specific to a particular operating system,
the name of the operating system will be in the instrumentation name. See
[Operating System in names](#operating-system-in-names) for more information.
For example, a metric for a process's cgroup would be `process.linux.cgroup`,
given that cgroups are a specific Linux kernel feature.

### Examples

Some General Class examples:

- Memory/CPU usage and utilization metrics
- General disk and network metrics
- Universal system/process information (names, identifiers, basic specs)

Some Specialist Class examples:

- Particular Linux features like special process/system information in procfs
  (see things like
  [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or
  [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html))
- Particular Windows features like special process information (see things like
  [Windows Handles](https://learn.microsoft.com/windows/win32/sysinfo/about-handles-and-objects),
  [Process Working Set](https://learn.microsoft.com/windows/win32/procthread/process-working-set))
- Niche process information like open file descriptors, page faults, etc.

Member:

While the whole description of the rationale here is exactly how it should be,
I think we're missing a set of rules/guidelines/sanity checks that would help
somebody in the future decide which category a metric or attribute falls into.
This might not be easy to define because of the nature of this problem, but it
may be worth adding a section at the bottom suggesting how these kinds of
situations should be handled in the future.

Contributor Author:

I do have a case study below for `process.linux.cgroup`; perhaps I can adapt
this to more general rules?

Contributor Author:

Done in 487af83

## Instrumentation Design Guide

When designing new instrumentation we will follow these steps as closely as
possible:

### Choosing Instrumentation Class

In System Semantic Conventions, the most important questions when deciding
whether a piece of instrumentation is General or Specialist Class are:

- Is it cross-platform?
- Does it support our [most important use cases][use cases doc]?

The answer to both these questions will likely need to be "Yes" for the
instrumentation to be considered General Class. Since the General Class
instrumentation is what we expect the widest audience to use, we will need to
scrutinize it more closely to ensure all of it is as necessary and useful as
possible.

If the answer to either one of these is "No", then we will likely consider it
Specialist Class.

### Naming

For General Class, choose a name that most accurately describes the general
concept without biasing toward a platform. Lean towards simplicity where
possible, as this is the instrumentation that will be used by the widest
audience; we want it to be as clear to understand and as ergonomic to use as
possible.

For Specialist Class, choose a name that most directly matches the words
generally used to describe the concept in context. Since this instrumentation
will be optional, and likely sought out by the people who already know exactly
what they want out of it, we can prioritize matching the names as closely to
their definition as possible. For specialist class metrics that are platform
exclusive, we will include the OS in the namespace as a sub-namespace (not the
root namespace) if it is unlikely that the same metric name could ever be
applied in a cross-platform manner. See
[this section](#operating-system-in-names) for more details.

### Value

For General Class, we can be prescriptive about the value of the
instrumentation. We want to ensure General Class instrumentation most closely
matches our vision for our general use cases, and we want to ensure that users
who are not specialists and just want the most important basic information can
acquire it as easily as possible using out-of-the-box semconv instrumentation.
This means that for General Class instrumentation we are more likely to make
judgements about exactly what the value should be, and whether the value should
be reshaped by the instrumentation when pulling it from its source, if doing so
serves general-purpose use cases.

For Specialist Class, we should strive not to be prescriptive and instead match
the concept being modeled as closely as possible. We expect specialist class
instrumentation to be enabled by the people who already understand it. In a
System Semconv context, these may be things a user previously gathered manually
or through existing OS tools that they want to model as OTLP.

### Case study: `process.cgroup`

Relevant discussions:
[\#1357](https://github.com/open-telemetry/semantic-conventions/issues/1357),
[\#1364 (particular thread)](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1730743509)

In the `hostmetricsreceiver`, there is a Resource Attribute called
`process.cgroup`. How should this attribute be adopted in the System Semantic
Conventions?

Based on our definitions, this attribute would fall under Specialist Class:

- `cgroups` are a Linux-specific feature
- It is not directly part of any of the default out-of-the-box use cases we want
to cover

In this attribute's case, there are two important considerations when deciding
on the name:

- The attribute is specialist class
- It is Linux exclusive, and is unlikely to ever be introduced in other
operating systems since the other major platforms have their own versions of
it (Windows Job Objects, BSD Jails, etc)

This means we should pick a name that matches the verbiage used by specialists
in context when referring to this concept. The way you would refer to this would
be "a process's cgroup, collected from `/proc/<pid>/cgroup`". So we would start
with the name `process.cgroup`. We also determined that this attribute is
Linux-exclusive and are confident it will remain as such, so we land on the name
`process.linux.cgroup`.

Since this attribute falls under Specialist Class, we don't want to be too
prescriptive about the value. A user who needs to know the `cgroup` of a process
likely already has a pretty good idea of how to interpret it and use it further,
and it would not be worth it for this Working Group to try to come up with
every possible edge case for how it might be used. It is much simpler for this
attribute, insofar as it falls under our purview, to simply reflect the value
from the OS, i.e. the direct value from `cat /proc/<pid>/cgroup`. With cgroups
in particular, there is a high likelihood that more specialized semconv
instrumentation could be developed, particularly in support of more specialized
container runtime or systemd instrumentation. It is more useful for a working
group developing special instrumentation that leverages cgroups to be
prescriptive about how the cgroup information should be interpreted and broken
down with more specificity.
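
To make this concrete, here is a hedged sketch of how such an attribute might
be defined in the registry YAML (illustrative only; the actual definition,
stability, and examples may differ):

```yaml
groups:
  - id: registry.process.linux
    type: attribute_group
    brief: Linux-specific process attributes.
    attributes:
      - id: process.linux.cgroup
        type: string
        brief: The control group associated with the process.
        note: >
          Control groups (cgroups) are a Linux kernel feature for grouping and
          managing processes. The value SHOULD match the contents of
          `/proc/[PID]/cgroup` verbatim, with no reshaping or parsing done by
          the instrumentation.
        examples: ["0::/user.slice/user-1000.slice/session-3.scope"]
```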

## Operating System in names

Relevant discussions:
[\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255),
[\#1364](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1852465994)

Monitoring operating systems is an old practice, and within different platforms
there are lots of ways to skin the cat. There are lots of metrics, even for
basic things like memory usage, where platform-specific pieces of information
are valuable to those who really specialize in that platform.

Thus we have decided that any instrumentation that is:

1. Specific to a particular operating system
2. Not meant to be part of what we consider our most important general use cases

will have the Operating System name as part of the namespace.

For example, there may be `process.linux`, `process.windows`, or `process.posix`
names for metrics and attributes. We will not have root `linux.*`, `windows.*`,
or `posix.*` namespaces. This is because of the principle we’re trying to uphold
from the [Namespaces section](#namespaces); we still want the instrumentation
source to be represented by the root namespace of the attribute/metric. If we
had OS root namespaces, different sources like `system`, `process`, etc. could
get very tangled within each OS namespace, defeating the intended design
philosophy.
Comment on lines +250 to +257
Contributor:

I'm curious what the specific problems would be if we gave up on the prefix and used the OS name as a root?

I'm trying to document naming patterns we have in #1708

and I'm actually struggling to understand what benefit the domain prefix brings.

Contributor (@lmolkova, Dec 21, 2024):

E.g. what should I do if I want to describe a property of the OS that's indifferent to the instrumentation point/source? Which namespace would I use?

Contributor Author:

I need to refine this to express it more concisely, but in the interest of moving the conversation forward I'm going to braindump everything I have here. I'd like to work together to ensure what I'm saying makes sense and can potentially be refined to be easier to follow.


My concerns right now are theoretical; if they're unfounded for some reason I can revisit it. My reasoning was first laid out in this comment on the cgroup PR: #1364 (comment)

Within the semconv that I am familiar with, the root namespace is what organizes instrumentation into the categories they're meant for. http means that these are signals related to http, db means these are signals about databases, so on so forth. In System Semconv, we took this a step further because a single system category would have been too broad and contained too many disparate concepts within it, i.e. we would have had a bunch of system.process, system.memory etc. which would have erased the significance of a system namespace in the first place [1]. So our instrumentation is separated into multiple root namespaces where each root namespace represents the source of the instrumentation, i.e. process namespace is for instrumentation that is about the operating system's concept of a process and so on.

With that context in mind, the issues I have with the OS name as a root namespace are:

- It presents a similar problem of a category being too broad
  - In fairness, it is at least more useful than `system` containing everything
    would have been, because it would still present some kind of information (a
    `linux` root namespace would mean this is Linux only). However, I think
    it's still too broad a category, and we'd end up with lots of
    instrumentation unrelated to each other being within the namespace, i.e.
    we'd have `linux.process`, `linux.memory`, `linux.network`, etc.
- It separates related instrumentation from each other
  - The benefit of the instrumentation source being the root namespace is that
    a user who wants to know all the possible instrumentation related to that
    source only needs to look in one namespace. If platform-exclusive metrics
    were placed into platform namespaces, then to find all existing
    memory-related instrumentation, the user would need to realize they need to
    look in two namespaces: the `memory` namespace and the namespace for their
    platform (and the namespace for their platform would contain lots of other
    stuff not related to memory).

Here's the way I think of it as generic as I can manage:

The end of a semconv name is like an object within a category. The end of the name is basically like saying this is what the name actually represents. In the Collector, where many semconv transitions have not yet happened, much instrumentation doesn't have these namespaces because the receiver they are found within is already a form of organization; if I need to know instrumentation about something, I check the receiver related to that thing and look what's there. Within semconv, the decision was made for everything to be namespaced. This makes sense in a general environment, where you aren't inherently structured and need names to contain organizational context so that you can find the instrumentation you're looking for in a sea of other telemetry.

Given that, I see the goal of the namespaces being logical organization. This means the namespaces should be in order of categorical importance. The "importance" is considered recursively for each sub-namespace.

I'm going to demonstrate this with a name picked at random-ish[2]: go.memory.gc.goal

I'm considering the "identity" of the name to be goal, and each element before that to be a namespace.

You could look at the organization of the name in two directions, and I think it needs to make sense from both directions to be an effective name.

Starting from the identity backwards:
What is goal referring to? It's a garbage collection goal. Garbage collection is a memory management concept in go. Thus the name makes sense in that direction, as goal is contained within gc which is within memory which is within go.

Starting from the root namespace:
I want to know about garbage collection of my Go program. The category that makes the most sense would be go since that's the runtime I want to know about. I want to know about the memory of my Go program, in particular the gc goal. The namespaces are ordered in a way that makes sense for me to discover that information.

To demonstrate the negative example, I could reorder this name to be: memory.go.gc.goal

Starting from the identity backwards:
What is goal referring to? It's a garbage collection goal. This is the garbage collection of a Go program. This is a general memory concept. This kind of works, you can still understand what the goal identity means, but it is a bit broken in the other direction.

Starting from the root namespace:
I want to know about garbage collection of my Go program. If I start with the memory category, there are lots of other unrelated memory metrics within it, within which I need to find the go sub-namespace first before being able to find garbage collection goal. In this case, because the less important memory namespace is used as the root, the category ends up being very broad and finding my Go metric means wading through a lot of things that are not related.

In different contexts, determining what is the "most important" category to use as the root namespace is somewhat subjective. Within System Semconv we came up with a pretty reliable rule, which is that the root namespace represents the source of the instrumentation. In the go.memory.gc.goal case, go (the runtime name) is the source of the instrumentation, so that rule kind of works here too. But I'm not sure how well we can guarantee that rule will generically apply.


> what should I do if I want to describe a property of OS that's indifferent to instrumentation point/source? which namespace would I use?

I think in that case I actually consider the instrumentation source to be either the operating system or the general system. So if I had a Windows-exclusive metric that is about the Windows Operating System itself, and there's no cross-platform name I could use, I'd probably still use the `os` namespace, i.e. `os.windows.<identity>`.


Footnotes:

[1] We might be among the first individual working groups to need to do this, but it's not the first time the problem has been encountered in semconv. `runtime` opted to do the same thing as well, since `runtime.jvm`, `runtime.nodejs`, `runtime.go`, etc. would essentially erase the usefulness of `runtime` as a root category.

[2] I did intentionally pick a runtime metric that looked like it might clash with other namespaces, since that's something that's come up for our group as well.

Contributor (@lmolkova, Dec 23, 2024):

a few points:

1. The thing being reported is more important than the instrumentation source.
   I hope that current instrumentations are temporary and we'll see more and
   more native ones. When you think about native instrumentations, things
   change. E.g.:

   - Windows itself emits tons of metrics/events. Should they all be reported
     under `os`? Some of them describe system performance, some describe user
     behavior. Do we group all of them under `os`? It seems redundant. Or just
     some of them? How do we decide?
   - Any database has management/control plane operations that have nothing to
     do with DB features (auth, permissions, connection management, scaling,
     etc.). Do we report them as `db.{mydb}.*` attributes/metrics? E.g.
     `db.cassandra.paxos.prepare.duration` - the protocol is orthogonal to the
     DB features of Cassandra, but the metric is still about Cassandra. Now, do
     we report `db.cassandra.compaction.something` (because it's about the
     database) and `cassandra.paxos.prepare.duration` (because it's about the
     protocol)?

   TL;DR: any specific system/client lib has features that belong to more than
   one root namespace. How instrumentation is done may change (from a specific
   collector component to a native one), but the metrics we define should
   survive it.

2. Having a common root namespace for the "General Class" makes perfect sense
   to me: everything common about OS goes under `system`, everything common
   about databases goes under `db`.

3. I'm challenging the "Specialist Class" naming: if I'm reporting different
   metrics related to the JVM, everyone who cares knows that the JVM is a
   runtime; `runtime` in front of it is redundant. If I care about
   Cassandra-specific metrics, I'm no longer in the DB domain - `cassandra` is
   the root namespace and everything about Cassandra goes there.

I.e. how strong do we feel about

> For example, a metric for a process's cgroup would be `process.linux.cgroup`,
> given that cgroups are a specific Linux kernel feature.

vs `linux.cgroup`?

Contributor:

[UPDATE]:

The `process.linux.cgroup` one seems to be tricky since `process` is meaningful there. `cgroup` is a property of the (Linux) process. I believe it's less tricky in most other semconv cases. Let me see.

Contributor Author (@braydonk, Dec 23, 2024):

I didn't cover well enough the point you brought up, which is that there may be a case where the platform/domain name becomes so specific and disparate in itself that it warrants being considered a unique instrumentation source/root namespace/category.

Taking the db.cassandra example, I imagine it would follow a process like this hypothetical scenario:

1. `db.*` may have a large number of attributes that can be used
   cross-platform, and our guidance largely expects that they are used as much
   as possible. For stuff specific to Cassandra, some `db.cassandra.*`
   attributes start to get sprinkled into the namespace when necessary.
2. Over time more Cassandra specialists join in and start adding more and more
   `db.cassandra` attributes. It begins to pollute the `db` category, such that
   instrumenting Cassandra in particular involves a far greater variety of
   things very specific to it. The fact that so many `db.cassandra.*` things
   are being added implies that these things are Cassandra-specific already,
   since they should have been using generic attributes when possible.
3. A decision is made that since Cassandra-specific instrumentation has become
   so rich and disparate from everything else in the `db.*` namespace, it makes
   the most sense to split `cassandra` out into its own root namespace to
   represent the richness of the instrumentation available for it and how much
   it diverges.

Applying this thought process to the questions in your above comment:

> windows itself emits tons of metrics/events. Should they all be reported under os ?

I think the steps from above could still occur in the same order. We'd try to use as many generic attributes as possible, introducing `os.windows.*` where it is specifically necessary. If the `os.windows` category becomes deep and disparate enough from everything else, then it might make sense to move it into its own namespace.

> How strong do we feel about process.linux.cgroup vs linux.cgroup

It is essentially for the reason you stated in your update; this is a bit of a different case because whether Linux exclusive or not, the process will always be the instrumentation source. And where something like os and db may be more like categories than an explicit object being instrumented, process will always be the most important root namespace and linux.cgroup on its own wouldn't carry the same semantic meaning.

That being said, there is a case to be made for cgroup here, as cgroup is itself a potentially instrumentable source. It's (hopefully) only a matter of time before semantic conventions come along to instrument cgroups in more detail. In that case, I wonder if that would start in linux.cgroup or if it would just be cgroup off the bat. That I'm not sure about.


The problem I foresee with my own ideas here is it assumes things can be moved around easier. I think the problem with my thought process is that it sort of requires foresight to make sure that there isn't the potential for us to want to extract a platform namespace before reaching the point of stability for another namespace. So all of this probably needs some more thought and discussion. Maybe when I come back from the holidays I'll have the answer (probably not but I can dream 😃).

Contributor:

Given that some of the things we're discussing (`db.cosmosdb`) are reasonably close to stability, I think we should try to envision as many things as we can - we might not have another chance in the next few years 🤞

I don't think we need a rigid naming policy - i.e. `process.linux.cgroup` can be whatever makes the most sense for it.
I'll bring up the general naming to the Semconv SIG after the break.

Happy holidays!


[use cases doc]: ./use-cases.md
115 changes: 115 additions & 0 deletions docs/non-normative/groups/system/use-cases.md
@@ -0,0 +1,115 @@
# **System Semantic Conventions: General Use Cases**

This document is a collection of the use cases that we want to cover with the
System Semantic Conventions. The use cases outlined here inform the working
group’s decisions around what instrumentation is considered **required**. Use
cases in this document will be stated in a generic way that does not refer to
any potentially existing instrumentation in semconv as of writing, such that
when we do dig into specific instrumentation, we understand their importance
based on our holistic view of expected use cases.

## _Legend_

`General Information` \= The information that should be discoverable either
through the entity, metrics, or metric attributes.

`Dashboard` \= The information that should be attainable through metrics to
create a comprehensive dashboard.

`Alerts` \= Some examples of common alerts that should be creatable with the
available information.

## **Host**

A user should be able to monitor the health of a host, including monitoring
resource consumption, unexpected errors due to resource exhaustion or
malfunction of core components of a host or fleet of hosts (network stack,
memory, CPU, etc.).

### General Information

- Machine name
- ID (relevant to its context, could be a cloud provider ID or just base machine
ID)
- OS information (platform, version, architecture, etc)
- CPU Information
- Memory Capacity

### Dashboard

- Memory utilization
- CPU utilization
- Disk utilization
- Disk throughput
- Network traffic

### Alerts

- VM is down unexpectedly
- Network activity spikes unexpectedly
- Memory/CPU/Disk utilization goes above a % threshold
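
As one concrete illustration of the last alert, here is a hedged sketch of a
Prometheus-style alerting rule; it assumes `system.cpu.utilization` is exported
under the translated name `system_cpu_utilization_ratio` and that `host.name`
is available as a `host_name` label, both of which depend on the pipeline
configuration:

```yaml
groups:
  - name: host-health
    rules:
      - alert: HostCpuUtilizationHigh
        # Assumes system.cpu.utilization -> system_cpu_utilization_ratio and
        # host.name -> host_name; adjust to your exporter's naming.
        expr: avg by (host_name) (system_cpu_utilization_ratio) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization above 90% on {{ $labels.host_name }}"
```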

## Notes

The alerts in particular should be capable of being uniformly applied to a
heterogeneous fleet of hosts. We will value the nature of cross-platform
instrumentation to allow for effective alerting across a fleet regardless of the
potential mixture of operating system platforms within it.

The term `host` can mean different things in other contexts:

- The term `host` in a network context: a central machine that many others are
  networked to
- The term `host` in a virtualization context: something that is hosting
  virtual guests such as VMs or containers

In this context, a host is generally considered to be some individual machine,
physical or virtual. This can be extra confusing, because a unique machine
`host` can also be a network `host` or virtualization `host` at the same time.
This is a complexity we will have to accept due to the fact that the `host`
namespace is deeply embedded in existing OpenTelemetry instrumentation and
general verbiage. To the best of our ability, network and virtualization `host`
instrumentation will be kept distinct by being within other namespaces that
clearly denote which version of the term `host` is being referred to, while the
root `host` namespace will refer to an individual machine.

## **Process**

A user should be able to monitor the health of an arbitrary process using data
provided by the OS. Reasons a user may want this:

1. The process they want to monitor doesn't have in-process runtime-specific
instrumentation enabled or is not instrumentable at all, such as an antivirus
or another background process.
2. They are monitoring lots of processes and want to have a set of uniform
instrumentation for all of them.
3. Personal preference/legacy reasons; they might already be using OS signals to
monitor stuff and it's an easier lift for them to move to basic process
instrumentation, then move to other specific semconv over time.

### General Information

- Process name
- Pid
- User/owner

### Dashboard

- Physical Memory usage and/or utilization
- Virtual Memory usage
- CPU usage and/or utilization
- Disk throughput
- Network throughput

### Alerts

- Process stops unexpectedly
- Memory/CPU usage/utilization goes above a threshold
- Memory exclusively rises over a period of time (memory leak detection)

### Notes

On top of alerts and dashboards, we will also consider basic benchmarking of a
process to be a general use case. The basic stats that can be provided in a
cross-platform manner can also be used effectively for this, and we will
consider that when making decisions about process instrumentation.