diff --git a/.mdox.yaml b/.mdox.yaml
index f28ece2..9626a10 100644
--- a/.mdox.yaml
+++ b/.mdox.yaml
@@ -38,6 +38,12 @@ transformations:
         weight: 2
         pre:
+  - glob: "Products/OpenshiftMonitoring/instrumentation.md"
+    frontMatter:
+      template: |
+        title: "{{ .Origin.FirstHeader }}"
+        lastmod: "{{ .Origin.LastMod }}"
+        weight: 5
   - glob: "Products/OpenshiftMonitoring/collecting_metrics.md"
     frontMatter:
       template: |
diff --git a/content/Products/OpenshiftMonitoring/instrumentation.md b/content/Products/OpenshiftMonitoring/instrumentation.md
new file mode 100644
index 0000000..b8b2aeb
--- /dev/null
+++ b/content/Products/OpenshiftMonitoring/instrumentation.md
@@ -0,0 +1,82 @@
+# Instrumentation guidelines
+
+This document details good practices to adopt when you instrument your application for Prometheus. It is not meant to be a replacement for the [upstream documentation](https://prometheus.io/docs/practices/instrumentation/) but an introduction focused on the OpenShift use case.
+
+## Targeted audience
+
+This document is intended for OpenShift developers who want to instrument their operators and operands for Prometheus.
+
+## Getting started
+
+To instrument software written in Golang, see the official [Golang client](https://pkg.go.dev/github.com/prometheus/client_golang). For other languages, refer to the [curated list](https://prometheus.io/docs/instrumenting/clientlibs/#client-libraries) of client libraries.
+
+Prometheus stores all data as time series: streams of timestamped values (samples), each identified by a metric name and a unique set of labels (a.k.a. dimensions or key/value pairs). The data model is described in detail on this [page](https://prometheus.io/docs/concepts/data_model/). Time series are represented like this:
+
+```
+# HELP http_requests_total Total number of HTTP requests by method and handler.
+# TYPE http_requests_total counter
+http_requests_total{method="GET", handler="/messages"} 500
+http_requests_total{method="POST", handler="/messages"} 10
+```
+
+Prometheus supports 4 [metric types](https://prometheus.io/docs/concepts/metric_types/):
+* Gauge, which represents a single numerical value that can arbitrarily go up and down.
+* Counter, a cumulative metric whose value can only increase or be reset to zero on restart. When querying a counter metric, you usually apply a `rate()` or `increase()` function.
+* Histogram, which samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.
+* Summary, which also samples observations but reports configurable quantiles over a sliding time window. In practice, summaries are rarely used.
+
+Adding metrics for any operation should be part of the code review process, like any other factor that is considered for production-ready code.
+
+To learn more about when to use which metric type, how to name metrics and how to choose labels, read the following documentation:
+* [Prometheus naming recommendations](https://prometheus.io/docs/practices/naming/)
+* [Prometheus instrumentation](https://prometheus.io/docs/practices/instrumentation/)
+* [Kubernetes metric instrumentation guide](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-instrumentation.md)
+* [Instrumenting a Go application for Prometheus](https://prometheus.io/docs/guides/go-application/)
+
+## Example
+
+Here is a fictional Go code example instrumented with a Gauge metric and a multi-dimensional Counter metric:
+
+```golang
+	cpuTemp := prometheus.NewGauge(prometheus.GaugeOpts{
+		Name: "cpu_temperature_celsius",
+		Help: "Current temperature of the CPU.",
+	})
+
+	hdFailures := prometheus.NewCounterVec(
+		prometheus.CounterOpts{
+			Name: "hd_errors_total",
+			Help: "Number of hard-disk errors.",
+		},
+		[]string{"device"},
+	)
+
+	reg := prometheus.NewRegistry()
+	reg.MustRegister(cpuTemp, hdFailures)
+
+	cpuTemp.Set(55.2)
+
+	// Record 1 failure for the /dev/sda device.
+	hdFailures.With(prometheus.Labels{"device":"/dev/sda"}).Inc()
+	// Record 3 failures for the /dev/sdb device.
+	hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
+	hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
+	hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
+```
+
+## Labels
+
+Deciding when to add a label to a metric is a [difficult choice](https://prometheus.io/docs/practices/instrumentation/#use-labels). The general rule is: the fewer labels, the better. Every unique combination of label names and values creates a new time series, and Prometheus memory usage is mostly driven by the number of time series loaded into RAM during ingestion and querying. A good rule of thumb is to have fewer than 10 time series per metric name and target. A common mistake is to store dynamic information such as usernames, IP addresses or error messages in a label, which can lead to thousands of time series.
+
+Labels such as `pod`, `service`, `job` and `instance` shouldn't be set by the application. Instead, they are set at runtime by Prometheus when it queries the Kubernetes API to discover which targets should be scraped for metrics.
+
+## Custom collectors
+
+It is sometimes not feasible to use one of the 4 metric types, typically when your application already has the information stored for another purpose (for instance, it maintains a list of custom objects retrieved from the Kubernetes API). In this case, the [custom collector](https://pkg.go.dev/github.com/prometheus/client_golang@v1.20.4/prometheus#hdr-Custom_Collectors_and_constant_Metrics) pattern can be useful.
+
+You can find an example of this pattern in the [github.com/prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator/blob/3df0811bdc7c046cb283006d94092e42219a0e2f/pkg/operator/operator.go#L166-L191) project.
+
+## Next steps
+
+* [Collect metrics](collecting_metrics.md) with Prometheus.
+* [Configure alerting](alerting.md) with Prometheus.
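
The "Labels" section of `instrumentation.md` states that every unique combination of label names and values creates a new time series. The standalone Go sketch below may help make that concrete. It is not part of the patch and does not use the Prometheus client library: `seriesKey` and `countSeries` are hypothetical helpers written for this illustration, and the `username` label and `user-N` values are made-up data. It only counts distinct label sets, which approximates how many series a single target would expose for one metric name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity for a series from a metric name and
// its labels. Hypothetical helper: it only mimics the idea that a metric name
// plus a unique label set identifies one time series.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort so that map iteration order does not matter
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// countSeries returns how many distinct series the given observations create.
func countSeries(name string, observations []map[string]string) int {
	seen := make(map[string]struct{})
	for _, labels := range observations {
		seen[seriesKey(name, labels)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Bounded label: only a handful of HTTP methods exist.
	bounded := []map[string]string{
		{"method": "GET"}, {"method": "POST"}, {"method": "GET"},
	}

	// Unbounded label (made-up data): every user adds a new series.
	var unbounded []map[string]string
	for i := 0; i < 1000; i++ {
		unbounded = append(unbounded, map[string]string{
			"method":   "GET",
			"username": fmt.Sprintf("user-%d", i),
		})
	}

	fmt.Println(countSeries("http_requests_total", bounded))   // prints 2
	fmt.Println(countSeries("http_requests_total", unbounded)) // prints 1000
}
```

With the bounded `method` label the series count stays at the number of distinct methods, whereas the per-user label grows with the number of users, far beyond the "fewer than 10 time series per metric name and target" rule of thumb.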