---
layout: default
parent: Checks
grand_parent: Documentation
---
This check queries Prometheus servers and warns about queries that use metrics not currently present in Prometheus.
It parses the `expr` query from every rule, finds individual metric selectors and runs a series of checks for each of them.
Let's say we have a rule with this query: `sum(my_metric{foo="bar"}) > 10`.

This check would first try to determine if `my_metric{foo="bar"}` returns anything via an instant query and, if it doesn't, it will try to determine why, by checking if:

- `my_metric` metric was ever present in Prometheus
- `my_metric` was present but disappeared
- `my_metric` has any series with the `foo` label
- `my_metric` has any series matching `foo="bar"`
Metrics that are wrapped in `... or vector(0)` won't be checked, since
the intention of adding `or vector(0)` is to provide a fallback value
when there are no matching time series.
Example:

```yaml
- alert: Foo
  expr: sum(my_metric or vector(0)) > 1
```
If you see this check complaining about some metric it might be due to a number of different issues. Here are some usual cases.
Prometheus itself exposes metrics about active alerts, and it's possible to use those metrics in recording or alerting rules.
If pint finds a query using either the `ALERTS{alertname="..."}` or `ALERTS_FOR_STATE{alertname="..."}` selector it will check if there's an alerting rule with a matching name defined. For queries that don't pass any `alertname` label filters it will skip any further checks.
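As a minimal sketch of what that looks like (the rule names and expressions below are hypothetical), pint would look for an alerting rule named `Foo` when checking the recording rule that queries `ALERTS`:

```yaml
- alert: Foo
  expr: up == 0
  for: 5m

# pint checks that an alerting rule named "Foo" is defined
- record: foo:alerts:firing
  expr: count(ALERTS{alertname="Foo", alertstate="firing"})
```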
If a metric isn't present in Prometheus but pint finds a recording rule with a matching name then it will emit a warning and skip further checks.
Example with an alert rule that depends on two recording rules:
```yaml
# Count the number of targets per job
- record: job:up:count
  expr: count(up) by(job)

# Total number of targets that are healthy per job
- record: job:up:sum
  expr: sum(up) by(job)

# Alert if less than 50% of targets are up
- alert: Too Many Targets Are Down
  expr: (job:up:sum / job:up:count) < 0.5
```
If all three rules were added in a single PR and pint didn't try to match
metrics to recording rules, then `pint ci` would block such a PR because the metrics
this alert is using are not present in Prometheus.
To avoid this pint will only emit a warning, to make it obvious that it was unable to run a full set of checks, but won't report any problems.
For best results you should split your PR and first add all recording rules before adding the alert that depends on them. Otherwise pint might miss some problems, like a label mismatch.
Other common reasons why a metric might be missing include:

- You are trying to use a metric that is not present in Prometheus at all.
- The service exporting your metric is not working or no longer being scraped.
- You are querying the wrong Prometheus server.
- You are trying to filter a metric that exists using a label key that is never present on that metric.
- You are using a label value as a filter, but that value is never present.
If that's the case you need to fix your query. Make sure your metric is present and it has all the labels you expect to see.
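As a rough illustration, here is a hypothetical rule that would trigger this check because its label filter uses a key that no longer exists on the metric (the metric and label names below are made up):

```yaml
# Suppose the "environment" label was renamed to "env" at some point,
# so this filter no longer matches any http_errors_total series.
- alert: HighErrorRate
  expr: rate(http_errors_total{environment="prod"}[5m]) > 0.1
```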
Some time series for the same metric will have the label `foo` and some won't.
Although there's nothing technically wrong with this and Prometheus allows
you to do so, it makes querying metrics difficult, as results containing
the label `foo` will be mixed with other results not having that label.
All queries would effectively need a `{foo!=""}` or `{foo=""}` filter to
select only one variant of this metric.
The best solution here is to fix the labelling scheme.
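As a sketch of what that looks like in practice (the metric and rule names are placeholders), every query has to pick one variant explicitly:

```yaml
# Select only series that do have the "foo" label ...
- record: job:my_metric:with_foo
  expr: sum(my_metric{foo!=""}) by(job)

# ... and a separate rule for series without it.
- record: job:my_metric:without_foo
  expr: sum(my_metric{foo=""}) by(job)
```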
Some label values will appear only temporarily, for example if metrics are generated for served HTTP requests and they include some details of those requests that cannot be known ahead of time, like the request path or method.
When possible this can be addressed by initialising metrics with all known label values set to zero on startup:
```go
package main

import "github.com/prometheus/client_golang/prometheus"

var myMetric *prometheus.GaugeVec

func main() {
	myMetric = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"code"},
	)
	prometheus.MustRegister(myMetric)

	// Export every known "code" value right away, so each series exists
	// even before any matching request has been observed.
	myMetric.WithLabelValues("2xx").Set(0)
	myMetric.WithLabelValues("3xx").Set(0)
	myMetric.WithLabelValues("4xx").Set(0)
	myMetric.WithLabelValues("5xx").Set(0)
}
```
If that's not doable you can let pint know that it's not possible to validate those queries by disabling this check. See below for instructions on how to do that.
This check supports extra configuration options to fine-tune its behaviour.
Syntax:

```
check "promql/series" {
  lookbackRange = "7d"
  lookbackStep  = "5m"
  ignoreMetrics = [ "(.*)", ... ]
}
```
- `lookbackRange` - how far back to query when checking if a given metric was ever present in Prometheus. Default is `7d`, meaning that if a metric is missing pint will query the last 7 days of metrics to tell you if this metric was ever present and, if so, when it was last seen.
- `lookbackStep` - look-back query resolution. Default is `5m`, which matches Prometheus default staleness checks. If you have a custom `--query.lookback-delta` flag passed to Prometheus you might want to set this option to the same value.
- `ignoreMetrics` - list of regexp matchers; if a metric is missing from Prometheus but its name matches any of the provided regexp matchers then pint will only report a warning, instead of a bug level report.
Example:

```
check "promql/series" {
  lookbackRange = "5d"
  lookbackStep  = "1m"
  ignoreMetrics = [
    ".*_error",
    ".*_error_.*",
    ".*_errors",
    ".*_errors_.*",
  ]
}
```
By default this check will report a problem if a metric was present in Prometheus but disappeared at least two hours ago. You can change this duration per Prometheus rule by adding a comment around it. Syntax:
To set `min-age` for all metrics in a query:

```yaml
# pint rule/set promql/series min-age $duration
```
The duration must follow the syntax documented here.
To set `min-age` for a specific metric:

```yaml
# pint rule/set promql/series($metric_name) min-age $duration
```
Example:

```yaml
- record: ...
  # Report problems if any metric in this query is missing for at least 3 days
  # pint rule/set promql/series min-age 3d
  expr: sum(foo) / sum(bar)

- record: ...
  # Report problems if:
  # - metric "foo" is missing for at least 2 hours (default)
  # - metric "bar{instance=xxx}" is missing for at least 4 hours
  # pint rule/set promql/series(bar{instance="xxx"}) min-age 4h
  expr: sum(foo) / sum(bar{instance="xxx"})
```
By default pint will report a problem if a rule uses a query with a label filter and the value of that filter doesn't match anything.
For example `rate(http_errors_total{code="500"}[2m])` will report a problem
if there are no `http_errors_total` series with `code="500"`.
The goal here is to catch typos in label filters or labels with values that got renamed, but in some cases this will report false positive problems, especially if label values are exported dynamically, for example after an HTTP status code is observed.
In the `http_errors_total{code="500"}` example, if the `code` label is generated
based on HTTP responses then there won't be any series with `code="500"` until
there's at least one HTTP response that generated this code.
You can relax pint checks so it doesn't validate whether values of specific labels are present on any time series.
Syntax:

```yaml
# pint rule/set promql/series ignore/label-value $labelName
```
Example:

```yaml
- alert: ...
  # Disable code label checks for all metrics used in this rule
  # pint rule/set promql/series ignore/label-value code
  expr: rate(http_errors_total{code="500"}[2m]) > 0.1

- alert: ...
  # Disable code label checks for the http_errors_total metric
  # pint rule/set promql/series(http_errors_total) ignore/label-value code
  expr: rate(http_errors_total{code="500"}[2m]) > 0.1

- alert: ...
  # Disable code label checks only for http_errors_total{code="500"} queries
  # pint rule/set promql/series(http_errors_total{code="500"}) ignore/label-value code
  expr: rate(http_errors_total{code="500"}[2m]) > 0.1
```
This check is enabled by default for all configured Prometheus servers.
Example:

```
prometheus "prod" {
  uri     = "https://prometheus-prod.example.com"
  timeout = "60s"
  include = [
    "rules/prod/.*",
    "rules/common/.*",
  ]
}

prometheus "dev" {
  uri     = "https://prometheus-dev.example.com"
  timeout = "30s"
  include = [
    "rules/dev/.*",
    "rules/common/.*",
  ]
}
```
You can disable this check globally by adding this config block:
```
checks {
  disabled = ["promql/series"]
}
```
You can also disable it for all rules inside a given file by adding a comment anywhere in that file. Example:

```yaml
# pint file/disable promql/series
```
Or you can disable it per rule by adding a comment to it. Example:
```yaml
# pint disable promql/series
```
If you want to disable only individual instances of this check you can add a more specific comment.
```yaml
# pint disable promql/series($prometheus)
```
Where `$prometheus` is the name of the Prometheus server to disable.
Example:

```yaml
# pint disable promql/series(prod)
```
You can also disable `promql/series` for a specific metric using a
`# pint disable promql/series($selector)` comment.
Just like with PromQL, if a selector doesn't have any matchers then it will match all instances.
Example:

```yaml
- alert: foo
  # Disable promql/series for any instance of the my_metric_name metric selector
  # pint disable promql/series(my_metric_name)
  expr: my_metric_name{instance="a"} / my_metric_name{instance="b"}
```
To disable individual selectors you can pass matchers.
Example:

```yaml
- alert: foo
  # Disable promql/series only for the my_metric_name{instance="a"} metric selector
  # pint disable promql/series(my_metric_name{instance="a"})
  expr: my_metric_name{instance="a"} / my_metric_name{instance="b"}
```
Matching is done the same way PromQL matchers work - if the selector from the query has more matchers than the comment then it will still be matched.
Example:

```yaml
- alert: foo
  # Disable promql/series for any selector at least partially matching {job="dev"}
  # pint disable promql/series({job="dev"})
  expr: my_metric_name{job="dev", instance="a"} / other_metric_name{job="dev", instance="b"}
```
You can disable this check until a given time by adding a comment to it. Example:

```yaml
# pint snooze $TIMESTAMP promql/series
```
Where `$TIMESTAMP` is either an RFC3339 formatted timestamp or `YYYY-MM-DD`.
Adding this comment will disable `promql/series` until `$TIMESTAMP`; after that the
check will be re-enabled.
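For example, a quick sketch (the rule and the date below are just placeholders):

```yaml
- alert: Foo
  # Temporarily ignore missing series for this rule until the given date
  # pint snooze 2024-12-31 promql/series
  expr: sum(my_metric) > 10
```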