
[BUG] AKS Managed prometheus - Big discrepancy between portal metrics and prometheus metrics #4696

Open · grzesuav opened this issue Dec 10, 2024 · 6 comments


@grzesuav

Describe the bug
There are a few metrics:

  • Active time series in Azure portal
  • Active time series % utilization in Azure portal
  • scrape_samples_scraped from Prometheus, which is the number of samples the target exposed.

I cannot correlate the first two (from the Azure portal) with the Prometheus one.

Image
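
For context, the Prometheus-side number referred to above can be viewed per scrape job with a query along these lines (a sketch, not an official AKS query):

sum by (job) (scrape_samples_scraped)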

To Reproduce

Expected behavior

Those metrics should be in line; currently there is no way other than the scrape_samples_scraped metric to see the number of metrics in each job.

Which one should be trusted?

Additional context

Related #4159

@grzesuav grzesuav added the bug label Dec 10, 2024
@grzesuav (Author)

Continuation of the topic started in #4689. @vishiy @aritraghosh

@vishiy commented Dec 11, 2024

@grzesuav - how is scrape_samples_scraped equal to the time-series count? One is samples scraped (not ingested, as many could be dropped due to relabeling) and the other is unique time series (not samples).
To compare samples scraped with the per-minute AMW ingestion quota usage, you should try:

sum(sum_over_time(scrape_samples_post_metric_relabeling[1m]))

and compare with the metrics chart in the portal (the samples/min ingested metric, not time series).
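
A per-job breakdown of the same estimate can help narrow down which scrape job drives the ingestion volume; a sketch, assuming the standard job label is present on the synthetic scrape metrics:

sum by (job) (sum_over_time(scrape_samples_post_metric_relabeling[1m]))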

@grzesuav (Author) commented Dec 23, 2024

I used this query and got this plot for the last 7 days. I cannot explain the drop for the controlplane-apiserver job, as there was no activity on our side with any of the config.

Image

Here is the Azure portal metric:
Image

@grzesuav (Author)

@vishiy

  1. How can I check the cardinality of a particular metric, in order to disable it?
  2. How can I check what is being throttled by AKS managed Prometheus?
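
For reference, per-metric series cardinality in a Prometheus-compatible store can usually be inspected with a generic query like the one below; this is only a sketch, and such a broad matcher may be expensive or restricted on a managed query endpoint:

# Top 20 metric names by number of active series.
topk(20, count by (__name__) ({__name__=~".+"}))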

@grzesuav (Author)

Also, while extending the query per cluster and over a wider timeline, it seems something changed on 20.12 that caused a significant drop of metrics in centralus. Was there any fix applied on the AKS/managed Prometheus side around that date?

Image
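
The per-cluster extension of the earlier query might look like this (a sketch, assuming a cluster label is attached to the scrape metrics, as the managed add-on does):

sum by (cluster) (sum_over_time(scrape_samples_post_metric_relabeling[1m]))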

@grzesuav (Author)

After looking at the historical data:

  1. November - Image
  2. December - Image

it seems that my minimalIngestionProfile setting was ignored from ~12 November until mid December, and it keeps being ignored in one cluster even now. It seems like the configmap setting keeps getting ignored, similar to #4689.

So, after all, flagging the quota here was a red herring: the reason for the increased ingestion was that the minimalIngestionProfile setting from the configmap was not being respected.

At least this is my current working theory.
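
One way to test that theory would be a query along these lines; the metric name here is a placeholder for any metric that minimalIngestionProfile should drop:

# some_metric_outside_default_keep_list is a placeholder: substitute a metric
# outside the default keep list. Nonzero per-cluster series counts would mean
# the profile is not being applied in that cluster.
count by (cluster) (some_metric_outside_default_keep_list)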
