Skip to content

Commit

Permalink
Scale on OOMs for autoscaling components (#5739)
Browse files Browse the repository at this point in the history
* Add memory trigger to ruler-querier and account for OOMs in memory calculation for ruler-querier and ruler-query-frontend

* Add use_oom_trigger flag to all autoscaling components

* remove use_oom_trigger flag

* add memory panel to remote ruler reads dashboard

* build mixins

* update HPA query

* Update changelog

* changelog fixes
  • Loading branch information
chencs authored Sep 27, 2023
1 parent fc68c93 commit b7c052c
Show file tree
Hide file tree
Showing 9 changed files with 469 additions and 16 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -202,6 +202,7 @@
* [BUGFIX] Alerts: fixed `MimirIngesterHasNotShippedBlocks` and `MimirIngesterHasNotShippedBlocksSinceStart` alerts. #5396
* [BUGFIX] Alerts: Fix `MimirGossipMembersMismatch` to include `admin-api` and custom compactor pods. `admin-api` is a GEM component. #5641 #5797
* [BUGFIX] Dashboards: fix autoscaling dashboard panels that could show multiple series for a single component. #5810
* [BUGFIX] Dashboards: fix ruler-querier scaling metric panel query and split into CPU and memory scaling metric panels. #5739

### Jsonnet

Expand Down Expand Up @@ -234,7 +235,9 @@
* [ENHANCEMENT] Add _config.commonConfig to allow adding common configuration parameters for all Mimir components. #5703
* [ENHANCEMENT] Update rollout-operator to `v0.7.0`. #5718
* [ENHANCEMENT] Increase the default rollout speed for store-gateway when lazy loading is disabled. #5823
* [ENHANCEMENT] Add autoscaling on memory for ruler-queriers. #5739
* [BUGFIX] Fix compilation when index, chunks or metadata caches are disabled. #5710
* [BUGFIX] Autoscaling: treat OOMing containers as though they are using their full memory request. #5739

### Mimirtool

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26967,7 +26967,7 @@ data:
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down Expand Up @@ -27057,7 +27057,7 @@ data:
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down Expand Up @@ -27111,7 +27111,7 @@ data:
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "### Autoscaler failures rate\nThe rate of failures in the KEDA custom metrics API server. Whenever an error occurs, the KEDA custom\nmetrics server is unable to query the scaling metric from Prometheus so the autoscaler does not work properly.\n\n",
"description": "### Scaling metric (memory): Desired replicas\nThis panel shows the scaling metric exposed by KEDA divided by the target/threshold used.\nIt should represent the desired number of replicas, ignoring the min/max constraints applied later.\n\n",
"fill": 1,
"id": 13,
"legend": {
Expand All @@ -27133,7 +27133,83 @@ data:
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "keda_metrics_adapter_scaler_metrics_value{metric=~\".*memory.*\"}\n/\non(metric) group_left label_replace(\n kube_horizontalpodautoscaler_spec_target_metric{cluster=~\"$cluster\", namespace=~\"$namespace\", horizontalpodautoscaler=~\"keda-hpa-ruler-querier\"},\n \"metric\", \"$1\", \"metric_name\", \"(.+)\"\n)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ scaledObject }}",
"legendLink": null
}
],
"thresholds": [ ],
"timeFrom": null,
"timeShift": null,
"title": "Scaling metric (memory): Desired replicas",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [ ]
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
]
},
{
"aliasColors": { },
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "### Autoscaler failures rate\nThe rate of failures in the KEDA custom metrics API server. Whenever an error occurs, the KEDA custom\nmetrics server is unable to query the scaling metric from Prometheus so the autoscaler does not work properly.\n\n",
"fill": 1,
"id": 14,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [ ],
"nullPointMode": "null as zero",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -835,7 +835,7 @@
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down Expand Up @@ -925,7 +925,7 @@
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down Expand Up @@ -979,7 +979,7 @@
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "### Autoscaler failures rate\nThe rate of failures in the KEDA custom metrics API server. Whenever an error occurs, the KEDA custom\nmetrics server is unable to query the scaling metric from Prometheus so the autoscaler does not work properly.\n\n",
"description": "### Scaling metric (memory): Desired replicas\nThis panel shows the scaling metric exposed by KEDA divided by the target/threshold used.\nIt should represent the desired number of replicas, ignoring the min/max constraints applied later.\n\n",
"fill": 1,
"id": 13,
"legend": {
Expand All @@ -1001,7 +1001,83 @@
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "keda_metrics_adapter_scaler_metrics_value{metric=~\".*memory.*\"}\n/\non(metric) group_left label_replace(\n kube_horizontalpodautoscaler_spec_target_metric{cluster=~\"$cluster\", namespace=~\"$namespace\", horizontalpodautoscaler=~\"keda-hpa-ruler-querier\"},\n \"metric\", \"$1\", \"metric_name\", \"(.+)\"\n)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ scaledObject }}",
"legendLink": null
}
],
"thresholds": [ ],
"timeFrom": null,
"timeShift": null,
"title": "Scaling metric (memory): Desired replicas",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [ ]
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
]
},
{
"aliasColors": { },
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "### Autoscaler failures rate\nThe rate of failures in the KEDA custom metrics API server. Whenever an error occurs, the KEDA custom\nmetrics server is unable to query the scaling metric from Prometheus so the autoscaler does not work properly.\n\n",
"fill": 1,
"id": 14,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [ ],
"nullPointMode": "null as zero",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -835,7 +835,7 @@
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down Expand Up @@ -925,7 +925,7 @@
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down Expand Up @@ -979,7 +979,7 @@
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "### Autoscaler failures rate\nThe rate of failures in the KEDA custom metrics API server. Whenever an error occurs, the KEDA custom\nmetrics server is unable to query the scaling metric from Prometheus so the autoscaler does not work properly.\n\n",
"description": "### Scaling metric (memory): Desired replicas\nThis panel shows the scaling metric exposed by KEDA divided by the target/threshold used.\nIt should represent the desired number of replicas, ignoring the min/max constraints applied later.\n\n",
"fill": 1,
"id": 13,
"legend": {
Expand All @@ -1001,7 +1001,83 @@
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 4,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "keda_metrics_adapter_scaler_metrics_value{metric=~\".*memory.*\"}\n/\non(metric) group_left label_replace(\n kube_horizontalpodautoscaler_spec_target_metric{cluster=~\"$cluster\", namespace=~\"$namespace\", horizontalpodautoscaler=~\"keda-hpa-ruler-querier\"},\n \"metric\", \"$1\", \"metric_name\", \"(.+)\"\n)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ scaledObject }}",
"legendLink": null
}
],
"thresholds": [ ],
"timeFrom": null,
"timeShift": null,
"title": "Scaling metric (memory): Desired replicas",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [ ]
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
]
},
{
"aliasColors": { },
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "### Autoscaler failures rate\nThe rate of failures in the KEDA custom metrics API server. Whenever an error occurs, the KEDA custom\nmetrics server is unable to query the scaling metric from Prometheus so the autoscaler does not work properly.\n\n",
"fill": 1,
"id": 14,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [ ],
"nullPointMode": "null as zero",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [ ],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
Expand Down
28 changes: 28 additions & 0 deletions operations/mimir-mixin/dashboards/remote-ruler-reads.libsonnet
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,34 @@ local filename = 'mimir-remote-ruler-reads.json';
|||
),
)
.addPanel(
local title = 'Scaling metric (memory): Desired replicas';
$.panel(title) +
$.queryPanel(
[
|||
keda_metrics_adapter_scaler_metrics_value{metric=~".*memory.*"}
/
on(metric) group_left label_replace(
kube_horizontalpodautoscaler_spec_target_metric{%(namespace)s, horizontalpodautoscaler=~"%(hpa_name)s"},
"metric", "$1", "metric_name", "(.+)"
)
||| % {
hpa_name: $._config.autoscaling.ruler_querier.hpa_name,
namespace: $.namespaceMatcher(),
},
], [
'{{ scaledObject }}',
]
) +
$.panelDescription(
title,
|||
This panel shows the scaling metric exposed by KEDA divided by the target/threshold used.
It should represent the desired number of replicas, ignoring the min/max constraints applied later.
|||
),
)
.addPanel(
local title = 'Autoscaler failures rate';
$.panel(title) +
Expand Down
Loading

0 comments on commit b7c052c

Please sign in to comment.