From 451ef9b9459da9acbab07e0d87486b0cf959aec1 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Fri, 25 Oct 2024 10:42:44 -0700
Subject: [PATCH] dist/openshift/cincinnati-deployment: Shift Deployment
 replicas to MAX_REPLICAS

We'd dropped 'replicas' in 82897813ed (replace HPA with keda
ScaledObject, 2024-10-09, #953), following AppSRE advice [1].  Rolling
that Template change out caused the Deployment to drop briefly to
replicas:1 before Keda raised it back up to MIN_REPLICAS (as
predicted [1]).  But in our haste to recover from the incident, we
both raised MIN_REPLICAS (good) and restored the 'replicas' line in
0bbb1b8721 (bring back the replica field and set it to min-replicas,
2024-10-24, #967).

That means we will need some future Template change to revert
0bbb1b8721 and re-drop 'replicas'.  In the meantime, every Template
application will cause the Deployment to blip briefly to the
Template-declared value, before Keda resets it to the value it
prefers.  Before this commit, the blip value is MIN_REPLICAS, which
can lead to rollouts like:

  $ oc -n cincinnati-production get -w -o wide deployment cincinnati
  NAME         READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS                                           IMAGES                                                                SELECTOR
  ...
  cincinnati   0/6     6            0           86s     cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  cincinnati   0/2     6            0           2m17s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  ...

when Keda wants 6 replicas and we push:

  $ oc process -p MIN_REPLICAS=2 -p MAX_REPLICAS=12 -f dist/openshift/cincinnati-deployment.yaml | oc -n cincinnati-production apply -f -
  deployment.apps/cincinnati configured
  prometheusrule.monitoring.coreos.com/cincinnati-recording-rule unchanged
  service/cincinnati-graph-builder unchanged
  ...

The Pod terminations on the blip to MIN_REPLICAS will drop our
capacity to serve clients, and at the moment it can take some time to
recover that capacity in replacement Pods.  Changes like 31ceb1de56
(add retry logic to fetching blob from container registry,
2024-10-24, #969) should speed new-Pod availability and reduce that
risk.

This commit moves the blip over to MAX_REPLICAS to avoid the
Pod-termination risk entirely.  Instead, we'll surge unnecessary
Pods, and potentially autoscale unnecessary Machines to host those
Pods.  But then Keda will return us to its preferred value, and we'll
delete the still-coming-up Pods and scale down any extra Machines.
Spending a bit of money on extra cloud Machines for each Template
application seems like a lower risk than the Pod-termination risk, to
get us through safely until we are prepared to remove 'replicas'
again and eat its one-time replicas:1 Pod-termination blip.

[1]: https://gitlab.cee.redhat.com/service/app-interface/-/blob/649aa9b681acf076a39eb4eecf0f88ff1cacbdcd/docs/app-sre/runbook/custom-metrics-autoscaler.md#L252
     (internal link, sorry external folks)
---
 dist/openshift/cincinnati-deployment.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dist/openshift/cincinnati-deployment.yaml b/dist/openshift/cincinnati-deployment.yaml
index b9d86c709..3417a498a 100644
--- a/dist/openshift/cincinnati-deployment.yaml
+++ b/dist/openshift/cincinnati-deployment.yaml
@@ -11,7 +11,7 @@ objects:
       app: cincinnati
     name: cincinnati
   spec:
-    replicas: ${{MIN_REPLICAS}}
+    replicas: ${{MAX_REPLICAS}}
     selector:
       matchLabels:
         app: cincinnati