dist/openshift/cincinnati-deployment: Shift Deployment replicas to MA… · wking/cincinnati@2a25490

Commit

dist/openshift/cincinnati-deployment: Shift Deployment replicas to MAX_REPLICAS

We'd dropped 'replicas' in 8289781 (replace HPA with keda
ScaledObject, 2024-10-09, openshift#953), following AppSRE advice [1].  Rolling
that Template change out caused the Deployment to drop briefly to
replicas:1 before Keda raised it back up to MIN_REPLICAS (as predicted
[1]).  But in our haste to recover from the incident, we both raised
MIN_REPLICAS (good) and restored the replicas line in 0bbb1b8
(bring back the replica field and set it to min-replicas, 2024-10-24, openshift#967).

That means we will need some future Template change to revert
0bbb1b8 and re-drop 'replicas'.  In the meantime, every Template
application will cause the Deployment to blip to the Template-declared
value briefly, before Keda resets it to the value it prefers.  Before
this commit, the blip value is MIN_REPLICAS, which can lead to
rollouts like:

  $ oc -n cincinnati-production get -w -o wide deployment cincinnati
  NAME         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS                                          IMAGES                                                                SELECTOR
  ...
  cincinnati   0/6     6            0           86s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  cincinnati   0/2     6            0           2m17s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  ...

when Keda wants 6 replicas and we push:

  $ oc process -p MIN_REPLICAS=2 -p MAX_REPLICAS=12 -f dist/openshift/cincinnati-deployment.yaml | oc -n cincinnati-production apply -f -
  deployment.apps/cincinnati configured
  prometheusrule.monitoring.coreos.com/cincinnati-recording-rule unchanged
  service/cincinnati-graph-builder unchanged
  ...

The Pod terminations on the blip to MIN_REPLICAS will drop our
capacity to serve clients, and at the moment it can take some time to
recover that capacity in replacement Pods.  Changes like 31ceb1d
(add retry logic to fetching blob from container registry, 2024-10-24, openshift#969)
should speed new-Pod availability and reduce that risk.

This commit moves the blip over to MAX_REPLICAS to avoid
Pod-termination risk entirely.  Instead, we'll surge unnecessary Pods,
and potentially autoscale unnecessary Machines to host those Pods.
But then Keda will return us to its preferred value, and we'll delete
the still-coming-up Pods and scale down any extra Machines.  Spending
a bit of money on extra cloud Machines for each Template application
seems like a lower risk than the Pod-termination risk, to get us
through safely until we are prepared to remove 'replicas' again and
eat its one-time replicas:1, Pod-termination blip.
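
To make the trade-off concrete, here is a rough sketch of the blip
arithmetic using the numbers from the example above (MIN_REPLICAS=2,
MAX_REPLICAS=12, Keda preferring 6).  The blip_delta helper is
hypothetical and only illustrates the sign of the change; it is not
part of this repository.

```shell
#!/bin/sh
# Sketch only: blip_delta is a hypothetical helper, not part of the repository.
#   $1 = replicas value declared in the Template
#   $2 = replica count Keda currently prefers
# Prints the signed change in Pod count during the blip:
# negative means running Pods are terminated, positive means extra Pods surge.
blip_delta() {
  echo $(( $1 - $2 ))
}

# Before this commit: the Template declares MIN_REPLICAS=2 while Keda wants 6.
blip_delta 2 6    # prints -4 (four serving Pods terminated)

# After this commit: the Template declares MAX_REPLICAS=12 while Keda wants 6.
blip_delta 12 6   # prints 6 (six surplus Pods started, then removed)
```

A negative delta costs serving capacity until replacement Pods become
ready; a positive delta only costs temporary extra Pods and possibly
Machines.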

[1]: https://gitlab.cee.redhat.com/service/app-interface/-/blob/649aa9b681acf076a39eb4eecf0f88ff1cacbdcd/docs/app-sre/runbook/custom-metrics-autoscaler.md#L252 (internal link, sorry external folks)


wking committed Oct 25, 2024

1 parent d191b8c commit 2a25490

dist/openshift/cincinnati-deployment.yaml

@@ -11,7 +11,7 @@ objects:
             app: cincinnati
           name: cincinnati
         spec:
-          replicas: ${{MIN_REPLICAS}}
+          replicas: ${{MAX_REPLICAS}}
           selector:
             matchLabels:
               app: cincinnati
