OTA-1385: dist/openshift/cincinnati-deployment: Shift Deployment replicas to MAX_REPLICAS #975

Merged

Conversation

@wking (Member) commented Oct 25, 2024

We'd dropped replicas in 8289781 (#953), following AppSRE advice (internal docs, sorry external folks). Rolling that Template change out caused the Deployment to drop briefly to replicas:1 before Keda raised it back up to MIN_REPLICAS (as predicted, same internal link). But in our haste to recover from the incident, we both raised MIN_REPLICAS (good) and restored the replicas line in 0bbb1b8 (#967).

That means we will need some future Template change to revert 0bbb1b8 and re-drop replicas. In the meantime, every Template application will cause the Deployment to blip to the Template-declared value briefly, before Keda resets it to the value it prefers. Before this commit, the blip value is MIN_REPLICAS, which can lead to rollouts like:

$ oc -n cincinnati-production get -w -o wide deployment cincinnati
NAME         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS                                          IMAGES                                                                SELECTOR
...
cincinnati   0/6     6            0           86s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
cincinnati   0/2     6            0           2m17s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
...

when Keda wants 6 replicas and we push:

$ oc process -p MIN_REPLICAS=2 -p MAX_REPLICAS=12 -f dist/openshift/cincinnati-deployment.yaml | oc -n cincinnati-production apply -f -
deployment.apps/cincinnati configured
prometheusrule.monitoring.coreos.com/cincinnati-recording-rule unchanged
service/cincinnati-graph-builder unchanged
...

The Pod terminations on the blip to MIN_REPLICAS will drop our capacity to serve clients, and at the moment it can take some time to recover that capacity in replacement Pods. Changes like 31ceb1d (#969) should speed new-Pod availability and reduce that risk.

This commit moves the blip over to MAX_REPLICAS to avoid Pod-termination risk entirely. Instead, we'll surge unnecessary Pods, and potentially autoscale unnecessary Machines to host those Pods. But then Keda will return us to its preferred value, and we'll delete the still-coming-up Pods and scale down any extra Machines. Spending a bit of money on extra cloud Machines for each Template application seems like a lower risk than terminating Pods, and it gets us through safely until we are prepared to remove replicas again and eat its one-time replicas:1, Pod-termination blip.
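
For reference, the mechanical change in dist/openshift/cincinnati-deployment.yaml is a one-line swap of which Template parameter feeds the Deployment's replicas. A minimal sketch of the relevant stanza, with the surrounding fields elided (the exact context in the file may differ):

# Deployment stanza inside the Template's objects list (sketch only)
spec:
  # before this commit: blip down to the Keda floor on every Template application
  # replicas: ${{MIN_REPLICAS}}
  # after this commit: blip up to the Keda ceiling instead, surging rather than terminating Pods
  replicas: ${{MAX_REPLICAS}}

The ${{...}} form is the OpenShift Template syntax for substituting a parameter without quoting it, which matters here because replicas must be an integer.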

dist/openshift/cincinnati-deployment: Shift Deployment replicas to MAX_REPLICAS

We'd dropped 'replicas' in 8289781 (replace HPA with keda
ScaledObject, 2024-10-09, openshift#953), following AppSRE advice [1].  Rolling
that Template change out caused the Deployment to drop briefly to
replicas:1 before Keda raised it back up to MIN_REPLICAS (as predicted
[1]).  But in our haste to recover from the incident, we raised both
MIN_REPLICAS (good) and restored the replicas line in 0bbb1b8
(bring back the replica field and set it to min-replicas, 2024-10-24, openshift#967).

That means we will need some future Template change to revert
0bbb1b8 and re-drop 'replicas'.  In the meantime, every Template
application will cause the Deployment to blip to the Template-declared
value briefly, before Keda resets it to the value it prefers.  Before
this commit, the blip value is MIN_REPLICAS, which can lead to
rollouts like:

  $ oc -n cincinnati-production get -w -o wide deployment cincinnati
  NAME         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS                                          IMAGES                                                                SELECTOR
  ...
  cincinnati   0/6     6            0           86s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  cincinnati   0/2     6            0           2m17s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  ...

when Keda wants 6 replicas and we push:

  $ oc process -p MIN_REPLICAS=2 -p MAX_REPLICAS=12 -f dist/openshift/cincinnati-deployment.yaml | oc -n cincinnati-production apply -f -
  deployment.apps/cincinnati configured
  prometheusrule.monitoring.coreos.com/cincinnati-recording-rule unchanged
  service/cincinnati-graph-builder unchanged
  ...

The Pod terminations on the blip to MIN_REPLICAS will drop our
capacity to serve clients, and at the moment it can take some time to
recover that capacity in replacement Pods.  Changes like 31ceb1d
(add retry logic to fetching blob from container registry, 2024-10-24, openshift#969)
should speed new-Pod availability and reduce that risk.

This commit moves the blip over to MAX_REPLICAS to avoid
Pod-termination risk entirely.  Instead, we'll surge unnecessary Pods,
and potentially autoscale unnecessary Machines to host those Pods.
But then Keda will return us to its preferred value, and we'll delete
the still-coming-up Pods and scale down any extra Machines.  Spending
a bit of money on extra cloud Machines for each Template application
seems like a lower risk than the Pod-termination risk, to get us
through safely until we are prepared to remove 'replicas' again and
eat its one-time replicas:1, Pod-termination blip.

[1]: https://gitlab.cee.redhat.com/service/app-interface/-/blob/649aa9b681acf076a39eb4eecf0f88ff1cacbdcd/docs/app-sre/runbook/custom-metrics-autoscaler.md#L252 (internal link, sorry external folks)
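
For readers less familiar with the Keda side: the ScaledObject added in 8289781 owns the Deployment's replica count between the two parameter-driven bounds, which is why whatever value oc apply writes is only transient. A rough sketch of that shape, not the actual manifest from this repository (names are illustrative and the trigger configuration is omitted):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cincinnati            # hypothetical name, for illustration only
spec:
  scaleTargetRef:
    name: cincinnati          # the Deployment this Template renders
  minReplicaCount: 2          # fed by the MIN_REPLICAS Template parameter
  maxReplicaCount: 12         # fed by the MAX_REPLICAS Template parameter
  triggers: []                # real triggers omitted from this sketch

Keda keeps reconciling the Deployment toward whatever count its triggers ask for within those bounds, so after a Template application the only lasting effect of the replicas line is the direction of the blip: down to the floor before this commit, up to the ceiling after it.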
openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Oct 25, 2024

openshift-ci-robot commented Oct 25, 2024

@wking: This pull request references OTA-1385 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.


openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Oct 25, 2024
@PratikMahajan (Contributor) left a comment


/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Nov 1, 2024

openshift-ci bot commented Nov 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: PratikMahajan, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [PratikMahajan,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4398c21 and 2 for PR HEAD 451ef9b in total

1 similar comment

@PratikMahajan (Contributor)

/override ci/prow/cargo-test
/override ci/prow/customrust-cargo-test


openshift-ci bot commented Nov 4, 2024

@PratikMahajan: Overrode contexts on behalf of PratikMahajan: ci/prow/cargo-test, ci/prow/customrust-cargo-test


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4398c21 and 2 for PR HEAD 451ef9b in total


openshift-ci bot commented Nov 4, 2024

@wking: all tests passed!

Full PR test history. Your PR dashboard.


openshift-merge-bot merged commit 013989a into openshift:master on Nov 4, 2024
13 checks passed
wking deleted the replicas-from-MAX_REPLICAS branch on November 4, 2024, 17:36