
OTA-1378,OTA-1379: add retry logic for pulling images and less logs for sigs #969

Merged
merged 3 commits into openshift:master from fix-sig-log-retry
Oct 24, 2024

Conversation

PratikMahajan
Contributor

We're adding retry logic that retries pulling layers when a pull fails for any reason. We retry 3 times before ultimately failing.

Also moved the warn log for sig pulling down to debug and added a counter reporting how many sig images we've ignored.
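For readers skimming the conversation, here is a minimal sketch of the bounded-retry shape described above, in plain Rust. The names pull_with_retries, pull_layer, and MAX_RETRIES are illustrative, not the actual cincinnati/dkregistry API:

use std::{thread, time::Duration};

// Hypothetical bounded-retry helper; retries an arbitrary pull operation up to
// MAX_RETRIES times before surfacing the last error.
const MAX_RETRIES: usize = 3;

fn pull_with_retries<T, E: std::fmt::Display>(
    mut pull_layer: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for attempt in 1..=MAX_RETRIES {
        match pull_layer() {
            Ok(layer) => return Ok(layer),
            Err(e) => {
                eprintln!("pull attempt {attempt}/{MAX_RETRIES} failed: {e}");
                last_err = Some(e);
                thread::sleep(Duration::from_secs(1)); // brief pause before the next attempt
            }
        }
    }
    Err(last_err.expect("MAX_RETRIES is at least 1, so at least one attempt ran"))
}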

@openshift-ci openshift-ci bot requested review from petr-muller and wking October 23, 2024 23:40
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2024
@PratikMahajan PratikMahajan force-pushed the fix-sig-log-retry branch 5 times, most recently from 4d9eb3f to 0368585 Compare October 24, 2024 15:14
@PratikMahajan PratikMahajan changed the title add retry logic for pulling images and less logs for sigs OTA-1379: add retry logic for pulling images and less logs for sigs Oct 24, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 24, 2024
@openshift-ci-robot

openshift-ci-robot commented Oct 24, 2024

@PratikMahajan: This pull request references OTA-1379 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

We're adding retry logic that retries pulling layers when a pull fails for any reason. We retry 3 times before ultimately failing.

Also moved the warn log for sig pulling down to debug and added a counter reporting how many sig images we've ignored.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

if tag.contains(".sig") {
Member

Tag-based discovery is one option for finding Sigstore signatures. In the future, we might move to listing referrers (containers/image#2030). But if we do, the failure modes for this line:

  • Misclassifying a non-sig as a Sigstore signature because it happens to use this tag structure, or
  • Misclassifying a Sigstore signature as a non-sig, non-release ignored image,

both seem low, so 🤷, I'm ok with this heuristic.

Contributor Author

Misclassifying a non-sig as a sig will always be a risk if we do string comparison, but imo it should be rare.
If a signature gets classified as a non-sig, the logs should bring that to our notice; I'm not too worried about the mismatch.
We can also change this logic once we get listing referrers in dkregistry and pull it downstream into cincinnati.
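For context, cosign-style Sigstore signatures are conventionally pushed under tags of the form sha256-<manifest digest>.sig, which is what the string check keys on. A rough illustration of the loose check used in this PR next to a stricter variant (both functions are illustrative sketches, not code from the PR):

// Loose heuristic from this PR: any tag mentioning ".sig" is treated as a signature.
fn looks_like_sig_loose(tag: &str) -> bool {
    tag.contains(".sig")
}

// Stricter illustration matching cosign's conventional "sha256-<64 hex chars>.sig" tags.
fn looks_like_sig_strict(tag: &str) -> bool {
    tag.strip_prefix("sha256-")
        .and_then(|rest| rest.strip_suffix(".sig"))
        .map_or(false, |digest| {
            digest.len() == 64 && digest.bytes().all(|b| b.is_ascii_hexdigit())
        })
}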

move the encountered-signatures log from warn to debug,
and count the number of signatures as well as invalid
releases, logging the counts instead.
adds retry logic so we're more resilient to failures on
the container registry's side.
we try fetching the manifest and manifest ref 3 times before
ultimately failing.
retries fetching the blob instead of erroring out and erasing
the progress made up to that point
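A minimal sketch of the per-blob retry idea from that last commit message, under assumed names (fetch_all_blobs, fetch_blob, MAX_RETRIES are stand-ins, not dkregistry's real API): each blob fetch is retried independently, so a transient registry error on one layer does not discard layers already fetched. The excerpt under review below shows the error arm of the actual change.

const MAX_RETRIES: usize = 3; // illustrative; mirrors the "3 times" behaviour described above

// Retry each blob on its own so a hiccup on layer N keeps layers 0..N that were
// already fetched; only give up on the release after the retries are exhausted.
fn fetch_all_blobs<E: std::fmt::Display>(
    digests: &[String],
    mut fetch_blob: impl FnMut(&str) -> Result<Vec<u8>, E>,
) -> Result<Vec<Vec<u8>>, E> {
    let mut blobs = Vec::with_capacity(digests.len());
    for digest in digests {
        let mut attempt = 0;
        let blob = loop {
            attempt += 1;
            match fetch_blob(digest.as_str()) {
                Ok(b) => break b,
                Err(e) if attempt < MAX_RETRIES => {
                    eprintln!("blob {digest}: attempt {attempt} failed: {e}; retrying");
                }
                Err(e) => return Err(e), // out of retries; give up on this release
            }
        };
        blobs.push(blob);
    }
    Ok(blobs)
}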
Err(e) => {
    // signatures are not identified by dkregistry and are not useful for the
    // Cincinnati graph; don't retry, just return the error
    if tag.contains(".sig") {
        return Err(e);
Member

and then this error bubbles up and is converted to a debug message via the .sig branch of fetch_releases's get_manifest_layers handling in 022a8d6.
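A minimal sketch of that caller-side shape, i.e. the bubbled-up error being downgraded to a debug message for signature tags while a counter tracks how many were ignored. The name get_manifest_layers echoes the comment above, but the surrounding code is illustrative, not the actual fetch_releases implementation:

// Illustrative caller-side handling: failures on signature tags are counted and
// noted quietly; everything else stays a warning.
fn process_tags<E: std::fmt::Display>(
    tags: &[String],
    mut get_manifest_layers: impl FnMut(&str) -> Result<Vec<String>, E>,
) {
    let mut ignored_signatures = 0usize;
    let mut processed = 0usize;
    for tag in tags {
        match get_manifest_layers(tag.as_str()) {
            Ok(_layers) => processed += 1,
            Err(e) if tag.contains(".sig") => {
                // Signatures don't feed the Cincinnati graph; note them at debug level.
                ignored_signatures += 1;
                eprintln!("debug: ignoring signature tag {tag}: {e}"); // stand-in for log::debug!
            }
            Err(e) => eprintln!("warn: failed to fetch layers for {tag}: {e}"),
        }
    }
    eprintln!("processed {processed} tags, ignored {ignored_signatures} signature images");
}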

Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 24, 2024
Contributor

openshift-ci bot commented Oct 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: PratikMahajan, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [PratikMahajan,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4de7870 and 2 for PR HEAD 31ceb1d in total

@wking wking changed the title OTA-1379: add retry logic for pulling images and less logs for sigs OTA-1378,OTA-1379: add retry logic for pulling images and less logs for sigs Oct 24, 2024
@openshift-ci-robot

openshift-ci-robot commented Oct 24, 2024

@PratikMahajan: This pull request references OTA-1378 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

This pull request references OTA-1379 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

We're adding retry logic that retries pulling layers when a pull fails for any reason. We retry 3 times before ultimately failing.

Also moved the warn log for sig pulling down to debug and added a counter reporting how many sig images we've ignored.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@PratikMahajan
Contributor Author

/override ci/prow/customrust-cargo-test
/override ci/prow/cargo-test

override known test failures

Contributor

openshift-ci bot commented Oct 24, 2024

@PratikMahajan: Overrode contexts on behalf of PratikMahajan: ci/prow/cargo-test, ci/prow/customrust-cargo-test

In response to this:

/override ci/prow/customrust-cargo-test
/override ci/prow/cargo-test

override known test failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Oct 24, 2024

@PratikMahajan: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 894e8e8 into openshift:master Oct 24, 2024
13 checks passed
JianLi-RH added a commit to JianLi-RH/cincinnati that referenced this pull request Oct 25, 2024
This PR is only used to test openshift#969

Please ignore it, I will close it later
wking added a commit to wking/cincinnati that referenced this pull request Oct 25, 2024
…X_REPLICAS

We'd dropped 'replicas' in 8289781 (replace HPA with keda
ScaledObject, 2024-10-09, openshift#953), following AppSRE advice [1].  Rolling
that Template change out caused the Deployment to drop briefly to
replicas:1 before Keda raised it back up to MIN_REPLICAS (as predicted
[1]).  But in our haste to recover from the incident, we raised both
MIN_REPLICAS (good) and restored the replicas line in 0bbb1b8
(bring back the replica field and set it to min-replicas, 2024-10-24, openshift#967).

That means we will need some future Template change to revert
0bbb1b8 and re-drop 'replicas'.  In the meantime, every Template
application will cause the Deployment to blip to the Template-declared
value briefly, before Keda resets it to the value it prefers.  Before
this commit, the blip value is MIN_REPLICAS, which can lead to
rollouts like:

  $ oc -n cincinnati-production get -w -o wide deployment cincinnati
  NAME         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS                                          IMAGES                                                                SELECTOR
  ...
  cincinnati   0/6     6            0           86s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  cincinnati   0/2     6            0           2m17s   cincinnati-graph-builder,cincinnati-policy-engine   quay.io/app-sre/cincinnati:latest,quay.io/app-sre/cincinnati:latest   app=cincinnati
  ...

when Keda wants 6 replicas and we push:

  $ oc process -p MIN_REPLICAS=2 -p MAX_REPLICAS=12 -f dist/openshift/cincinnati-deployment.yaml | oc -n cincinnati-production apply -f -
  deployment.apps/cincinnati configured
  prometheusrule.monitoring.coreos.com/cincinnati-recording-rule unchanged
  service/cincinnati-graph-builder unchanged
  ...

The Pod terminations on the blip to MIN_REPLICAS will drop our
capacity to serve clients, and at the moment it can take some time to
recover that capacity in replacement Pods.  Changes like 31ceb1d
(add retry logic to fetching blob from container registry, 2024-10-24, openshift#969)
should speed new-Pod availability and reduce that risk.

This commit moves the blip over to MAX_REPLICAS to avoid
Pod-termination risk entirely.  Instead, we'll surge unnecessary Pods,
and potentially autoscale unnecessary Machines to host those Pods.
But then Keda will return us to its preferred value, and we'll delete
the still-coming-up Pods and scale down any extra Machines.  Spending
a bit of money on extra cloud Machines for each Template application
seems like a lower risk than the Pod-termination risk, to get us
through safely until we are prepared to remove 'replicas' again and
eat its one-time replicas:1, Pod-termination blip.

[1]: https://gitlab.cee.redhat.com/service/app-interface/-/blob/649aa9b681acf076a39eb4eecf0f88ff1cacbdcd/docs/app-sre/runbook/custom-metrics-autoscaler.md#L252 (internal link, sorry external folks)
wking added a commit to wking/cincinnati that referenced this pull request Oct 25, 2024
…X_REPLICAS
