Instability of artifact-caching-proxy #4442

Open
darinpope opened this issue Dec 6, 2024 · 4 comments

Comments

@darinpope

darinpope commented Dec 6, 2024

Service(s)

Artifact-caching-proxy

Summary

Bruno had to run the weekly BOM release process five times today (2024-12-06) because of errors like the following:

  • Could not transfer artifact com.google.crypto.tink:tink:jar:1.10.0 from/to azure-aks-internal (http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080/): Premature end of Content-Length delimited message body (expected: 2,322,048; received: 1,572,251)

Here's the issue where he tracked the build numbers so you can see the specific failures:

jenkinsci/bom#4066

I also had similar issues running a BOM weekly-test against a core RC that I'm working on.

Since I started working on the BOM a couple of months ago, this problem seems to be getting worse week over week.

Reproduction steps

Unfortunately, it is not reproducible on demand.
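
In case it helps with diagnosis, here is a minimal probe (my own sketch, not something we run today) that repeatedly fetches one artifact through the in-cluster proxy and compares the bytes received with the advertised Content-Length, i.e. the exact mismatch Maven reported above. The proxy URL comes from the error message; the artifact path assumes the standard Maven repository layout for com.google.crypto.tink:tink:jar:1.10.0.

```python
#!/usr/bin/env python3
"""Probe for truncated downloads from the artifact caching proxy.

Fetches one artifact repeatedly and compares the bytes actually received
with the advertised Content-Length, which is exactly the mismatch Maven
reported above.
"""
import urllib.request

# In-cluster proxy URL taken from the Maven error message above.
PROXY_BASE = "http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080"
# Standard Maven repository layout for com.google.crypto.tink:tink:jar:1.10.0.
ARTIFACT = "/com/google/crypto/tink/tink/1.10.0/tink-1.10.0.jar"
ATTEMPTS = 20


def probe(url: str) -> None:
    for attempt in range(1, ATTEMPTS + 1):
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                expected = int(resp.headers.get("Content-Length", "-1"))
                received = len(resp.read())
            status = "OK" if received == expected else "MISMATCH"
            print(f"attempt {attempt}: {status} expected={expected} received={received}")
        except Exception as exc:  # resets, timeouts, IncompleteRead, HTTP errors
            print(f"attempt {attempt}: ERROR {exc}")


if __name__ == "__main__":
    probe(PROXY_BASE + ARTIFACT)
```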

@darinpope darinpope added the triage (Incoming issues that need review) label Dec 6, 2024
@darinpope darinpope changed the title from "High number of …" to "Instability of artifact-caching-proxy" Dec 6, 2024
@dduportal dduportal added this to the infra-team-sync-2024-12-10 milestone Dec 7, 2024
@dduportal dduportal removed the triage (Incoming issues that need review) label Dec 9, 2024
@dduportal dduportal self-assigned this Dec 9, 2024
@dduportal
Contributor

Starting to analyse logs on the ACP side.

@dduportal
Contributor

For each of the failing requests found in the past 15 days (including every one you folks logged), ACP reported an error caused by the upstream, in the following categories:

  • upstream prematurely closed connection while reading upstream
  • peer closed connection in SSL handshake (104: Connection reset by peer) while SSL handshaking to upstream
  • upstream timed out (110: Operation timed out) while SSL handshaking to upstream
  • Error HTTP/500 responded by Artifactory

We also had one occurrence of "repo.jenkins-ci.org could not be resolved (2: Server failure)", which indicates a local DNS resolution error.
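
For reference, the tally above could be scripted along these lines (a rough sketch only, assuming nginx-style error log messages and a placeholder log path; the HTTP/500 responses from Artifactory would typically show up in the access log rather than matching these patterns):

```python
#!/usr/bin/env python3
"""Rough categorization of upstream errors in an ACP (nginx-style) error log.

The patterns are the messages quoted in this comment; the log path is a
placeholder. This is only a sketch of how the tally could be scripted,
not the actual analysis pipeline.
"""
from collections import Counter
from pathlib import Path

PATTERNS = [
    "upstream prematurely closed connection while reading upstream",
    "peer closed connection in SSL handshake",
    "upstream timed out (110: Operation timed out)",
    "could not be resolved (2: Server failure)",
]


def categorize(log_path: str) -> Counter:
    counts = Counter()
    for line in Path(log_path).read_text(errors="replace").splitlines():
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


if __name__ == "__main__":
    for pattern, count in categorize("/var/log/nginx/error.log").items():
        print(f"{count:6d}  {pattern}")
```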

@dduportal
Contributor

=> The errors are definitely not due to an ACP problem: by design, ACP simply reports the upstream error.
That said, some timeouts could be caused by the TCP tuning on the ACP instance: we have to check.

=> We could check whether we can retry the upstream request in case of error; I need to recall which cases can be caught.
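
To illustrate which failure classes a retry could plausibly catch, here is a conceptual sketch in Python (not how ACP is actually configured): premature closes, connection resets, and timeouts are transient and retryable, whereas a genuine HTTP/500 from the upstream may or may not be worth retrying.

```python
#!/usr/bin/env python3
"""Conceptual retry-with-backoff sketch (Python, for illustration only).

ACP is a reverse proxy, so this is not its actual configuration; it only
shows which failure classes a retry could catch: premature closes,
connection resets, and timeouts. 5xx responses are retried here too,
although a genuine upstream HTTP/500 may not always be worth retrying.
"""
import time
import urllib.error
import urllib.request

TRANSIENT = (urllib.error.URLError, ConnectionResetError, TimeoutError)

# Placeholder URL: the in-cluster proxy plus the artifact from the report above.
URL = ("http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080"
       "/com/google/crypto/tink/tink/1.10.0/tink-1.10.0.jar")


def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            # Retry 5xx responses, give up immediately on 4xx.
            if exc.code < 500 or attempt == attempts:
                raise
        except TRANSIENT:
            if attempt == attempts:
                raise
        time.sleep(backoff * attempt)  # simple linear backoff between attempts
    raise AssertionError("unreachable")


if __name__ == "__main__":
    print(f"fetched {len(fetch_with_retry(URL))} bytes")
```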

@dduportal
Contributor

@MarkEWaite opened a PR, based on a discussion we had during the previous infra meeting: jenkinsci/bom#4095

The goal is to "pre-heat" the cache to decrease the probability of hitting these issues.
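
For illustration only, the pre-heat idea boils down to fetching the dependency list through the proxy ahead of the real build, so that transient upstream failures happen (and can be retried) outside the critical path. The sketch below is not the implementation in jenkinsci/bom#4095; the artifact list and proxy URL are placeholders.

```python
#!/usr/bin/env python3
"""Sketch of the cache pre-heat idea: fetch a list of artifacts through the
proxy before the real build so that transient failures occur (and can be
retried) outside the critical path. Not the implementation in
jenkinsci/bom#4095; the artifact list and proxy URL are placeholders.
"""
import concurrent.futures
import urllib.request

PROXY_BASE = "http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080"
ARTIFACTS = [
    "/com/google/crypto/tink/tink/1.10.0/tink-1.10.0.jar",
    # ... the rest of the dependency list, e.g. generated from the BOM
]


def warm(path: str) -> str:
    try:
        with urllib.request.urlopen(PROXY_BASE + path, timeout=120) as resp:
            resp.read()  # pull the full body so the proxy caches it
        return f"cached {path}"
    except Exception as exc:
        return f"FAILED {path}: {exc}"


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for result in pool.map(warm, ARTIFACTS):
            print(result)
```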
