
Action fails regularly due to ETIMEDOUT and ECONNRESET #40

Closed
ZacSweers opened this issue May 15, 2021 · 39 comments

@ZacSweers

Example runs:

https://github.com/square/anvil/pull/266/checks?check_run_id=2589215352

https://github.com/square/anvil/pull/266/checks?check_run_id=2589215611

I've seen this flaky behavior happen fairly often over the past few weeks. I'm not sure what else is going on, so I'm filing this as an FYI.

@JLLeitschuh
Contributor

This may be the same as #33

Hopefully the merge of #39 will resolve this. @eskatos, can you perform a release so we can see whether that resolves this issue for our users?

@ZacSweers
Author

Is there anything else needed for a release that I could help with? This makes most of our workflows unusable.

@JLLeitschuh
Contributor

@ZacSweers I believe you can reference this action by a commit hash. You may want to give that a shot as a stopgap.

@ZacSweers
Author

Using ef08c68 appears to resolve things for us. I'd recommend a new 1.x release tag to de-flake things for folks; we were definitely considering dropping this action otherwise, and I'm not sure how willing people are to point at a direct SHA.

@JLLeitschuh
Contributor

Should be published now as v1

@ZacSweers
Author

Thanks!

@ZacSweers
Author

We're still seeing this, unfortunately, albeit less often and now just as:

Run gradle/wrapper-validation-action@v1
Error: read ECONNRESET

@ZacSweers ZacSweers reopened this Jun 24, 2021
@ZacSweers
Author

This happens pretty consistently across the projects I work on. Unfortunately, I think we're going to have to remove this action as a result, since it's a reliability issue.

@JLLeitschuh
Contributor

Unfortunately, we don't have enough information at this time to understand what's causing this issue.

Are you using self-hosted runners, or runners hosted by GH?

@ZacSweers
Author

I see this often on GitHub-hosted runners, frequently in the square/anvil repo.

@JLLeitschuh
Contributor

@eskatos is there any way to add additional log output on failure so that we can work on understanding the root cause?

@nkvaratskhelia

nkvaratskhelia commented Sep 21, 2021

This is the error that I'm getting:
Error: connect ETIMEDOUT 104.18.164.99:443

@jameswald

GitHub-hosted runners here. When the action fails with this error, it fails across all active runs at around the same time. About 30 minutes ago, 3 runs failed simultaneously. I retried each about 20 minutes ago and they all passed.

@JLLeitschuh
Contributor

> GitHub-hosted runners here. When the action fails with this error, it fails across all active runs at around the same time. About 30 minutes ago, 3 runs failed simultaneously. I retried each about 20 minutes ago and they all passed.

That seems like something that absolutely indicates a Cloudflare issue.

@JLLeitschuh
Contributor

Okay, all this finally sent me down the right path; I think I may have figured out what's going on here. It looks like our Cloudflare WAF is being triggered at random every once in a while, and when it is, it causes a bunch of users' connections to fail. I need to talk to @eskatos about how we want to mitigate this. Thanks, everyone, for helping us figure out what was going wrong here.

[Screenshot attached, 2021-09-22]

@JLLeitschuh
Contributor

The fix has been implemented.

Please let us know if any of you continue to experience these problems. I hope this fixes the issue, but we have some additional things we can fiddle with if it doesn't.


FOR INTERNAL TRACKING (not public): https://github.com/gradle/gradle-private/issues/3435

@jivesh

jivesh commented Oct 26, 2021

Facing a similar issue. A two-line change to a class causes failures with these actions in the following runs:

  1. https://github.com/AY2122S1-CS2103-T14-2/tp/actions/runs/1386259288/attempts/2 shows ETIMEDOUT
  2. https://github.com/AY2122S1-CS2103-T14-2/tp/actions/runs/1386259288/attempts/1 shows "Client network socket disconnected before secure TLS connection was established"

@LunNova

LunNova commented Oct 29, 2021

Seeing the same issue here.

https://github.com/MinimallyCorrect/Mixin/runs/4041503110?check_suite_focus=true

Can the team publish a single file with all the hashes instead of having the action fetch hundreds of files, one per hash? These failures are only going to increase in frequency as the number of requests needed grows with every release.

checksumUrls.map(async (url: string) => httpGetText(url))
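For context, that line fans the work out into one HTTPS request per known Gradle version, so a single transient ETIMEDOUT or ECONNRESET rejects the whole batch. Below is a rough TypeScript sketch of the shape of that logic using @actions/http-client; it is not the action's exact source, and the httpGetText helper and client setup are assumptions for illustration.

import {HttpClient} from '@actions/http-client'

// Shared client for all checksum requests (illustrative configuration).
const httpc = new HttpClient('wrapper-validation-sketch')

// One request per checksum URL; any single failure rejects the whole Promise.all below.
async function httpGetText(url: string): Promise<string> {
  const response = await httpc.get(url)
  return await response.readBody()
}

async function fetchAllChecksums(checksumUrls: string[]): Promise<string[]> {
  const bodies = await Promise.all(checksumUrls.map(async (url: string) => httpGetText(url)))
  return bodies.map(body => body.trim())
}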

@JLLeitschuh
Contributor

It's not possible for us to know what version you have locally, so we have to fetch all of them.

I'll take a look at our Cloudflare logs and see if this is being caused by our infrastructure/firewall. Thanks for the ping 🙂

@codecholeric

Also ran into this right now (and yesterday), re-triggered the job, then it worked:

Run gradle/wrapper-validation-action@…
  with:
    min-wrapper-count: 1
    allow-snapshots: false
Error: Client network socket disconnected before secure TLS connection was established

GitHub-hosted runner here... Let me know if I can provide any more data that would help with this!

@LunNova

LunNova commented Oct 29, 2021

@JLLeitschuh I was thinking of adding the checksum inline to https://services.gradle.org/versions/all:

{
  "version" : "7.3-20211027231204+0000",
  "buildTime" : "20211027231204+0000",
  "current" : false,
  "snapshot" : true,
  "nightly" : false,
  "releaseNightly" : true,
  "activeRc" : false,
  "rcFor" : "",
  "milestoneFor" : "",
  "broken" : false,
  "downloadUrl" : "https://services.gradle.org/distributions-snapshots/gradle-7.3-20211027231204+0000-bin.zip",
  "checksumUrl" : "https://services.gradle.org/distributions-snapshots/gradle-7.3-20211027231204+0000-bin.zip.sha256",
  "wrapperChecksumUrl" : "https://services.gradle.org/distributions-snapshots/gradle-7.3-20211027231204+0000-wrapper.jar.sha256",
  "wrapperChecksum": "33ad4583fd7ee156f533778736fa1b4940bd83b433934d1cc4e9f608e99a6a89"
  // (The checksum would actually be shorter than the URL for where to go fetch it. ;))
},

Since the only field that gets used at the moment is the wrapper checksum, it might even be worth making a more specialized endpoint which is just a list of all wrapper checksums.

I have no idea where the code that generates/serves these is.
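For illustration, here is a minimal sketch of how a client could consume such an inline field if it existed; the wrapperChecksum property is the proposal above, not something the real endpoint serves today.

import {HttpClient} from '@actions/http-client'

// Entry shape from https://services.gradle.org/versions/all, extended with the
// proposed (hypothetical) inline checksum field.
interface GradleVersionEntry {
  version: string
  wrapperChecksumUrl?: string
  wrapperChecksum?: string // proposed addition; not present in the real payload today
}

// With inline checksums, a single request would yield every known wrapper checksum.
async function fetchKnownWrapperChecksums(): Promise<Set<string>> {
  const httpc = new HttpClient('wrapper-validation-sketch')
  const response = await httpc.get('https://services.gradle.org/versions/all')
  const entries: GradleVersionEntry[] = JSON.parse(await response.readBody())
  return new Set(entries.map(e => e.wrapperChecksum).filter((c): c is string => typeof c === 'string'))
}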

@gnarea

gnarea commented Nov 5, 2021

I've been running into this issue occasionally ever since I integrated this action, but today it's been happening like 60% of the time on macOS on CI (CI also runs on Windows and Linux, but both seem fine).

I recently upgraded to Gradle 7, in case that's relevant.

gnarea added a commit to relaycorp/awala-keystore-file-jvm that referenced this issue Nov 5, 2021
To mitigate gradle/wrapper-validation-action#40

@JLLeitschuh
Contributor

So, I've checked, and it's not our WAF causing these issues. I'm not certain what else would be causing them.

@nhouser9

Running into the same issue today. Any updates on this?

@ouchadam

Also seeing this issue a few times every day; retrying tends to work straight away.

2021-11-24T11:22:49,356896107+00:00

https://github.com/vector-im/element-android/actions/workflows/gradle-wrapper-validation.yml?query=is%3Afailure

@jrodbx

jrodbx commented Nov 24, 2021

Seeing this a lot on Paparazzi builds, mainly with Windows workers
Example run: https://github.com/cashapp/paparazzi/runs/4316309670?check_suite_focus=true

@The-Code-Monkey

Is there any further update on this? It keeps failing sporadically on both windows-2022 and ubuntu-20.04 runners.

@DanySK

DanySK commented Dec 4, 2021

I also keep getting CI failures due to this issue: Error: connect ETIMEDOUT 104.18.165.99:443.
This might be silly, but since relaunching usually fixes it, I wonder whether allowing for three retries or so would help. Connection timeouts do happen; unless the destination is genuinely unreachable, it may make sense not to fail immediately.
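For illustration, a generic retry wrapper along the lines being suggested here; this is a sketch, not the action's actual code, and the attempt count and backoff are arbitrary choices.

// Retry a flaky async call a few times before letting the failure propagate.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 3, baseDelayMs = 1000): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      if (attempt < maxAttempts) {
        // Simple linear backoff before the next attempt.
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * attempt))
      }
    }
  }
  throw lastError
}

// Usage sketch: only fail the build after several attempts on a transient
// ETIMEDOUT/ECONNRESET, e.g. const body = await withRetries(() => httpGetText(checksumUrl))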

@JLLeitschuh
Contributor

We do have retry logic enabled.

{allowRetries: true, maxRetries: 3}

That being said, I have no evidence that it's actually working. A community PR to improve debug logging would be welcomed, especially if the additional logging were only printed when the build was going to fail anyway; I'd prefer not to make the action more chatty than it needs to be otherwise. The biggest problem we currently have is a severe lack of visibility, which makes it really difficult to pin down a root cause for these issues.
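For illustration, the quoted options map onto @actions/http-client as in the sketch below; the "buffer debug output and only print it when the build is going to fail" part is a hypothetical approach to the logging request above, not existing behaviour.

import * as core from '@actions/core'
import {HttpClient} from '@actions/http-client'

// The retry options quoted above are passed as the client's request options.
const httpc = new HttpClient('wrapper-validation-sketch', undefined, {
  allowRetries: true,
  maxRetries: 3
})

// Hypothetical quiet-unless-failing logging: collect per-request details in memory
// and only surface them if validation is about to fail.
const debugBuffer: string[] = []

async function httpGetTextLogged(url: string): Promise<string> {
  const started = Date.now()
  try {
    const response = await httpc.get(url)
    debugBuffer.push(`GET ${url} -> HTTP ${response.message.statusCode} in ${Date.now() - started}ms`)
    return await response.readBody()
  } catch (error) {
    debugBuffer.push(`GET ${url} failed after ${Date.now() - started}ms: ${error}`)
    throw error
  }
}

function flushDebugBufferOnFailure(): void {
  // Called only on the failure path, so successful runs stay quiet.
  for (const line of debugBuffer) {
    core.info(line)
  }
}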

@madisp

madisp commented Mar 21, 2023

Still happens to this day; I've been unable to get a passing build after multiple retries :(

@JLLeitschuh
Contributor

Are you running this on self-hosted runners, or is this running on GitHub's infrastructure?

@madisp

madisp commented Mar 21, 2023

It was on GH infra, but it may have been an older pinned version of the action that suddenly started failing.

Upgraded from 8d49e559aae34d3e0eb16cde532684bc9702762b to ccb4328a959376b642e027874838f60f8e596de3, will report if it's still happening.

Out of curiosity, why is the action making any requests at all? Wouldn't it make sense to store the valid hash-version pairs inside the action itself and only hit the network if a local entry doesn't exist?
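For illustration, a sketch of that "bundled list first, network only for newer versions" idea; the bundled checksum set and how it is generated are assumptions, not how the action works today.

import {HttpClient} from '@actions/http-client'

// Hypothetical: wrapper checksums known when this version of the action was
// released, generated at build time and shipped inside the action bundle.
const BUNDLED_WRAPPER_CHECKSUMS: ReadonlySet<string> = new Set([
  // '...sha-256 values baked in at release time...'
])

async function isKnownWrapperChecksum(checksum: string): Promise<boolean> {
  // Local lookup first: no network traffic for wrappers from already-released versions.
  if (BUNDLED_WRAPPER_CHECKSUMS.has(checksum)) {
    return true
  }
  // Fall back to services.gradle.org only when the checksum is newer than the bundled list.
  const httpc = new HttpClient('wrapper-validation-sketch', undefined, {allowRetries: true, maxRetries: 3})
  const response = await httpc.get('https://services.gradle.org/versions/all')
  const versions: Array<{wrapperChecksumUrl?: string}> = JSON.parse(await response.readBody())
  for (const v of versions) {
    if (!v.wrapperChecksumUrl) continue
    const remote = (await (await httpc.get(v.wrapperChecksumUrl)).readBody()).trim()
    if (remote === checksum) {
      return true
    }
  }
  return false
}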

@JLLeitschuh
Contributor

Not a bad idea. @bigdaz @eskatos what are your thoughts?

@bigdaz
Member

bigdaz commented Mar 23, 2023

I have a feeling the action can/should be modified to vastly reduce the number of calls to services.gradle.org. The results could be cached in the GitHub Actions cache, and details of known-good versions could possibly be bundled directly in the action itself (so remote calls would only be required for Gradle versions released after the last wrapper release).

This issue isn't being actively worked on at this time.
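For illustration, the Actions-cache half of that idea could look roughly like the sketch below using @actions/cache; the file path and cache key are arbitrary placeholders, and this is not the eventual implementation.

import * as cache from '@actions/cache'
import * as fs from 'fs'

const CHECKSUM_FILE = 'gradle-wrapper-checksums.json' // placeholder path for the sketch
const CACHE_KEY = 'gradle-wrapper-checksums-v1'       // placeholder key for the sketch

// Restore previously fetched checksums from the GitHub Actions cache, if present.
async function loadCachedChecksums(): Promise<string[] | undefined> {
  const hitKey = await cache.restoreCache([CHECKSUM_FILE], CACHE_KEY)
  if (!hitKey) {
    return undefined
  }
  return JSON.parse(fs.readFileSync(CHECKSUM_FILE, 'utf8'))
}

// After a successful fetch from services.gradle.org, persist the result for later runs.
async function saveChecksumsToCache(checksums: string[]): Promise<void> {
  fs.writeFileSync(CHECKSUM_FILE, JSON.stringify(checksums))
  try {
    await cache.saveCache([CHECKSUM_FILE], CACHE_KEY)
  } catch (error) {
    // A cache entry for this key may already exist; that is fine for this sketch.
  }
}

As noted a couple of comments below, any such cache would need to be protected from tampering before the validation step trusts its contents.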

@DanySK

DanySK commented Mar 23, 2023

@bigdaz it makes a lot of sense

@JLLeitschuh
Contributor

> The results could be cached in the GitHub Actions cache

The only thing I would be slightly concerned about is ensuring that it is impossible for an attacker to tamper with the cache prior to this action running.

@LunNova

LunNova commented Mar 29, 2023

Is there a reason for Gradle not to add the checksums inline in https://services.gradle.org/versions/all instead of requiring a separate file to be fetched per checksum?

@bigdaz
Member

bigdaz commented Feb 1, 2024

This should be fixed by #167

@bigdaz bigdaz closed this as completed Feb 1, 2024
@bigdaz bigdaz added this to the v2.1.0 milestone Feb 1, 2024