Gradle build: use caching mirror instead of Maven Central #4208

Open
cloudshiftchris opened this issue Jul 22, 2024 · 21 comments
Labels
framework 🏗️ Pertains to the core structure and components of the Kotest framework. maintenance 🔨 This label is for issues and pull requests that involve routine updates, codebase cleanup, and other

Comments

@cloudshiftchris

cloudshiftchris commented Jul 22, 2024

Maven Central has had to introduce throttling to manage the influx of traffic directly to the authoritative repositories.

Instead of using Maven Central directly and contributing to the load, use a caching mirror, e.g.:

      // use a caching mirror for maven central. https://www.sonatype.com/blog/maven-central-and-the-tragedy-of-the-commons
      maven("https://cache-redirector.jetbrains.com/repo1.maven.org/maven2") {
         name = "JetBrains Maven Central mirror"
      }

In most cases these repositories won't be accessed, thanks to Gradle caching for local and CI builds; the mirror helps with routine dependency updates, and more so with major upgrades that flush the local and/or CI cache (new workstations, new developers, Gradle upgrades, etc.).

This also applies to the Gradle Plugin Portal repository which redirects to Maven Central.

I'll create a PR for this...

@aSemy
Contributor

aSemy commented Jul 22, 2024

Hey, thanks for raising this. I am very interested in improving performance and avoiding unnecessary downloading.

However, I don't think using JetBrains' cache redirectors is a good idea. They aren't documented, there are no reliability or stability guarantees, and they aren't intended for public use.

I also suspect that GitHub already provides some level of caching for downloading dependencies (although I can't find any docs to support this). This is most evident when the Kotlin Native distributions are downloaded. 200MB in 10ms is quite impressive!

Kotest uses the setup-gradle action, which will use the GitHub Action cache. This will cache downloaded dependencies. (Although it's bugged at the moment, but that will hopefully be fixed by #4207.)

@cloudshiftchris
Author

cloudshiftchris commented Jul 22, 2024

Thanks for the feedback. Had selected the JetBrains mirror as a) am using it elsewhere and b) it's used farther down in that settings.gradle.kts for npm & yarn deps.

Fair point re: unsupported/undocumented, unable to find any official references - seems like this has leaked out into the wild...

GitHub doesn't do caching directly. What we're seeing there is JetBrains redirecting to a CDN URL (that 10ms is the time to "download the redirect"): it's a redirect to https://download-cdn.jetbrains.com/kotlin/native/builds/releases/1.9.24/linux-x86_64/kotlin-native-prebuilt-linux-x86_64-1.9.24.tar.gz (curl -v https://download.jetbrains.com/kotlin/native/builds/releases/1.9.24/linux-x86_64/kotlin-native-prebuilt-linux-x86_64-1.9.24.tar.gz), which is then likely resolved from local cache. The time to download that 200MB is ~2-3s once cached in the CDN: time curl -o /dev/null -v https://download-cdn.jetbrains.com/kotlin/native/builds/releases/1.9.24/linux-x86_64/kotlin-native-prebuilt-linux-x86_64-1.9.24.tar.gz

We could investigate other Maven Central mirrors, though the value here is fairly minimal - as noted above, it's only spurious non-cached downloads in most cases - perhaps the effort to vet this isn't worth the value. lmk your thoughts.

@aSemy
Contributor

aSemy commented Jul 23, 2024

> Thanks for the feedback. Had selected the JetBrains mirror as a) am using it elsewhere and b) it's used farther down in that settings.gradle.kts for npm & yarn deps.

Ahh good point, I forgot about the NPM repo using the JB cache-redirector. It doesn't need to. I think I copy-pasted them from Dokka's build config, but I didn't update the URL.

> GitHub doesn't do caching directly - what we're seeing there is JetBrains redirecting to a CDN URL (that 10ms is the time to "download the redirect"), it's a redirect to https://download-cdn.jetbrains.com/kotlin/native/builds/releases/1.9.24/linux-x86_64/kotlin-native-prebuilt-linux-x86_64-1.9.24.tar.gz (curl -v https://download.jetbrains.com/kotlin/native/builds/releases/1.9.24/linux-x86_64/kotlin-native-prebuilt-linux-x86_64-1.9.24.tar.gz), which is then likely resolved from local cache. The time to download that 200MB is ~2-3s once cached in the CDN: time curl -o /dev/null -v https://download-cdn.jetbrains.com/kotlin/native/builds/releases/1.9.24/linux-x86_64/kotlin-native-prebuilt-linux-x86_64-1.9.24.tar.gz

Ahh that's what it is! Thanks for checking and explaining! Now that Kotest uses Gradle 8.9 the GitHub Actions cache will hopefully be less overloaded, and that gives more space for K/N dist caching, so I will resurrect the K/N dist caching PR.

> We could investigate other Maven Central mirrors, though the value here is fairly minimal - as noted above, it's only spurious non-cached downloads in most cases - perhaps the effort to vet this isn't worth the value. lmk your thoughts.

Yes, I agree.

Also, we can check the build scans to see how much is downloaded. Here are some scans from the latest master build.

They both show that dependencies were downloaded, and I think this must be because the GitHub Cache was still overloaded. It's currently under the 10GB limit, so let's see if the next build performs better...

image

@cloudshiftchris
Author

Good stuff.

For the Gradle deps being downloaded - wondering if there isn't a race condition here; the setup-gradle action, on completion, clears out all artifacts unused since the start of the action (the code stores now at the start and uses that in the post-build cleanup). This is problematic for complex builds that run Gradle multiple times - an artifact/transform/... that is used in one build step may be removed by another. I'll see about crafting a test case for this and opening a ticket if this logic is correct.

For the K/N caching there are other GHA uses (e.g. https://github.com/kittinunf/Result/blob/master/.github/workflows/Release.yml), though they don't do crossOs as the PR does, which is preferable.

Related to caching/performance - the D: symlink for Gradle home will no longer be required as of setup-gradle v4 (currently in beta) - had created a ticket for this: gradle/actions#290.

@cloudshiftchris
Author

cloudshiftchris commented Jul 23, 2024

It looks like there is a cache race condition with multiple steps / multiple workflows that use the shared Gradle caches (distributions, dependencies, etc. - anything in the action cache summary view that doesn't include the workflow/step name).

Opened a ticket here for it.

@aSemy
Contributor

aSemy commented Jul 23, 2024

Good catch!

@aSemy
Contributor

aSemy commented Jul 23, 2024

I've been pondering something else related to Maven Central, and I wonder what your thoughts are:

Currently Kotest publishes a snapshot release every. single. commit. to Sonatype. But I don't think this is needed. Kotest should debounce the 'publish snapshots' triggering commits, so that if 10 commits are merged to master in quick succession, only the last release is published. This would help Kotest's CI (since publishing all artifacts is slow, over 30 minutes), and also Sonatype (they don't need to host artifacts that aren't ever used).

This could be done by having a scheduled GitHub Action that only runs every 30 minutes or so, and quits if there have been no commits to master since it last ran.

WDYT?

@cloudshiftchris
Author

I'm assuming this is the master.yml workflow, which will run on every push to the master branch (which I assume are PR merges, so at least we aren't publishing before PRs are merged):

on:
   push:
      paths-ignore:
         - 'doc/**'
         - 'documentation/**'
         - '*.md'
         - '*.yml'
         - '.github/workflows/**'
      branches:
         - master

Agreed this is inefficient - as you suggested, publishing periodically to batch up commits would be preferable. The challenge (not insurmountable) will be to craft the logic for "when did we last publish" - what is stored, how do we check that.

There are some examples of using git log formatted output / bash scripting to determine the delta since last commit: https://stackoverflow.com/questions/73836626/github-action-that-fails-if-no-commit-for-24-hours, not a bad starting point. If we used "last commit" (different than "last publish") we'll need to handle the logic, e.g. > 30m < 60m -> publish, > 60m -> don't publish (assumption is that it was already published).

Or we store a 'last published' timestamp and use that.
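
A minimal sketch of the "last commit" variant, assuming the scheduled job can read the latest commit time with `git log -1 --format=%ct` and the current time with `date +%s`. The function name and the wait / publish / skip labels are hypothetical; the 30 and 60 minute bounds are the ones suggested above.

```shell
#!/bin/sh
# Sketch only: decide what a scheduled run should do, based on the age of
# the latest commit. Function name and labels are hypothetical.

# publish_decision_by_commit_age COMMIT_EPOCH NOW_EPOCH
# Prints one of: wait, publish, skip.
publish_decision_by_commit_age() {
  age_seconds=$(( $2 - $1 ))
  if [ "$age_seconds" -le 1800 ]; then
    echo "wait"      # 30 minutes or less: more commits may still land
  elif [ "$age_seconds" -lt 3600 ]; then
    echo "publish"   # between 30 and 60 minutes old
  else
    echo "skip"      # 60 minutes or more: assumed already published
  fi
}

# Example wiring (not executed here):
#   publish_decision_by_commit_age "$(git log -1 --format=%ct origin/master)" "$(date +%s)"
```

How the two bounds line up with the schedule interval would still need tuning, since a commit that never falls inside the window on any run would not be published.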

We could also see if the event passed to the workflow for a scheduled run has any useful context to help (here's an example of an event for a push, not a scheduled job though, and the action to dump it out)

All that assumes a separate publish-snapshot workflow, which should include all the tests as well (otherwise, if tests are failing, the publish will happily take whatever is there every 30m and publish those snapshots, which presumably shouldn't be released in that state).

@cloudshiftchris
Author

publish-snapshot step downloads a publish-marker artifact - if not present, or if delta > 30m, it publishes the snapshot, and uploads a new publish-marker artifact.
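
A rough sketch of that marker check, assuming the publish-marker artifact has already been downloaded to a local file that holds the epoch seconds of the last publish. The file format and function names are hypothetical, and the artifact download/upload steps themselves are left out.

```shell
#!/bin/sh
# Sketch only: publish when the marker is missing or older than 30 minutes.
# The marker format (one line of epoch seconds) is an assumption.

# publish_decision_by_marker MARKER_FILE NOW_EPOCH
# Prints "publish" or "skip".
publish_decision_by_marker() {
  marker_file=$1
  now_epoch=$2
  if [ ! -s "$marker_file" ]; then
    echo "publish"   # no marker yet: nothing has been published
    return
  fi
  last_epoch=$(cat "$marker_file")
  if [ $(( now_epoch - last_epoch )) -gt 1800 ]; then
    echo "publish"   # last publish was more than 30 minutes ago
  else
    echo "skip"
  fi
}

# write_marker MARKER_FILE NOW_EPOCH
# Records a successful publish; the workflow would then upload this file.
write_marker() {
  printf '%s\n' "$2" > "$1"
}
```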

@sksamuel
Member

Isn't there a github action to cancel concurrent jobs if another one is started after?
So we could have a general "build/test" workflow, which then triggers a "publish" workflow, which cancels if another one is started after (and we just keep doing that until eventually one completes?)

@cloudshiftchris
Author

Toyed with this (there is an action that does this) - seems unpredictable to have in-flight builds that perhaps partway publish, more deterministic to explicitly control the lifecycle.

@sksamuel
Member

That's fair.
Is there a quiet period option built into GHA?

@cloudshiftchris
Author

There's nothing built-in - workflows are triggered via the on conditions (push, schedule, etc) - there's nothing there to batch / debounce / delay. As with so many event-driven solutions we're left to build higher-level constructs...

@sksamuel
Member

If we used a cron to trigger a build every hour, we could use the SHA in the snapshot, so 1.2.3-6FE42A-SNAPSHOT and then check sonatype to see if it exists. If so, skip that run.

@cloudshiftchris
Author

Interesting. That gives us two options for "last published" - query Sonatype (not sure what that entails) or store a "last published" marker artifact timestamp file.

@sksamuel
Member

Querying Sonatype would be as simple as a curl to the repo to see if it's a 20x.
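
As a sketch of that idea, assuming the version embeds the first six characters of the commit SHA in upper case (as in 1.2.3-6FE42A-SNAPSHOT) and that a 20x response means the snapshot already exists. The function names are hypothetical, and `$ARTIFACT_URL` is a placeholder: the actual repository path would need to be worked out.

```shell
#!/bin/sh
# Sketch only: build a SHA-stamped snapshot version and classify the HTTP
# status of a lookup. Function names are hypothetical.

# snapshot_version BASE_VERSION COMMIT_SHA
# Prints e.g. 1.2.3-6FE42A-SNAPSHOT.
snapshot_version() {
  short_sha=$(printf '%s' "$2" | cut -c1-6 | tr '[:lower:]' '[:upper:]')
  echo "$1-$short_sha-SNAPSHOT"
}

# is_published HTTP_STATUS
# Succeeds (exit 0) when the status is a 20x.
is_published() {
  case $1 in
    20[0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Example wiring (not executed here; ARTIFACT_URL is a placeholder):
#   status=$(curl -s -o /dev/null -w '%{http_code}' "$ARTIFACT_URL")
#   if is_published "$status"; then echo "already published, skipping"; fi
```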

@cloudshiftchris
Author

@aSemy looks like I misunderstood parts of the setup-gradle caching - while not as space-efficient as it could be there isn't a race condition.

Dug into this actions run for this commit that only contains changes to Kotlin files (i.e. since no deps have changed we'd expect all Gradle deps to load from cache).

The first job shows no network activity (from the build scan).

The build scan for the second job 'Validate on primary runner / run-tests summary', specifically the :check Gradle task, shows 779 files / 313MB downloaded.

The jobs for Windows & Mac resolve all dependencies from cache (no network activity).

The final job has considerable network activity - but that's expected, it's to oss.sonatype.org, setting everything up for publishing.

Given that all Gradle executions converge to run-gradle.yml and hence have the same "setup-gradle", I'd expect all dependencies to be cached, but that isn't the case. The total cache size is now 8.85GB / 10GB, so we aren't overflowing.

Will look further into this / compare other action runs to see if we can improve on the Gradle deps caching. lmk if you have ideas on this.

@cloudshiftchris
Author

cloudshiftchris commented Jul 24, 2024

Ok, so...

  1. validate-api job (:apiCheck) works fine wrt storing / restoring Gradle deps from cache;
  2. validate-primary job (:check) is not storing cache entries:

(while this is for the deps cache there are many other cache entries with the same challenge)

Entry: /home/runner/.gradle/caches/modules-*/files-*/*/*/*/*
    Requested Key : gradle-dependencies-v1-a702ec8f84890c954ac77cc8b9b7e1be
    Restored  Key : gradle-dependencies-v1-a702ec8f84890c954ac77cc8b9b7e1be
              Size: 445 MB (466172742 B)
              (Entry restored: exact match found)
    Saved     Key : 
              Size: 
              (Entry not saved: referencing 'Gradle User Home' cache entry not saved)
  3. the secondary Mac/Windows executions work fine wrt storing / restoring Gradle deps from cache

If I understand the issue correctly - because the Gradle User Home cache didn't change between #1 and #2 (same OS, commit SHA etc) the setup-gradle action doesn't store the sub-caches, so it will keep re-downloading the deps. #3 works as the OS change results in separate cache entries.

Perhaps an easy fix - collapse those first two runs, equivalent of ./gradlew apiCheck check.

@aSemy
Contributor

aSemy commented Jul 24, 2024

There's at least one free service for debouncing webhooks: https://hookbox.freighter.studio/.

But I think the easiest way would be to have a scheduled GitHub Workflow that runs every 60 minutes (or something like that). The workflow could also be triggered on-demand, if we happen to be eager to release.

Every time the workflow runs, it will use the GitHub CLI/API to determine the status of the last commit to master, and check if it had a successful publishing run. If it did, the workflow quits. If it didn't, then it launches the 'publish all' workflow.

Here's a demo of how to get the information using the GitHub CLI:

#!/bin/zsh

# determine the latest commit
REF=$(gh api repos/kotest/kotest/branches/master --jq '.commit.sha')

echo Latest sha "$REF"

# Determine whether the 'publish' Workflow was successful
CONCLUSION=$(gh api /repos/kotest/kotest/commits/"$REF"/check-runs \
  --jq '.check_runs[] | select(.name | contains("Publish all artifacts")) | .conclusion')

echo publish all artifacts result: "$CONCLUSION"
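
The quit-or-launch step described above could then look roughly like this, assuming the conclusion string from the demo is passed in. The function name is hypothetical, and so is the workflow file name in the commented `gh workflow run` call.

```shell
#!/bin/sh
# Sketch only: map the check-run conclusion to the scheduled workflow's
# next step. Function name and workflow file name are hypothetical.

# next_action CONCLUSION
# Prints "quit" when publishing already succeeded, otherwise "publish".
next_action() {
  if [ "$1" = "success" ]; then
    echo "quit"
  else
    echo "publish"   # failed, cancelled, or no publish run found
  fi
}

# Example wiring (not executed here; publish.yml is a placeholder name):
#   if [ "$(next_action "$CONCLUSION")" = "publish" ]; then
#     gh workflow run publish.yml --ref master
#   fi
```

One caveat: a publish run that is still in progress has no conclusion yet, so this sketch would treat it as "publish"; a real version would probably want to check the run status as well.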

@aSemy
Contributor

aSemy commented Jul 24, 2024

> Perhaps an easy fix - collapse those first two runs, equivalent of ./gradlew apiCheck check.

Yes, I'd like to do this. It'd also be convenient to run gradle check for each OS using a 'matrix', and set fail-fast=true. That would make the GitHub action much faster.

(check depends on apiCheck, so just ./gradlew check will be enough.)

@aSemy
Contributor

aSemy commented Jul 24, 2024

It'd also be great to set up nexus-staging actions. At the moment Kotest publishing has to be done as slowly as possible, otherwise Sonatype gets confused and creates split repos. But if an action opened a repo ahead of time, then Kotest could publish (in parallel) to that repo, without issue.

github-merge-queue bot pushed a commit that referenced this issue Jul 28, 2024
Using the JetBrains cache redirector for nodejs.org has no benefit.
nodejs.org can be used directly.

See #4208
@LeoColman LeoColman added framework 🏗️ Pertains to the core structure and components of the Kotest framework. maintenance 🔨 This label is for issues and pull requests that involve routine updates, codebase cleanup, and other labels Sep 19, 2024
4 participants