Automated retries? #502

Closed
dylanmtaylor opened this issue Feb 18, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@dylanmtaylor
Contributor

I see that build actions sometimes fail. I think we should use the retry action in the ublue builds, with an attempt limit of 3.
https://github.com/marketplace/actions/retry-action

That way, if it's a weird network issue or something, we won't go a day without a new image.
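As a rough illustration of what "an attempt limit of 3" means, here is a minimal Python sketch of a bounded retry loop. It is not the marketplace action's implementation; the `retry` helper and its `delay_seconds` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(step: Callable[[], T], max_attempts: int = 3, delay_seconds: float = 5.0) -> T:
    """Run `step`, retrying on any exception, for at most `max_attempts` tries.

    Hypothetical helper illustrating an attempt limit; it is not the API of
    any particular GitHub Action.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: re-raise the last error so the job still fails visibly.
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A cap like this keeps a genuinely broken step from looping forever: after the third failure the error surfaces exactly as it would without retries.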

@bsherman
Contributor

I've requested changes on the associated PR (#503), but I'll add some thoughts here for good measure.

@dylanmtaylor is very much correct that we sometimes have spurious failures (from various causes, which likely include network issues), and those leave the team needing to manually retry some runs of the workflow. For example, in a run hit by a spurious failure, most of the matrix jobs usually succeed while one or two fail.

I do agree that automatically retrying certain steps of the workflow will be helpful.

In my requested changes, I identified the two steps where automatic retries are most useful:

  1. Get current version was recently discovered to have intermittent but SILENT failures, so I fixed it to actually halt workflow execution rather than produce images with incomplete metadata. This is a simple step and a great candidate for automatic retry. In every failure case I've seen, the cause was clearly some intermittent issue with the network or on the GHCR servers.
  2. Push to GHCR is also a great candidate: if we reach this step, all that remains is to publish the cleanly built image. However, it does fail on occasion, and the fault is always some issue with the network or on the GHCR server side. Our very large image sizes likely don't help here.
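The key point in item 1 is that a retry can only help once a silent failure has been turned into a hard one. A minimal sketch of that idea, assuming a hypothetical `fetch` callable that stands in for whatever queries the registry: an empty answer, or the literal `null` that a jq-style lookup prints for a missing key (an assumption about the lookup's output), raises instead of flowing into image metadata.

```python
from typing import Callable, Optional


class MetadataError(RuntimeError):
    """Raised when the current version could not be determined."""


def get_current_version(fetch: Callable[[], Optional[str]]) -> str:
    """Return the version reported by `fetch`, or raise instead of returning ''.

    `fetch` is a hypothetical stand-in for the registry query. Raising here
    halts the step, so a surrounding retry (or a human) sees the failure
    rather than an image being built with incomplete metadata.
    """
    raw = fetch()
    version = (raw or "").strip()
    # Treat empty output and jq-style "null" (assumed) as failures, not values.
    if not version or version.lower() == "null":
        raise MetadataError(f"could not determine current version (got {raw!r})")
    return version
```

Push to GHCR needs no such check, since a failed push already exits non-zero; it only needs the retry itself.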

I've specifically requested that we do NOT auto-retry the most complex step, Build Image. The most common causes of failure there are legitimate, usually an upstream RPM dependency issue. The one spurious issue I know of in Build Image is related to the github-release-install.sh shell script, which helps us install RPM packages directly from a project's GitHub release. That is where I'd like to see the shell script improved to handle those failures and retry internally. I've already made one such attempt, with only partial success.
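The distinction drawn here (retry infrastructure hiccups, never legitimate failures) can also be applied inside a download helper. The following is a Python sketch, not the actual github-release-install.sh; `is_transient` and `download_with_retry` are hypothetical, and the choice of which errors count as transient (HTTP 429, 5xx, and connection-level errors) is an assumption.

```python
import time
import urllib.error
import urllib.request


def is_transient(exc: Exception) -> bool:
    """Guess whether an error is worth retrying.

    Assumption: 429 and 5xx responses plus connection-level errors are
    transient; other HTTP errors (e.g. 404 for a missing asset) are real
    failures that a retry would not fix.
    """
    # Check HTTPError first: it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code == 429 or 500 <= exc.code < 600
    return isinstance(exc, (urllib.error.URLError, TimeoutError, ConnectionError))


def download_with_retry(url: str, max_attempts: int = 3, delay_seconds: float = 5.0) -> bytes:
    """Fetch `url`, retrying only transient failures (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            # urlopen raises HTTPError for non-2xx responses, so HTTP errors are never silent.
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
        except Exception as exc:
            if attempt == max_attempts or not is_transient(exc):
                raise
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Failing immediately on a 404 mirrors the argument for not retrying Build Image: repeating a request for an asset that does not exist only delays the real error.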

In addition to all this, I'd really like to see these improvements in ublue-os/main... but I hesitate to implement them in the six other "foundational"/"hardware enablement" repos we maintain. We've already discussed merging and cleaning them up, as it's currently very messy to maintain them all as distinct repos.

Hope that provides some context to any reader regarding my views on this topic.

bsherman added a commit to ublue-os/ucore that referenced this issue Feb 19, 2024
A helpful issue was filed, along with a PR, which will help address some
spurious issues with the GitHub Actions workflows. That inspired me to
improve the github-release-install.sh script so that it more properly
fails (and retries) when HTTP errors occur.

Relates: ublue-os/main#502
bsherman added a commit that referenced this issue Feb 19, 2024
A helpful issue was filed, along with a PR, which will help address some
spurious issues with the GitHub Actions workflows. That inspired me to
improve the github-release-install.sh script so that it more properly
fails (and retries) when HTTP errors occur.

In addition, this includes an improvement to the script which allows
installing specific tags, not just the latest release.

Relates: #502
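Installing a specific tag rather than the latest release maps onto two GitHub REST API endpoints: `GET /repos/{owner}/{repo}/releases/latest` and `GET /repos/{owner}/{repo}/releases/tags/{tag}`. The sketch below shows only that selection plus picking an asset URL from the release JSON (`assets[].name`, `assets[].browser_download_url`). The function names, and matching assets by filename suffix, are assumptions for illustration; this is not a description of how the script actually works.

```python
from typing import Optional
from urllib.parse import quote


def release_api_url(owner: str, repo: str, tag: Optional[str] = None) -> str:
    """Build the GitHub REST API URL for the latest release or a specific tag.

    Hypothetical helper; only the endpoint paths come from the GitHub API.
    """
    base = f"https://api.github.com/repos/{owner}/{repo}/releases"
    if tag is None:
        return f"{base}/latest"
    # Percent-encode the tag so unusual characters cannot alter the path.
    return f"{base}/tags/{quote(tag, safe='')}"


def pick_asset_url(release: dict, suffix: str) -> str:
    """Return the download URL of the first asset whose name ends with `suffix`."""
    for asset in release.get("assets", []):
        if asset.get("name", "").endswith(suffix):
            return asset["browser_download_url"]
    # Fail loudly rather than returning "", in the spirit of the change above.
    raise LookupError(f"no release asset ending in {suffix!r}")
```

For example, `release_api_url("owner", "repo", "v1.2.3")` targets the `v1.2.3` release, and omitting the tag falls back to the latest one.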
bsherman added a commit that referenced this issue Feb 22, 2024
This addresses spurious failures when pulling our (very large) base images by pre-pulling them to the build runner before using the buildah action.

Relates: #502
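A pre-pull step is also a natural place for retries, because a failed pull has no side effects. The following is a minimal sketch, assuming the runner pulls with `podman pull` (`buildah pull` would be similar) and that the later build sees the same local container storage. `pre_pull` is hypothetical, and the exponential backoff is a design choice made for this example rather than something taken from the commit.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_checked(cmd: Sequence[str]) -> None:
    """Run a command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def pre_pull(
    image: str,
    max_attempts: int = 3,
    base_delay_seconds: float = 10.0,
    run: Callable[[Sequence[str]], None] = _run_checked,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Pull `image` ahead of the build, doubling the wait after each failure.

    Returns the number of attempts used. `run` and `sleep` are injectable so
    the logic can be exercised without podman or real waiting.
    """
    cmd = ["podman", "pull", image]  # assumption: podman is available on the runner
    for attempt in range(1, max_attempts + 1):
        try:
            run(cmd)
            return attempt
        except subprocess.CalledProcessError:
            if attempt == max_attempts:
                raise
            sleep(base_delay_seconds * 2 ** (attempt - 1))  # 10s, 20s, ...
    raise AssertionError("unreachable")
```

Backing off instead of retrying at a fixed interval gives a struggling registry a little more time on each attempt, which may matter for very large images.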
bsherman added a commit to ublue-os/akmods that referenced this issue Feb 23, 2024
These steps are known to potentially fail due to
environmental/infrastructure reasons.

Retries help builds succeed despite that.

Relates: ublue-os/main#502
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jun 30, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 7, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 7, 2024
@castrojo reopened this Jul 7, 2024
@bsherman
Contributor

bsherman commented Jul 7, 2024

Actually, I think we should close this as "done", since we merged the PR at the top and have continued to add appropriate retry logic in various places throughout the project.

@bsherman closed this as completed Jul 7, 2024