Config to automatically Re-trigger failed periodics #268

Open
smg247 opened this issue Sep 5, 2024 · 6 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@smg247
Contributor

smg247 commented Sep 5, 2024

OpenShift has certain infra-related periodics that run daily (or at a similar frequency). They run that often only because the jobs are sometimes flaky, and the subsequent run will pass. The frequency could be reduced to weekly if a failed job were guaranteed to be retried a number of times.

A new config could be added to support automatically re-triggering a periodic ProwJob only when it fails. It would accept the number of times to retry and the interval between re-runs. Something like the following would retrigger a failed job 3 times, 6 hours apart:

retrigger-failed-run:
  attempts: 3
  interval: 6h
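
To make the proposed semantics concrete, here is a minimal Python sketch of how such a stanza could translate into a re-run schedule. It is not Prow code: the `RetriggerConfig` class and the `retrigger_times` function are hypothetical names, and the sketch assumes the interval is measured from the time of the original failure with fixed spacing, which the proposal does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class RetriggerConfig:
    """Hypothetical in-memory form of the proposed `retrigger-failed-run` stanza."""
    attempts: int        # maximum number of re-runs after a failure
    interval: timedelta  # spacing between consecutive re-runs


def retrigger_times(failed_at: datetime, cfg: RetriggerConfig) -> List[datetime]:
    """Return the times at which re-runs would be triggered after a failure.

    Assumption: re-runs are spaced `interval` apart, counted from `failed_at`.
    """
    return [failed_at + cfg.interval * n for n in range(1, cfg.attempts + 1)]


if __name__ == "__main__":
    # attempts: 3, interval: 6h, as in the example above
    cfg = RetriggerConfig(attempts=3, interval=timedelta(hours=6))
    for t in retrigger_times(datetime(2024, 9, 5, 0, 0), cfg):
        print(t.isoformat())
```

With a failure at midnight, this yields re-runs at 06:00, 12:00, and 18:00 the same day. Whether later re-runs are skipped once one passes is a separate policy question.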

Implementation details:
I think that it may be possible for plank to handle the retriggers.

@smg247
Contributor Author

smg247 commented Sep 5, 2024

/cc @stbenjam

@stbenjam
Contributor

stbenjam commented Sep 5, 2024

Thanks, this is great!

The only question I have is whether, when the job fails, we should unconditionally run it 3 more times or only re-run it until it succeeds. The former would help offset the bad signal and give us more confidence in the job's reliability.

Maybe configurable?

retrigger-failed-run:
  strategy: until_success | run_all
  attempts: 3
  interval: 6h
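
The difference between the two strategies can be sketched as a small decision function. This is only an illustration in Python, not Prow's implementation: `Strategy` and `should_retrigger` are hypothetical names, and the sketch assumes it is consulted after each run in a series that was started by a failure.

```python
from enum import Enum


class Strategy(Enum):
    UNTIL_SUCCESS = "until_success"  # stop re-running as soon as one run passes
    RUN_ALL = "run_all"              # always use up every configured attempt


def should_retrigger(strategy: Strategy, attempts: int,
                     reruns_done: int, last_run_passed: bool) -> bool:
    """Decide whether another re-run is due.

    `attempts` is the configured maximum number of re-runs;
    `reruns_done` counts re-runs already triggered in this series.
    """
    if reruns_done >= attempts:
        return False  # budget exhausted under either strategy
    if strategy is Strategy.UNTIL_SUCCESS:
        return not last_run_passed
    return True  # RUN_ALL ignores the result of the last run
```

For example, after a first re-run that passes, `until_success` stops the series while `run_all` still schedules the remaining two attempts.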

/cc @deads2k

@deads2k
Contributor

deads2k commented Sep 5, 2024

Configurable would be good.

  1. Sometimes we definitely want three more runs.
  2. Sometimes we want to run up to three more times until one succeeds.

@petr-muller
Contributor

Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

> I think that it may be possible for plank to handle the retriggers

Almost certainly not plank. Plank consumes ProwJobs; it should not create them (unless you'd do the re-runs as additional Pods for a single ProwJob, for which we would need to rethink big parts of e.g. artifact reporting). I believe this belongs in horologium, especially with the interval: 6h config. We'd probably need some horologium-specific annotations on ProwJobs to recognize their position in a retest series and to prevent each subsequent failure from causing a new round of retests.
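
A rough Python sketch of the annotation idea follows. Everything in it is an assumption for illustration: the annotation key is made up and is not a real Prow annotation, `next_retest_annotations` is a hypothetical helper, and only the until-success behavior is modeled. The point is that a job's position in the series travels with the job, so a failing retest continues the existing series instead of starting a new one.

```python
from typing import Dict, Optional

# Hypothetical annotation key, not an actual Prow or horologium annotation.
ATTEMPT_KEY = "example.test/retrigger-attempt"


def next_retest_annotations(annotations: Dict[str, str], failed: bool,
                            max_attempts: int) -> Optional[Dict[str, str]]:
    """Return annotations for the next retest job, or None if none is due.

    A job without the annotation is a regular scheduled run (position 0).
    """
    attempt = int(annotations.get(ATTEMPT_KEY, "0"))
    if not failed:
        return None  # until-success: a passing run ends the series
    if attempt >= max_attempts:
        return None  # series exhausted; this failure must not start a new round
    return {ATTEMPT_KEY: str(attempt + 1)}
```

A failed scheduled run yields a retest marked as attempt 1, while a failed retest already marked as attempt 3 (with `max_attempts=3`) yields nothing.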

There are more fun interactions to resolve, like how do the retests interact with standard interval-triggered periodics? Would they delay them?

@stbenjam
Contributor

stbenjam commented Sep 6, 2024

> Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

For infrequently run jobs, we still want to be able to detect subtler regressions. If a developer makes an existing test go from a 99% to a 50% pass rate, we'll eventually get a failure on weekly runs -- that's our first hint there's something wrong, but we need more than 1 additional attempt to confirm. We'd get there eventually, but it could take a month+. The unconditional attempts are a signal booster.
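
The arithmetic behind this can be sketched in a few lines of Python. The sketch assumes each run passes independently with a fixed pass rate, which is a simplification, and `p_additional_failure` is a hypothetical helper name.

```python
def p_additional_failure(pass_rate: float, reruns: int) -> float:
    """Probability that at least one of `reruns` independent re-runs fails."""
    return 1.0 - pass_rate ** reruns


if __name__ == "__main__":
    # Compare a healthy job with a regressed one over 3 unconditional re-runs.
    for rate in (0.99, 0.50):
        print(f"pass rate {rate:.0%}: {p_additional_failure(rate, 3):.1%}")
```

Under these assumptions, three unconditional re-runs of a job at a 99% pass rate show another failure about 3% of the time, while at a 50% pass rate they do so 87.5% of the time, so a second failure in the series is strong evidence of a real regression.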

@petr-muller
Contributor

petr-muller commented Sep 6, 2024

Okay, that makes sense, I see the value now 👍 It helps to amplify subtle decreases in reliability while saving resources because jobs that we think are solid may not need to run as often.

@petr-muller petr-muller added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 25, 2024