Config to automatically Re-trigger failed periodics #268
Comments
/cc @stbenjam
Thanks, this is great! The only question I have is whether, when the job fails, we should unconditionally run it 3 more times or only rerun it until it succeeds. The former would help offset the bad signal and give us more confidence in the job's reliability. Maybe configurable?
retrigger-failed-run:
  strategy: until_success | run_all
  attempts: 3
  interval: 6h
/cc @deads2k
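To make the difference between the two proposed strategies concrete, here is a minimal sketch of the decision they imply. It is not Prow code: `RetriggerPolicy` and `should_retrigger` are hypothetical names, and the only inputs assumed are the `strategy` and `attempts` values from the config suggested above plus the results of reruns already completed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetriggerPolicy:
    """Hypothetical mirror of the proposed retrigger-failed-run config."""
    strategy: str  # "until_success" or "run_all"
    attempts: int  # maximum number of reruns after the original failure


def should_retrigger(policy: RetriggerPolicy, rerun_results: List[bool]) -> bool:
    """Decide whether another rerun is due.

    rerun_results holds one entry per rerun completed since the original
    failure (True = passed, False = failed).
    """
    if policy.strategy not in ("until_success", "run_all"):
        raise ValueError(f"unknown strategy: {policy.strategy}")
    # Both strategies stop once the attempt budget is used up.
    if len(rerun_results) >= policy.attempts:
        return False
    # until_success stops at the first passing rerun; run_all keeps going.
    if policy.strategy == "until_success" and any(rerun_results):
        return False
    return True
```

With `attempts: 3` and one failed rerun followed by one passing rerun, `until_success` stops while `run_all` still schedules the third rerun.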
Configurable would be good.
Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.
Almost certainly not plank. Plank consumes Prowjobs and should not create them (unless you'd do the re-runs as additional Pods for a single Prowjob, for which we would need to rethink big parts of e.g. artifact reporting). I believe this belongs to horologium, especially with the …
There are more fun interactions to resolve, like how do the retests interact with standard interval-triggered periodics? Would they delay them?
For infrequently run jobs, we want to be able to still detect subtler regressions. If a developer makes an existing test go from a 99% to a 50% pass rate, we'll eventually get a failure on weekly runs. That's our first hint there's something wrong, but we need more than 1 additional attempt to confirm. We'd get it eventually, but it could take a month or more. The unconditional attempts are a signal booster.
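As a rough illustration of the signal-booster argument, the arithmetic can be sketched as below. It assumes each run passes independently with a fixed pass rate, which is a simplification; the 99% and 50% figures come from the comment above, and `p_at_least_one_failure` is a name made up for this sketch.

```python
def p_at_least_one_failure(pass_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - pass_rate ** runs


# Three unconditional reruns after the first failure:
regressed = p_at_least_one_failure(0.50, 3)  # job that regressed to 50%
healthy = p_at_least_one_failure(0.99, 3)    # job still at 99%

print(f"regressed job shows another failure: {regressed:.3f}")
print(f"healthy job shows another failure:   {healthy:.3f}")
```

Under these assumptions the regressed job fails again in about 87.5% of cases, against roughly 3% for the healthy job, so three reruns separate the two cases far better than one. A single rerun of the regressed job would pass half the time and confirm nothing.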
Okay, that makes sense, I see the value now 👍 It helps to amplify subtle decreases in reliability while saving resources because jobs that we think are solid may not need to run as often.
OpenShift has certain infra related periodics that run on a daily (or similar) frequency. This is only because the jobs are sometimes flaky, and the subsequent run will pass. The frequency could be reduced to weekly if there was a guarantee that the job would be retried a number of times if it fails.
A new config could be added to support automatically re-triggering a periodic ProwJob only in the case that it fails. It would accept the number of times to retry, and the interval at which to trigger the re-run. Something like the following to retrigger a failed job 3 times, 6 hours apart:
Implementation details:
I think that it may be possible for plank to handle the retriggers.
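For illustration only, the "3 times, 6 hours apart" behaviour could translate into trigger times roughly like this. The sketch assumes the interval is counted from the completion of the original failed run, which the proposal does not specify; `retrigger_times` is a hypothetical helper, not part of Prow, and the date is arbitrary.

```python
from datetime import datetime, timedelta
from typing import List


def retrigger_times(failed_at: datetime, attempts: int,
                    interval: timedelta) -> List[datetime]:
    """Return the times at which reruns would be triggered.

    Assumption: rerun i fires i * interval after the original failure,
    regardless of how earlier reruns turned out.
    """
    return [failed_at + interval * i for i in range(1, attempts + 1)]


failed_at = datetime(2024, 1, 1, 0, 0)
for t in retrigger_times(failed_at, attempts=3, interval=timedelta(hours=6)):
    print(t.isoformat())
```

Whether the interval should instead be measured from the end of each rerun, and how these times interact with the job's regular schedule, is left open here.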