Config to automatically Re-trigger failed periodics #268

smg247 · 2024-09-05T14:28:18Z

OpenShift has certain infra-related periodics that run daily (or at a similar frequency). They run this often only because the jobs are sometimes flaky, and the subsequent run will pass. The frequency could be reduced to weekly if there were a guarantee that a failed job would be retried a number of times.

A new config could be added to support automatically re-triggering a periodic ProwJob only when it fails. It would accept the number of times to retry and the interval at which to trigger each re-run. For example, the following would retrigger a failed job 3 times, 6 hours apart:

retrigger-failed-run:
  attempts: 3
  interval: 6h
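
To make the proposed semantics concrete, here is a minimal Python sketch of how the two fields might translate into a retry schedule. The helper names `parse_interval` and `retry_schedule` are hypothetical, the parser only handles simple forms such as `6h` or `30m`, and the sketch assumes the interval is measured from the time of the original failure. None of this reflects an existing Prow API.

```python
import re
from datetime import datetime, timedelta

# Map the unit suffix to the matching timedelta keyword argument.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_interval(text):
    """Parse a simple duration such as '6h' or '30m' (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def retry_schedule(failed_at, attempts, interval):
    """Return the times at which retries would be triggered after a failure.

    Assumption: retry n fires n intervals after the original failure.
    """
    step = parse_interval(interval)
    return [failed_at + step * n for n in range(1, attempts + 1)]


# A job that failed at midnight would be retried at 06:00, 12:00 and 18:00.
schedule = retry_schedule(datetime(2024, 9, 5, 0, 0), attempts=3, interval="6h")
```

Whether the interval should instead be counted from the completion of the previous retry is an open design choice that this sketch does not settle.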

Implementation details:
I think that it may be possible for plank to handle the retriggers.

smg247 · 2024-09-05T14:28:41Z

/cc @stbenjam

stbenjam · 2024-09-05T16:46:26Z

Thanks, this is great!

The only question I have is whether, when the job fails, we should unconditionally run it 3 more times, or only re-run it until it succeeds. The former would help offset the bad signal and give us more confidence in the job's reliability.

Maybe configurable?

retrigger-failed-run:
  strategy: until_success | run_all
  attempts: 3
  interval: 6h
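
A rough Python sketch of how the two strategy values might differ in practice. The function name `should_retrigger` and the shape of its arguments are assumptions made for illustration; `results` holds the outcomes of the retries that have already finished after the initial failure.

```python
def should_retrigger(strategy, attempts, results):
    """Decide whether another retry should be scheduled.

    strategy: "until_success" or "run_all" (the values proposed above)
    attempts: configured maximum number of retries
    results:  outcomes of retries completed so far, True meaning passed
    """
    if len(results) >= attempts:
        # The retry budget is spent regardless of strategy.
        return False
    if strategy == "run_all":
        # Always use the whole budget, even after a pass.
        return True
    if strategy == "until_success":
        # Stop as soon as any retry has passed.
        return not any(results)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With `attempts: 3`, a retry history of fail then pass stops under `until_success` but schedules one more run under `run_all`.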

/cc @deads2k

deads2k · 2024-09-05T18:18:57Z

Configurable would be good.

Sometimes we definitely want three more runs.
Sometimes we want to run up to three more times until one succeeds.

petr-muller · 2024-09-06T09:56:11Z

Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

> I think that it may be possible for plank to handle the retriggers

Almost certainly not plank. Plank consumes ProwJobs; it should not create them (unless you'd do the re-runs as additional Pods for a single ProwJob, for which we would need to rethink big parts of e.g. artifact reporting). I believe this belongs in horologium, especially with the interval: 6h config. We'd probably need some horologium-specific annotations on ProwJobs to recognize their position in a retest series and prevent each subsequent failure from causing a new round of retests.
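
As a sketch of the annotation idea, assuming a hypothetical annotation key and a simplified dict representation of a finished job (neither is an existing Prow convention), the position in a retest series could be carried forward like this:

```python
# Hypothetical annotation name, used only for this illustration.
ATTEMPT_KEY = "example.invalid/retrigger-attempt"


def next_retest_annotations(job, max_attempts):
    """Return annotations for the next retest, or None if none should be created.

    job is a simplified dict with "state" and optional "annotations".
    A job without the annotation is treated as the original scheduled run.
    """
    if job["state"] != "failure":
        return None
    attempt = int(job.get("annotations", {}).get(ATTEMPT_KEY, "0"))
    if attempt >= max_attempts:
        # The series is exhausted; a failing retest must not start a new round.
        return None
    return {ATTEMPT_KEY: str(attempt + 1)}
```

Because each retest carries its own attempt number, a failure of the last retest yields no further jobs. This version only continues the series on failure; an unconditional variant would drop the state check for jobs that already carry the annotation.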

There are more fun interactions to resolve, like how do the retests interact with standard interval-triggered periodics? Would they delay them?

stbenjam · 2024-09-06T13:39:56Z

> Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

For infrequently run jobs, we still want to be able to detect subtler regressions. If a developer makes an existing test go from a 99% to a 50% pass rate, we'll eventually get a failure on weekly runs -- that's our first hint there's something wrong, but we need more than 1 additional attempt to confirm. We'd get it eventually, but it could take a month+. The unconditional attempts are a signal booster.
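
The arithmetic behind the signal-booster argument can be sketched in a few lines of Python, assuming each run passes independently with a fixed pass rate (a simplification; real flakes are often correlated). The function name is made up for this example.

```python
def p_at_least_one_failure(pass_rate, runs):
    """Probability that at least one of `runs` independent runs fails."""
    return 1 - pass_rate ** runs


# Three unconditional reruns after an initial failure:
healthy = p_at_least_one_failure(0.99, 3)    # about 0.03
regressed = p_at_least_one_failure(0.50, 3)  # 0.875
```

Under these assumptions, a job that is still 99% reliable will rarely show a second failure in its three reruns, while a job that has dropped to 50% will show one most of the time, so the reruns separate the two cases within a day instead of several weekly cycles.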

petr-muller · 2024-09-06T13:51:33Z

Okay, that makes sense, I see the value now 👍 It helps to amplify subtle decreases in reliability while saving resources because jobs that we think are solid may not need to run as often.

petr-muller added the kind/feature label on Sep 25, 2024
