Discovered a hairy issue! It involved a couple of moving parts:

1. We had a job that was always raising a particular exception in `perform` due to some bad data.
2. The exception would trigger the `on_failure_retry` hook to run within the child process (as you'd expect).
3. This particular exception would not be retried, so the hook would go through `give_up` and clean up the retry key.
4. Then, as the `on_failure_retry` hook continued to delegate to our other `Resque::Failure` backends, we had one that would cause a rapid memory leak (not resque-retry's fault).
5. Thus, before the child's `on_failure_retry` could finish, the process would be SIGKILLed for running out of memory (cf. https://stackoverflow.com/a/46284715).
6. The parent worker would then notice the dirty exit and call `job.fail` with a `Resque::DirtyExit` (cf. https://github.com/resque/resque/blob/fa72b7b30adbbe9cf15d48784df78367b0294e82/lib/resque/worker.rb#L940).
7. So the `on_failure_retry` hook would run again, this time in the parent process with the `Resque::DirtyExit`, which we'd actually decide to retry (unlike the initial exception).
8. However, since the retry key had already been deleted, `@retry_attempt` comes in as 0 here: resque-retry/lib/resque/plugins/retry.rb, lines 504 to 507 in db5dfd4.
9. Due to our particular failure pattern, the re-enqueued job would trigger the same exception as before in `perform`.
10. Thus, an infinite loop was formed: we'd keep running the job, failing, killing the process, then retrying based on the `Resque::DirtyExit`, never hitting the `retry_limit` because the retry key was always being deleted. (The sketch below traces this sequence.)
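To make the loop concrete, here is a minimal, self-contained simulation of the retry-key bookkeeping. This is only an illustration of the sequence above, not resque-retry's actual code: the hash stands in for the Redis retry key, and the names (`RETRY_LIMIT`, `simulate_one_cycle`) are made up for the sketch.

```ruby
# Toy model of the failure loop described above (illustrative only; this
# is not resque-retry's implementation).

RETRY_LIMIT = 3
retry_keys = {} # stands in for the Redis retry key

simulate_one_cycle = lambda do |key|
  # Child process: perform raises the fatal exception. on_failure_retry
  # decides not to retry, so give_up deletes the retry key.
  retry_keys.delete(key)

  # The child is then SIGKILLed mid-hook (the memory leak), so the parent
  # worker sees a dirty exit and runs on_failure_retry again with a
  # Resque::DirtyExit.

  # Parent process: the key is gone, so the attempt count reads back as 0
  # (nil.to_i == 0, mirroring how @retry_attempt comes in as 0).
  attempt = retry_keys[key].to_i

  if attempt < RETRY_LIMIT
    retry_keys[key] = attempt + 1 # re-enqueue; the counter never exceeds 1
    :retried
  else
    :gave_up # unreachable: attempt is always 0 by the time we check
  end
end

key = 'resque-retry:SomeJob'
results = Array.new(10) { simulate_one_cycle.call(key) }
p results.tally # => {:retried=>10} -- the cycle never terminates
```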
I tried coming up with a fix, but couldn't figure out how to juggle the subtleties. The best I could do was drum up a regression test that reproduces the failure for me locally: ajvondrak@7e795bf
```diff
diff --git a/test/retry_test.rb b/test/retry_test.rb
index f305134..f2fde92 100644
--- a/test/retry_test.rb
+++ b/test/retry_test.rb
@@ -325,5 +325,22 @@ def test_retry_on_dirty_exit
     assert_nil @worker.reserve
   end
+
+  def test_dirty_exits_during_failure_hooks
+    @worker.fork_per_job = true
+    Resque.enqueue(KilledDuringFailureHooksJob)
+
+    @worker.work_one_job
+    assert_equal 1, Resque.info[:processed], 'Did not process job'
+    assert_equal 1, Resque.info[:failed], 'Job should have failed'
+    assert_equal 1, Resque.info[:pending], 'Retry should be pending'
+    refute_nil Resque.peek(:testing), 'Retry should have been enqueued'
+
+    @worker.work_one_job
+    assert_equal 2, Resque.info[:processed], 'Did not process retry'
+    assert_equal 2, Resque.info[:failed], 'Retry should have failed'
+    assert_equal 0, Resque.info[:pending], 'Another retry should not be pending'
+    assert_nil Resque.peek(:testing), 'Should have only retried once'
+  end
 end
 end
diff --git a/test/test_jobs.rb b/test/test_jobs.rb
index 0e17b62..0ff4a16 100644
--- a/test/test_jobs.rb
+++ b/test/test_jobs.rb
@@ -613,6 +613,26 @@ def self.perform(*args)
   end
 end
+
+class KilledDuringFailureHooksJob
+  extend Resque::Plugins::Retry
+  @queue = :testing
+  @fatal_exceptions = [CustomException]
+
+  def self.perform
+    raise CustomException, "failed but won't retry"
+  end
+
+  # After clearing the retry key in the on_failure_retry supermethod, kill the
+  # child job's process in a way that triggers a Resque::DirtyExit in the
+  # parent worker process. The parent process will then run the
+  # on_failure_retry hook against the Resque::DirtyExit, which should retry by
+  # default.
+  def self.on_failure_retry(exception, *)
+    super
+    Process.kill("KILL", Process.pid) if exception.is_a?(CustomException)
+  end
+end
+
 class RetryCallbacksJob
   extend Resque::Plugins::Retry
   @queue = :testing
```
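If you want to reproduce this yourself, I believe running just that test with the repo's usual minitest setup (and a local Redis) would look something like `bundle exec ruby -Itest test/retry_test.rb -n test_dirty_exits_during_failure_hooks`, though the exact invocation may vary with your environment. Under the bug, the assertions after the second `work_one_job` should fail, since the `Resque::DirtyExit` path keeps re-enqueueing the job instead of giving up.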