Dirty exits during on_failure hooks do not consider retry limit properly #177

Open

ajvondrak opened this issue May 30, 2024 · 0 comments

Discovered a hairy issue! It involved a couple of moving parts:

  1. We had a job that was always raising a particular exception in perform due to some bad data.
  2. The exception would trigger the on_failure_retry hook to run within the child process (as you'd expect).
  3. This particular exception would not be retried, thus going through give_up and cleaning the retry key.
  4. Then, as the on_failure_retry hook continued to delegate to our other Resque::Failure backends, we had one that would cause a rapid memory leak (not resque-retry's fault).
  5. Thus, before the child's on_failure_retry could finish, the process would be SIGKILLed for running out of memory (cf. https://stackoverflow.com/a/46284715).
  6. This leaves the parent Resque worker process to invoke job.fail with a Resque::DirtyExit (cf. https://github.com/resque/resque/blob/fa72b7b30adbbe9cf15d48784df78367b0294e82/lib/resque/worker.rb#L940).
  7. So then the on_failure_retry hook would run again in the parent process with the Resque::DirtyExit, where we'd actually decide to retry it (unlike the initial exception).
  8. However, since the retry key was deleted, the `@retry_attempt` comes in as 0 here:

     ```ruby
     if exception.is_a?(Resque::DirtyExit)
       # This hook is called from a worker processes, not the job process
       # that failed with a DirtyExit, so @retry_attempt wasn't set yet
       @retry_attempt = Resque.redis.get(redis_retry_key(*args)).to_i
     ```
  9. Due to our particular failure pattern, the re-enqueued job would trigger the same exception as before in perform.
  10. Thus, an infinite loop was formed where we'd keep trying the job, failing, killing the process, then retrying based on the Resque::DirtyExit, which would never hit the retry_limit because the key is always being deleted.
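The loop above can be sketched in miniature with plain Ruby, no Resque or Redis required. The hash standing in for Redis, `RETRY_LIMIT`, `retry_key`, and the lambda names below are all hypothetical stand-ins for the plugin's internals, not resque-retry's actual API:

```ruby
RETRY_LIMIT = 3
redis = {}
retry_key = "resque-retry:BadDataJob"

# Step 3: give_up clears the retry key when the original exception is fatal.
give_up = ->(store) { store.delete(retry_key) }

# Steps 7-8: the parent worker handles the Resque::DirtyExit by re-reading
# the attempt count from Redis -- but the key was just deleted, so nil.to_i
# yields 0 every time.
handle_dirty_exit = lambda do |store|
  attempt = store[retry_key].to_i # always 0 after give_up
  if attempt < RETRY_LIMIT
    store[retry_key] = attempt + 1 # re-enqueue; count restarts at 1
    :retried
  else
    :gave_up
  end
end

# Each cycle: the job fails fatally, give_up wipes the key, the child is
# SIGKILLed, and the parent retries based on the DirtyExit. The attempt
# count never accumulates, so RETRY_LIMIT is never reached.
outcomes = 5.times.map do
  give_up.call(redis)           # child: fatal exception, key deleted
  handle_dirty_exit.call(redis) # parent: DirtyExit, reads attempt = 0
end

puts outcomes.inspect # => [:retried, :retried, :retried, :retried, :retried]
```

Every cycle decides to retry; `:gave_up` is unreachable because the counter is reset before it can ever be compared against the limit.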

I tried coming up with a fix, but couldn't figure out how to juggle the subtleties. The best I could do is drum up a regression test that reproduces the failure for me locally: ajvondrak@7e795bf

```diff
diff --git a/test/retry_test.rb b/test/retry_test.rb
index f305134..f2fde92 100644
--- a/test/retry_test.rb
+++ b/test/retry_test.rb
@@ -325,5 +325,22 @@ def test_retry_on_dirty_exit
 
       assert_nil @worker.reserve
     end
+
+    def test_dirty_exits_during_failure_hooks
+      @worker.fork_per_job = true
+      Resque.enqueue(KilledDuringFailureHooksJob)
+
+      @worker.work_one_job
+      assert_equal 1, Resque.info[:processed], 'Did not process job'
+      assert_equal 1, Resque.info[:failed], 'Job should have failed'
+      assert_equal 1, Resque.info[:pending], 'Retry should be pending'
+      refute_nil Resque.peek(:testing), 'Retry should have been enqueued'
+
+      @worker.work_one_job
+      assert_equal 2, Resque.info[:processed], 'Did not process retry'
+      assert_equal 2, Resque.info[:failed], 'Retry should have failed'
+      assert_equal 0, Resque.info[:pending], 'Another retry should not be pending'
+      assert_nil Resque.peek(:testing), 'Should have only retried once'
+    end
   end
 end
diff --git a/test/test_jobs.rb b/test/test_jobs.rb
index 0e17b62..0ff4a16 100644
--- a/test/test_jobs.rb
+++ b/test/test_jobs.rb
@@ -613,6 +613,26 @@ def self.perform(*args)
   end
 end
 
+class KilledDuringFailureHooksJob
+  extend Resque::Plugins::Retry
+  @queue = :testing
+  @fatal_exceptions = [CustomException]
+
+  def self.perform
+    raise CustomException, "failed but won't retry"
+  end
+
+  # After clearing the retry key in the on_failure_retry supermethod, kill the
+  # child job's process in a way that triggers a Resque::DirtyExit in the
+  # parent worker process. The parent process will then run the
+  # on_failure_retry hook against the Resque::DirtyExit, which should retry by
+  # default.
+  def self.on_failure_retry(exception, *)
+    super
+    Process.kill("KILL", Process.pid) if exception.is_a?(CustomException)
+  end
+end
+
 class RetryCallbacksJob
   extend Resque::Plugins::Retry
   @queue = :testing
```