Dirty exits during on_failure hooks do not consider retry limit properly #177

Open

ajvondrak opened this issue May 30, 2024 · 0 comments

Discovered a hairy issue! It involved a couple of moving parts:

  1. We had a job that was always raising a particular exception in perform due to some bad data.
  2. The exception would trigger the on_failure_retry hook to run within the child process (as you'd expect).
  3. This particular exception would not be retried, thus going through give_up and cleaning the retry key.
  4. Then, as the on_failure_retry hook continued to delegate to our other Resque::Failure backends, we had one that would cause a rapid memory leak (not resque-retry's fault).
  5. Thus, before the child's on_failure_retry could finish, the process would be SIGKILLed for running out of memory (cf. https://stackoverflow.com/a/46284715).
  6. This leaves the parent Resque worker process to invoke job.fail with a Resque::DirtyExit (cf. https://github.com/resque/resque/blob/fa72b7b30adbbe9cf15d48784df78367b0294e82/lib/resque/worker.rb#L940).
  7. So then the on_failure_retry hook would run again in the parent process with the Resque::DirtyExit, where we'd actually decide to retry it (unlike the initial exception).
  8. However, since the retry key was deleted, the `@retry_attempt` comes in as 0 here:

     ```ruby
     if exception.is_a?(Resque::DirtyExit)
       # This hook is called from a worker processes, not the job process
       # that failed with a DirtyExit, so @retry_attempt wasn't set yet
       @retry_attempt = Resque.redis.get(redis_retry_key(*args)).to_i
     ```
  9. Due to our particular failure pattern, the re-enqueued job would trigger the same exception as before in perform.
  10. Thus, an infinite loop was formed where we'd keep trying the job, failing, killing the process, then retrying based on the Resque::DirtyExit, which would never hit the retry_limit because the key is always being deleted.
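The loop above can be sketched in miniature with plain Ruby, no Resque or Redis required. The hash standing in for Redis, `RETRY_LIMIT`, `retry_key`, and the lambda names below are all hypothetical stand-ins for the plugin's internals, not resque-retry's actual API:

```ruby
RETRY_LIMIT = 3
redis = {}
retry_key = "resque-retry:BadDataJob"

# Step 3: give_up clears the retry key when the original exception is fatal.
give_up = ->(store) { store.delete(retry_key) }

# Steps 7-8: the parent worker handles the Resque::DirtyExit by re-reading
# the attempt count from Redis -- but the key was just deleted, so nil.to_i
# yields 0 every time.
handle_dirty_exit = lambda do |store|
  attempt = store[retry_key].to_i # always 0 after give_up
  if attempt < RETRY_LIMIT
    store[retry_key] = attempt + 1 # re-enqueue; count restarts at 1
    :retried
  else
    :gave_up
  end
end

# Each cycle: the job fails fatally, give_up wipes the key, the child is
# SIGKILLed, and the parent retries based on the DirtyExit. The attempt
# count never accumulates, so RETRY_LIMIT is never reached.
outcomes = 5.times.map do
  give_up.call(redis)           # child: fatal exception, key deleted
  handle_dirty_exit.call(redis) # parent: DirtyExit, reads attempt = 0
end

puts outcomes.inspect # => [:retried, :retried, :retried, :retried, :retried]
```

Every cycle decides to retry; `:gave_up` is unreachable because the counter is reset before it can ever be compared against the limit.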

I tried coming up with a fix, but couldn't figure out how to juggle the subtleties. The best I could do is drum up a regression test that reproduces the failure for me locally: ajvondrak@7e795bf

```diff
diff --git a/test/retry_test.rb b/test/retry_test.rb
index f305134..f2fde92 100644
--- a/test/retry_test.rb
+++ b/test/retry_test.rb
@@ -325,5 +325,22 @@ def test_retry_on_dirty_exit
 
       assert_nil @worker.reserve
     end
+
+    def test_dirty_exits_during_failure_hooks
+      @worker.fork_per_job = true
+      Resque.enqueue(KilledDuringFailureHooksJob)
+
+      @worker.work_one_job
+      assert_equal 1, Resque.info[:processed], 'Did not process job'
+      assert_equal 1, Resque.info[:failed], 'Job should have failed'
+      assert_equal 1, Resque.info[:pending], 'Retry should be pending'
+      refute_nil Resque.peek(:testing), 'Retry should have been enqueued'
+
+      @worker.work_one_job
+      assert_equal 2, Resque.info[:processed], 'Did not process retry'
+      assert_equal 2, Resque.info[:failed], 'Retry should have failed'
+      assert_equal 0, Resque.info[:pending], 'Another retry should not be pending'
+      assert_nil Resque.peek(:testing), 'Should have only retried once'
+    end
   end
 end
diff --git a/test/test_jobs.rb b/test/test_jobs.rb
index 0e17b62..0ff4a16 100644
--- a/test/test_jobs.rb
+++ b/test/test_jobs.rb
@@ -613,6 +613,26 @@ def self.perform(*args)
   end
 end
 
+class KilledDuringFailureHooksJob
+  extend Resque::Plugins::Retry
+  @queue = :testing
+  @fatal_exceptions = [CustomException]
+
+  def self.perform
+    raise CustomException, "failed but won't retry"
+  end
+
+  # After clearing the retry key in the on_failure_retry supermethod, kill the
+  # child job's process in a way that triggers a Resque::DirtyExit in the
+  # parent worker process. The parent process will then run the
+  # on_failure_retry hook against the Resque::DirtyExit, which should retry by
+  # default.
+  def self.on_failure_retry(exception, *)
+    super
+    Process.kill("KILL", Process.pid) if exception.is_a?(CustomException)
+  end
+end
+
 class RetryCallbacksJob
   extend Resque::Plugins::Retry
   @queue = :testing
```