Skip to content
This repository has been archived by the owner on Nov 16, 2021. It is now read-only.

Differentiate retries for NOT_OPEN and TIMED_OUT transport exceptions #56

Open
ryangreenberg opened this issue Mar 13, 2014 · 9 comments

Comments

@ryangreenberg
Copy link
Contributor

The underlying thrift Ruby library raises TransportException under a variety of circumstances:

From thrift's socket.rb:

raise TransportException.new(TransportException::NOT_OPEN, "Connection timeout to #{@desc}")
raise TransportException.new(TransportException::NOT_OPEN, "Could not connect to #{@desc}: #{e}")
raise TransportException.new(TransportException::TIMED_OUT, "Socket: Timed out writing #{str.length} bytes to #{@desc}")
raise TransportException.new(TransportException::NOT_OPEN, e.message)
raise TransportException.new(TransportException::TIMED_OUT, "Socket: Timed out reading #{sz} bytes from #{@desc}")
raise TransportException.new(TransportException::NOT_OPEN, e.message)
raise TransportException.new(TransportException::UNKNOWN, "Socket: Could not read #{sz} bytes from #{@desc}")

The type of transport exception is a property of the exception instance (NOT_OPEN, TIMED_OUT, or UNKNOWN). Currently thrift_client provides the ability to retry only at the level of an exception class (e.g. TransportException). For non-idempotent requests, TIMED_OUT and UNKNOWN are not safe to retry, but NOT_OPEN should be safe. An ideal solution would be to have the Ruby thrift library itself raise subclasses of TransportException. In the interim, does this seem like a reasonable retry strategy to add to thrift_client? If so, I'm happy to do the implementation but would like some input on a good interface for configuring ThriftClient.

Some ideas I had:

  • Add a new top-level option that lets users give a block to determine whether the exception should be retried
  • Don't add any new options, but have thrift_client instead convert Thrift::TransportException to ThriftClient::TransportException::NotOpen, ThriftClient::TransportException::TimedOut, and ThriftClient::TransportException::Unknown all of which subclass Thrift::TransportException, preserving existing behavior.
@ryangreenberg
Copy link
Contributor Author

@klingerf and @sprsquish, I know this library isn't something that you work on day-to-day, but if you have a moment I'd love to hear your take on this issue.

@sprsquish
Copy link
Contributor

I think the second option might be your best bet here as it maintains backward compatibility.

@matasar
Copy link
Contributor

matasar commented Mar 13, 2014

Wouldn't the block option also preserve compatibility? I think that would be a better API in general, and it would be less likely to break if people are directly comparing exception classes with ==.

@sprsquish
Copy link
Contributor

What does the API look like in that case?

@ryangreenberg
Copy link
Contributor Author

Current options (from the gem docs):

:exception_classes: Which exceptions to catch and retry when sending a request. Defaults to [IOError, Thrift::Exception, Thrift::ApplicationException, Thrift::TransportException, NoServersAvailable]
exception_class_overrides: For specifying children of classes in exception_classes for which you don't want to retry or reconnect.

Ways to allow a block for retry behavior and maintain backwards compatability:

  • accept either an array or a block for exception_classes (let's call this option A)
  • add a new optional parameter, name TBD, like retry_exception_if (option B)

In order to use this new syntax to retry NOT_OPEN TransportException and raise TIMED_OUT TransportException, I would do something like:

Option A.

@client = ThriftClient.new(
  MyCustomService::Client,
  servers,
  :transport_wrapper => Thrift::FramedTransport,
  :timeout           => 0.500,
  :connect_timeout   => 0.50,
  :exception_classes => lambda do |ex|
    # whatever logic you want here
    [IOError, Thrift::ApplicationException, Thrift::TransportException, NoServersAvailable].any? {|ea| ex.is_a?(ea) } ||
      ( ex.is_a?(TransportException) && ex.type == NOT_OPEN )
  end,
  :raise             => true,
)

Option B.

@client = ThriftClient.new(
  MyCustomService::Client,
  servers,
  :transport_wrapper => Thrift::FramedTransport,
  :timeout           => 0.500,
  :connect_timeout   => 0.50,
  :exception_classes => [IOError, Thrift::ApplicationException, Thrift::TransportException, NoServersAvailable], # have to provide so defaults with Thrift::Transport aren't used
  :retry_exception_if  => lambda do |ex|
    ( ex.is_a?(TransportException) && ex.type == NOT_OPEN )
  end
)

I'm very flexible on the actual key name retry_exception_if in option B.

I have a slight preference for A, just because I think exception_classes + exception_class_overrides + retry_exception_if is a more confusing interface for users to consider, but I can always be convinced otherwise.

@sprsquish
Copy link
Contributor

I think A is a bit confusing, I'd expect something called "exception_classes" to contain a list of classes. How about option B where, if the block is provided then exception_classes is ignored? That would provide backward compatibility.

Given that you want to change the default behavior though, you'd have to add the new exception classes rather than using the block. Maybe a hybrid approach: do the exception unwrapping, add the new exceptions to the default list, and add an option to provide a block that overrides exception_classes completely.

@ryangreenberg
Copy link
Contributor Author

@sprsquish good point--providing something other than a list of classes as exception_classes is weird. Let me first try pursuing this from another angle to see if the Thrift project is receptive to the idea of adding Thrift::TransportException::NotOpen and Thrift::TransportException::TimedOut to the underlying library.

@ryangreenberg
Copy link
Contributor Author

@divtxt
Copy link

divtxt commented Feb 12, 2015

Please consider stale idle connections in the group of safe to retry. I have the following scenario:

  • long-running unicorn process talks to long-running thrift server API
  • the unicorn process keeps the socket connection open and reuses it for future calls as they happen
  • thrift server is restarted
  • unicorn process tries to use the old stale connection, gets an error and incorrectly considers that server dead/error

Here the client should check for a remotely closed idle connection and treat it like NOT_OPEN.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

No branches or pull requests

4 participants