Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove requests from public interfaces #157

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

Mr0grog
Copy link
Member

@Mr0grog Mr0grog commented Dec 30, 2023

WaybackSession was originally a subclass of requests.Session, and that turns out to have been a poor decision (see #152 (comment)). We wanted to make customizing HTTP behavior relatively straightforward, but using a subclass-able version of requests.Session wound up baking problematic parts of requests into our API. In particular, requests is not thread-safe and this setup makes it hard to wrap with thread-safe code. We also unintentionally broke the promise of easy subclassing when we added complex retry and rate limiting behavior.

This is a major, breaking change designed to largely remove requests from our public API:

  • WaybackSession is now its own, standalone class instead of a subclass of requests.Session. It handles stuff like rate limiting, retries, etc.

  • WaybackHttpAdapter handles making a single HTTP request and is the interface through which we use the requests library.

    • It has a very small API (2 methods: .request() sends an HTTP request; .close() closes all its network connections).

    • The .request() method returns a WaybackHttpResponse object, which models just the parts of the response we need. The WaybackHttpResponse class is an abstract base class that represents the API that the rest of Wayback needs, and internally we have a WaybackRequestsResponse subclass that wraps a response object from Requests in a thread-safe way.

    • Ideally, the small API will make it easy for us to build an alternate implementation with urllib3, httpx, etc, or just to add thread-safe handling of requests. It might also be a better way to let users customize HTTP behavior than the old approach, although you can’t yet supply a custom subclass to WaybackClient or WaybackSession.

Open question: I don’t think that WaybackSession should continue to exist. It makes the API complex and it lets the details of requests leak out a little bit through exception handling. Any feedback on this is welcome (cc @danielballan). Some ideas:

  1. Move its functionality directly into WaybackHttpAdapter.

    • Upside: this gets all the requests stuff — including exception handling — tucked away into one place.
    • Downside: Safely customizing WaybackHttpAdapter gets more complex because you have to work around all that machinery. It’s not as straightforward as just overriding the .request() method. Some ways to address that:
      • Add a WaybackHttpAdapter._request() method (not the underscore) that you can override if you just want to customize the sending of a single HTTP request.
      • Re-architect that functionality into some kind of service/helper object that you can use from a customized WaybackHttpAdapter subclass. I think this might be a lot more work than it's worth, though.
  2. Move its functionality into WaybackClient.

    • Upside: keeps all the things that are special about dealing with HTTP requests at a high level out of the lower-level adapter tooling. Keeps adapters as simple as possible.
    • Downside: Lets some exception handling that is specific to requests leak out into the rest of the package. Makes it harder to change that behavior. What if httpx or some other tool incorporates rate limiting in the future? It would be nice to let an adapter delegate those tasks to the lower-level tool it is acting as glue for.

I think I’m leaning towards some version of (1) here.

This paves the way towards fixing #58, but does not actually solve the issue (WaybackHttpAdapter is not yet thread-safe, and needs to use some of the tricks from #23).

# Customize the built-in `send()` with retryability and rate limiting.
def send(self, request: requests.PreparedRequest, **kwargs):
def request(self, method, url, timeout=-1, **kwargs) -> WaybackHttpResponse:
super().request()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This super() call is really just to keep _utils.DisableAfterCloseSession working, and I think the reality here is that we should probably just get rid of that utility. It provides some nice guardrails, but they don’t offer a huge amount of value.

Comment on lines 445 to 549
def reset(self):
"Reset any network connections the session is using."
# Close really just closes all the adapters in `self.adapters`. We
# could do that directly, but `self.adapters` is not documented/public,
# so might be somewhat risky.
self.close(disable=False)
# Re-build the standard adapters. See:
# https://github.com/kennethreitz/requests/blob/v2.22.0/requests/sessions.py#L415-L418
self.mount('https://', requests.adapters.HTTPAdapter())
self.mount('http://', requests.adapters.HTTPAdapter())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I added this to help with some problems we were having managing connections in BIG jobs at EDGI, and that were later solved with better fixes. It’s kind of a weird method, and its presence is even weirder with the refactoring here. I think I should just drop it entirely.

src/wayback/_http.py Outdated Show resolved Hide resolved
Comment on lines +260 to +265
timeout : int or float or tuple of (int or float, int or float), optional
How long to wait, in seconds, before timing out. If this is a single
number, it will be used as both a connect timeout (how long to wait
for the first bit of response data) and a read timeout (how long
to wait between bits of response data). If a tuple, the values will
be used as the connect and read timeouts, respectively.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m a little worried this is too narrowly designed around how requests does it. It's probably not a huge deal, though, and keeps the timeout interface the same-ish as it is today.

assert response.ok
assert response.is_success
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is a bit arbitrary, but I was inspired by httpx's reasoning for the same change: https://www.python-httpx.org/compatibility/#checking-for-success-and-failure-responses

Considering whether we should also do this in the Memento class (maybe just a deprecation of .ok there, though).

Comment on lines +82 to +94
class WaybackHttpResponse:
"""
Represents an HTTP response from a server. This might be included as an
attribute of an exception, but should otherwise not be exposed to user
code in normal circumstances. It's meant to wrap to provide a standard,
thread-safe interface to response objects from whatever underlying HTTP
tool is being used (e.g. requests, httpx, etc.).
"""
status_code: int
headers: dict
encoding: Optional[str] = None
url: str
links: dict
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this class should move to _models, especially if we make customizing WaybackHttpAdapter a thing. (Since you’ll need to make your own subclass of this to return from an adapter that doesn’t use Requests.)

Sessions currently handle rate limiting and retries (and *need* to handle redirects so those get rate limited; it's a bug that they do not) and their implementation is a bit complicated. This separates out each of those jobs into a separate method to make management of those separate (but related) tasks a bit clearer. This should make the code a little easier to change, which will probably happen next because I think I want to remove `WaybackSession` entirely.
I think we actually broke this functionality somewhere, so it needs a test
Copy link
Member Author

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I went ahead and moved the retry/rate limiting functionality down to the adapter layer, which effectively makes WaybackHttpAdapter a replacement for WaybackSession. We could keep the old name, but I think the new name is good and helps create distance from requests.Session (and also makes it clear that all the other Session functionality from requests that someone could have relied on is no longer there).

I also tried to cover my discomfort with all the retry/rate limit stuff being complex to customize an adapter around by making it into a sort of mixin:

class RequestsAdapter(WaybackHttpAdapter):
    # Implements HTTP requests via the "requests" library.
    ...

class RetryAndRateLimitAdapter(WaybackHttpAdapter):
    # Mixin class that implements retrying and rate limiting
    # around the `adapter.request()` method.
    ...

class WaybackRequestsAdapter(RetryAndRateLimitAdapter, DisableAfterCloseAdapter, RequestsAdapter):
    # The actual adapter we want to use.
    ...

That’s a little complicated! But at least it means someone can leverage the existing retry and rate limiting functionality but plug in a different HTTP implementation. I think that’s good?

Anyway, there’s a lot of little cleanup stuff that has to happen here (also docs!!!!). But I think the overall shape of things is more-or-less what we want now.

Comment on lines +145 to +147
@property
def is_memento(self) -> bool:
return 'Memento-Datetime' in self.headers
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This property is not as generic as everything else in this class (which sticks to “HTTP” as the level of abstraction), but it was very helpfully convenient to add here. If we wanted to be more strict we could pull it back out into a utility function.

Comment on lines +303 to +306
class RequestsAdapter(WaybackHttpAdapter):
"""
Wrap the requests library for making HTTP requests.
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have two similarly named classes, which might be kind of dangerous:

  • RequestsAdapter is private and just implements HTTP requests with the requests library. I wonder if it should have an underscore prefix or be renamed in some bigger way for clarity. (I do think keeping it separate is good, since it will get way more complex when we add thread safety in the next PR.)

  • WaybackRequestsAdapter adds retry, rate limiting, etc. to the above class. It is the actual adapter that gets used. I’m also wondering if this should be renamed to something more generic or have a generic alias like WaybackDefaultAdapter. That way the public name is something people can rely on, so they don’t break if/when we swap out requests for raw urllib3 or httpx.

Comment on lines +399 to +404
# XXX: Some of these are requests-specific and should move WaybackRequestsAdapter.
retryable_errors = (ConnectTimeoutError, MaxRetryError, ReadTimeoutError,
ProxyError, RetryError, Timeout)
# Handleable errors *may* be retryable, but need additional logic beyond
# just the error type. See `should_retry_error()`.
handleable_errors = (ConnectionError,) + retryable_errors
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These almost certainly need to move to either RequestsAdapter or WaybackRequestsAdapter in order for this to stay generic.

Comment on lines +422 to +430
# The implementation of different features is split up by method here, so
# `request()` calls down through a stack of overrides:
#
# request (ensure valid types/values/etc.)
# └> _request_redirectable (handle redirects)
# └> _request_retryable (handle retries for errors)
# └> _request_rate_limited (handle rate limiting/throttling)
# └> request_raw (handle actual HTTP)
def request(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks the class up and makes its definition longer, but hopefully it makes it a little more manageable than the way things were done in WaybackSession.

These could also be implemented as totally separate adapters, which might be cleaner? But it felt simpler to just keep them together for now.

This was also partly influenced by my realization that we have a bug where we don’t implement our own redirect following, which means redirects don’t get rate limited (sigh). It this gives us a good place to slide that complexity in later (_request_redirectable() is a pass-through in this PR).

Comment on lines 545 to 571
class WaybackRequestsAdapter(RetryAndRateLimitAdapter, DisableAfterCloseAdapter, RequestsAdapter):
def __init__(
self,
retries: int = 6,
backoff: Real = 2,
timeout: Union[Real, Tuple[Real, Real]] = 60,
user_agent: str = None,
memento_rate_limit: Union[RateLimit, Real] = DEFAULT_MEMENTO_RATE_LIMIT,
search_rate_limit: Union[RateLimit, Real] = DEFAULT_CDX_RATE_LIMIT,
timemap_rate_limit: Union[RateLimit, Real] = DEFAULT_TIMEMAP_RATE_LIMIT,
) -> None:
super().__init__(
retries=retries,
backoff=backoff,
rate_limits={
'/web/timemap': RateLimit.make_limit(timemap_rate_limit),
'/cdx': RateLimit.make_limit(search_rate_limit),
# The memento limit is actually a generic Wayback limit.
'/': RateLimit.make_limit(memento_rate_limit),
}
)
self.timeout = timeout
self.headers = {
'User-Agent': (user_agent or
f'wayback/{__version__} (+https://github.com/edgi-govdata-archiving/wayback)'),
'Accept-Encoding': 'gzip, deflate'
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I’m restructuring things, I wonder a bit whether the defaults for user_agent and the rate limits belong here, or in WaybackClient. As it is, someone wanting to provide a custom adapter needs to make sure they set the right defaults, which are very much tied to “correct” functioning of the overall library.

OTOH, keeping them here keeps configuration simple — you don’t really provide options to WaybackClient other than an instance of WaybackHttpAdapter, and all the various config options live on the adapter.

Comment on lines 582 to 584
) -> WaybackHttpResponse:
timeout = self.timeout if timeout is -1 else timeout
headers = dict(headers or {}, **self.headers)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if these somehow belong on the base WaybackHttpAdapter rather than here. 🤷

@@ -322,26 +320,25 @@ def __exit_all__(self, type, value, traceback):
pass


class DisableAfterCloseSession(requests.Session):
class DisableAfterCloseAdapter:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wound up keeping this but changing the implementation slightly (so it mixes in kind of like RetryAndRateLimitAdapter does). It should probably move to _http.py.

Comment on lines -104 to +106
class SessionClosedError(Exception):
class AlreadyClosedError(Exception):
"""
Raised when a Wayback session is used to make a request after it has been
Raised when a Wayback client is used to make a request after it has been
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn’t make sense to keep “session” in the name, but instead of replacing it with “adapter” I decided to go with making it a little more generic.

Comment on lines +782 to +786
@mock.patch('requests.Session.send', side_effect=return_user_agent)
def test_user_agent(self, mock_class):
adapter = WaybackRequestsAdapter()
agent = adapter.request('GET', 'http://test.com').text
assert agent.startswith('wayback/')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel lucky I even noticed we needed this! We probably should have had a test before — the previous implementation was subtle because it used built-in functionality of requests.Session that changed when we stopped subclassing that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

1 participant