Remove `requests` from public interfaces #157

Mr0grog · 2023-12-30T00:35:27Z

WaybackSession was originally a subclass of requests.Session, and that turns out to have been a poor decision (see #152 (comment)). We wanted to make customizing HTTP behavior relatively straightforward, but using a subclass-able version of requests.Session wound up baking problematic parts of requests into our API. In particular, requests is not thread-safe and this setup makes it hard to wrap with thread-safe code. We also unintentionally broke the promise of easy subclassing when we added complex retry and rate limiting behavior.

This is a major, breaking change designed to largely remove requests from our public API:

WaybackSession is now its own, standalone class instead of a subclass of requests.Session. It handles stuff like rate limiting, retries, etc.
WaybackHttpAdapter handles making a single HTTP request and is the interface through which we use the requests library.
- It has a very small API (2 methods: .request() sends an HTTP request; .close() closes all its network connections).
- The .request() method returns a WaybackHttpResponse object, which models just the parts of the response we need. The WaybackHttpResponse class is an abstract base class that represents the API that the rest of Wayback needs, and internally we have a WaybackRequestsResponse subclass that wraps a response object from Requests in a thread-safe way.
- Ideally, the small API will make it easy for us to build an alternate implementation with urllib3, httpx, etc, or just to add thread-safe handling of requests. It might also be a better way to let users customize HTTP behavior than the old approach, although you can’t yet supply a custom subclass to WaybackClient or WaybackSession.

❗ Open question: I don’t think that WaybackSession should continue to exist. It makes the API complex and it lets the details of requests leak out a little bit through exception handling. Any feedback on this is welcome (cc @danielballan). Some ideas:

Move its functionality directly into WaybackHttpAdapter.
- Upside: this gets all the requests stuff — including exception handling — tucked away into one place.
- Downside: Safely customizing WaybackHttpAdapter gets more complex because you have to work around all that machinery. It’s not as straightforward as just overriding the .request() method. Some ways to address that:
  - Add a WaybackHttpAdapter._request() method (not the underscore) that you can override if you just want to customize the sending of a single HTTP request.
  - Re-architect that functionality into some kind of service/helper object that you can use from a customized WaybackHttpAdapter subclass. I think this might be a lot more work than it's worth, though.
Move its functionality into WaybackClient.
- Upside: keeps all the things that are special about dealing with HTTP requests at a high level out of the lower-level adapter tooling. Keeps adapters as simple as possible.
- Downside: Lets some exception handling that is specific to requests leak out into the rest of the package. Makes it harder to change that behavior. What if httpx or some other tool incorporates rate limiting in the future? It would be nice to let an adapter delegate those tasks to the lower-level tool it is acting as glue for.

I think I’m leaning towards some version of (1) here.

This paves the way towards fixing #58, but does not actually solve the issue (WaybackHttpAdapter is not yet thread-safe, and needs to use some of the tricks from #23).

This also provides some basic (but not yet fully implemented) thread-safety for response objects.

Also re-organize methods a bit

Mr0grog · 2023-12-30T00:39:30Z

src/wayback/_client.py

-    # Customize the built-in `send()` with retryability and rate limiting.
-    def send(self, request: requests.PreparedRequest, **kwargs):
+    def request(self, method, url, timeout=-1, **kwargs) -> WaybackHttpResponse:
+        super().request()


This super() call is really just to keep _utils.DisableAfterCloseSession working, and I think the reality here is that we should probably just get rid of that utility. It provides some nice guardrails, but they don’t offer a huge amount of value.

Mr0grog · 2023-12-30T00:44:56Z

src/wayback/_client.py

    def reset(self):
        "Reset any network connections the session is using."
-        # Close really just closes all the adapters in `self.adapters`. We
-        # could do that directly, but `self.adapters` is not documented/public,
-        # so might be somewhat risky.
        self.close(disable=False)
-        # Re-build the standard adapters. See:
-        # https://github.com/kennethreitz/requests/blob/v2.22.0/requests/sessions.py#L415-L418
-        self.mount('https://', requests.adapters.HTTPAdapter())
-        self.mount('http://', requests.adapters.HTTPAdapter())


IIRC, I added this to help with some problems we were having managing connections in BIG jobs at EDGI, and that were later solved with better fixes. It’s kind of a weird method, and its presence is even weirder with the refactoring here. I think I should just drop it entirely.

src/wayback/_http.py

Mr0grog · 2023-12-30T00:50:49Z

src/wayback/_http.py

+        timeout : int or float or tuple of (int or float, int or float), optional
+            How long to wait, in seconds, before timing out. If this is a single
+            number, it will be used as both a connect timeout (how long to wait
+            for the first bit of response data) and a read timeout (how long
+            to wait between bits of response data). If a tuple, the values will
+            be used as the connect and read timeouts, respectively.


I’m a little worried this is too narrowly designed around how requests does it. It's probably not a huge deal, though, and keeps the timeout interface the same-ish as it is today.

Mr0grog · 2023-12-30T00:52:59Z

src/wayback/tests/test_client.py

-        assert response.ok
+        assert response.is_success


This change is a bit arbitrary, but I was inspired by httpx's reasoning for the same change: https://www.python-httpx.org/compatibility/#checking-for-success-and-failure-responses

Considering whether we should also do this in the Memento class (maybe just a deprecation of .ok there, though).

Mr0grog · 2023-12-30T00:59:06Z

src/wayback/_http.py

+class WaybackHttpResponse:
+    """
+    Represents an HTTP response from a server. This might be included as an
+    attribute of an exception, but should otherwise not be exposed to user
+    code in normal circumstances. It's meant to wrap to provide a standard,
+    thread-safe interface to response objects from whatever underlying HTTP
+    tool is being used (e.g. requests, httpx, etc.).
+    """
+    status_code: int
+    headers: dict
+    encoding: Optional[str] = None
+    url: str
+    links: dict


Wondering if this class should move to _models, especially if we make customizing WaybackHttpAdapter a thing. (Since you’ll need to make your own subclass of this to return from an adapter that doesn’t use Requests.)

Sessions currently handle rate limiting and retries (and *need* to handle redirects so those get rate limited; it's a bug that they do not) and their implementation is a bit complicated. This separates out each of those jobs into a separate method to make management of those separate (but related) tasks a bit clearer. This should make the code a little easier to change, which will probably happen next because I think I want to remove `WaybackSession` entirely.

I think we actually broke this functionality somewhere, so it needs a test

Mr0grog

OK, I went ahead and moved the retry/rate limiting functionality down to the adapter layer, which effectively makes WaybackHttpAdapter a replacement for WaybackSession. We could keep the old name, but I think the new name is good and helps create distance from requests.Session (and also makes it clear that all the other Session functionality from requests that someone could have relied on is no longer there).

I also tried to cover my discomfort with all the retry/rate limit stuff being complex to customize an adapter around by making it into a sort of mixin:

class RequestsAdapter(WaybackHttpAdapter):
    # Implements HTTP requests via the "requests" library.
    ...

class RetryAndRateLimitAdapter(WaybackHttpAdapter):
    # Mixin class that implements retrying and rate limiting
    # around the `adapter.request()` method.
    ...

class WaybackRequestsAdapter(RetryAndRateLimitAdapter, DisableAfterCloseAdapter, RequestsAdapter):
    # The actual adapter we want to use.
    ...

That’s a little complicated! But at least it means someone can leverage the existing retry and rate limiting functionality but plug in a different HTTP implementation. I think that’s good?

Anyway, there’s a lot of little cleanup stuff that has to happen here (also docs!!!!). But I think the overall shape of things is more-or-less what we want now.

Mr0grog · 2024-01-05T00:14:37Z

src/wayback/_http.py

+    @property
+    def is_memento(self) -> bool:
+        return 'Memento-Datetime' in self.headers


This property is not as generic as everything else in this class (which sticks to “HTTP” as the level of abstraction), but it was very helpfully convenient to add here. If we wanted to be more strict we could pull it back out into a utility function.

Mr0grog · 2024-01-05T00:24:01Z

src/wayback/_http.py

+class RequestsAdapter(WaybackHttpAdapter):
+    """
+    Wrap the requests library for making HTTP requests.
+    """


We now have two similarly named classes, which might be kind of dangerous:

RequestsAdapter is private and just implements HTTP requests with the requests library. I wonder if it should have an underscore prefix or be renamed in some bigger way for clarity. (I do think keeping it separate is good, since it will get way more complex when we add thread safety in the next PR.)

WaybackRequestsAdapter adds retry, rate limiting, etc. to the above class. It is the actual adapter that gets used. I’m also wondering if this should be renamed to something more generic or have a generic alias like WaybackDefaultAdapter. That way the public name is something people can rely on, so they don’t break if/when we swap out requests for raw urllib3 or httpx.

Mr0grog · 2024-01-05T00:24:29Z

src/wayback/_http.py

+    # XXX: Some of these are requests-specific and should move WaybackRequestsAdapter.
+    retryable_errors = (ConnectTimeoutError, MaxRetryError, ReadTimeoutError,
+                        ProxyError, RetryError, Timeout)
+    # Handleable errors *may* be retryable, but need additional logic beyond
+    # just the error type. See `should_retry_error()`.
+    handleable_errors = (ConnectionError,) + retryable_errors


These almost certainly need to move to either RequestsAdapter or WaybackRequestsAdapter in order for this to stay generic.

Mr0grog · 2024-01-05T00:38:47Z

src/wayback/_http.py

+    # The implementation of different features is split up by method here, so
+    # `request()` calls down through a stack of overrides:
+    #
+    #   request                        (ensure valid types/values/etc.)
+    #   └> _request_redirectable       (handle redirects)
+    #      └> _request_retryable       (handle retries for errors)
+    #         └> _request_rate_limited (handle rate limiting/throttling)
+    #            └> request_raw        (handle actual HTTP)
+    def request(


This breaks the class up and makes its definition longer, but hopefully it makes it a little more manageable than the way things were done in WaybackSession.

These could also be implemented as totally separate adapters, which might be cleaner? But it felt simpler to just keep them together for now.

This was also partly influenced by my realization that we have a bug where we don’t implement our own redirect following, which means redirects don’t get rate limited (sigh). It this gives us a good place to slide that complexity in later (_request_redirectable() is a pass-through in this PR).

Mr0grog · 2024-01-05T00:48:30Z

src/wayback/_http.py

+class WaybackRequestsAdapter(RetryAndRateLimitAdapter, DisableAfterCloseAdapter, RequestsAdapter):
+    def __init__(
+        self,
+        retries: int = 6,
+        backoff: Real = 2,
+        timeout: Union[Real, Tuple[Real, Real]] = 60,
+        user_agent: str = None,
+        memento_rate_limit: Union[RateLimit, Real] = DEFAULT_MEMENTO_RATE_LIMIT,
+        search_rate_limit: Union[RateLimit, Real] = DEFAULT_CDX_RATE_LIMIT,
+        timemap_rate_limit: Union[RateLimit, Real] = DEFAULT_TIMEMAP_RATE_LIMIT,
+    ) -> None:
+        super().__init__(
+            retries=retries,
+            backoff=backoff,
+            rate_limits={
+                '/web/timemap': RateLimit.make_limit(timemap_rate_limit),
+                '/cdx': RateLimit.make_limit(search_rate_limit),
+                # The memento limit is actually a generic Wayback limit.
+                '/': RateLimit.make_limit(memento_rate_limit),
+            }
+        )
+        self.timeout = timeout
+        self.headers = {
+            'User-Agent': (user_agent or
+                           f'wayback/{__version__} (+https://github.com/edgi-govdata-archiving/wayback)'),
+            'Accept-Encoding': 'gzip, deflate'
+        }


Now that I’m restructuring things, I wonder a bit whether the defaults for user_agent and the rate limits belong here, or in WaybackClient. As it is, someone wanting to provide a custom adapter needs to make sure they set the right defaults, which are very much tied to “correct” functioning of the overall library.

OTOH, keeping them here keeps configuration simple — you don’t really provide options to WaybackClient other than an instance of WaybackHttpAdapter, and all the various config options live on the adapter.

Mr0grog · 2024-01-05T00:50:20Z

src/wayback/_http.py

+    ) -> WaybackHttpResponse:
+        timeout = self.timeout if timeout is -1 else timeout
+        headers = dict(headers or {}, **self.headers)


Wondering if these somehow belong on the base WaybackHttpAdapter rather than here. 🤷

Mr0grog · 2024-01-05T00:51:15Z

src/wayback/_utils.py

@@ -322,26 +320,25 @@ def __exit_all__(self, type, value, traceback):
        pass


-class DisableAfterCloseSession(requests.Session):
+class DisableAfterCloseAdapter:


I wound up keeping this but changing the implementation slightly (so it mixes in kind of like RetryAndRateLimitAdapter does). It should probably move to _http.py.

Mr0grog · 2024-01-05T00:52:40Z

src/wayback/exceptions.py

-class SessionClosedError(Exception):
+class AlreadyClosedError(Exception):
    """
-    Raised when a Wayback session is used to make a request after it has been
+    Raised when a Wayback client is used to make a request after it has been


It didn’t make sense to keep “session” in the name, but instead of replacing it with “adapter” I decided to go with making it a little more generic.

Mr0grog · 2024-01-05T00:54:17Z

src/wayback/tests/test_client.py

+    @mock.patch('requests.Session.send', side_effect=return_user_agent)
+    def test_user_agent(self, mock_class):
+        adapter = WaybackRequestsAdapter()
+        agent = adapter.request('GET', 'http://test.com').text
+        assert agent.startswith('wayback/')


I feel lucky I even noticed we needed this! We probably should have had a test before — the previous implementation was subtle because it used built-in functionality of requests.Session that changed when we stopped subclassing that.

Mr0grog added 8 commits December 26, 2023 09:33

Create new _http module, move hacks to it

bdbdc06

Move all *requests* stuff (except errors) to new adapter

18527aa

Update docs for WaybackSession

01e6d83

Add InternalHttpResponse to provide a non-requests Response interface

1eb3e17

This also provides some basic (but not yet fully implemented) thread-safety for response objects.

Delegate to requests for encoding sniffing

ccc8f40

Also re-organize methods a bit

Fix locking and caching issues in close()

0f3ef22

Split response API from reqeusts-specific implementation

172e6a6

Add docs and type annotations to adapters

4843ed4

Mr0grog commented Dec 30, 2023

View reviewed changes

Mr0grog added 7 commits January 2, 2024 11:32

Clarify redirect_url docs

de28981

Rename allow_redirects -> follow_redirects

f2eb7e5

Limit noqa to just the unused import rule

086e034

Add user agent test

2f48476

I think we actually broke this functionality somewhere, so it needs a test

Move all WaybackSession functionality to adapters

f43ba0a

Forgot the dict merge operator is Python 3.9+

0ec7811

Mr0grog commented Jan 5, 2024

View reviewed changes

Mr0grog added 2 commits January 4, 2024 17:18

Delegate default header handling to requests

cf6d879

Oops, stream should be true

994c8b7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove `requests` from public interfaces #157

Remove `requests` from public interfaces #157

Mr0grog commented Dec 30, 2023

Mr0grog Dec 30, 2023

Mr0grog Dec 30, 2023

Mr0grog Dec 30, 2023

Mr0grog Dec 30, 2023

Mr0grog Dec 30, 2023

Mr0grog left a comment

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Mr0grog Jan 5, 2024

Remove requests from public interfaces #157

Are you sure you want to change the base?

Remove requests from public interfaces #157

Conversation

Mr0grog commented Dec 30, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mr0grog left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Remove `requests` from public interfaces #157

Remove `requests` from public interfaces #157