-
-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove requests
from public interfaces
#157
base: main
Are you sure you want to change the base?
Conversation
This also provides some basic (but not yet fully implemented) thread-safety for response objects.
Also re-organize methods a bit
src/wayback/_client.py
Outdated
# Customize the built-in `send()` with retryability and rate limiting. | ||
def send(self, request: requests.PreparedRequest, **kwargs): | ||
def request(self, method, url, timeout=-1, **kwargs) -> WaybackHttpResponse: | ||
super().request() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This super()
call is really just to keep _utils.DisableAfterCloseSession
working, and I think the reality here is that we should probably just get rid of that utility. It provides some nice guardrails, but they don’t offer a huge amount of value.
src/wayback/_client.py
Outdated
def reset(self): | ||
"Reset any network connections the session is using." | ||
# Close really just closes all the adapters in `self.adapters`. We | ||
# could do that directly, but `self.adapters` is not documented/public, | ||
# so might be somewhat risky. | ||
self.close(disable=False) | ||
# Re-build the standard adapters. See: | ||
# https://github.com/kennethreitz/requests/blob/v2.22.0/requests/sessions.py#L415-L418 | ||
self.mount('https://', requests.adapters.HTTPAdapter()) | ||
self.mount('http://', requests.adapters.HTTPAdapter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, I added this to help with some problems we were having managing connections in BIG jobs at EDGI, and that were later solved with better fixes. It’s kind of a weird method, and its presence is even weirder with the refactoring here. I think I should just drop it entirely.
timeout : int or float or tuple of (int or float, int or float), optional | ||
How long to wait, in seconds, before timing out. If this is a single | ||
number, it will be used as both a connect timeout (how long to wait | ||
for the first bit of response data) and a read timeout (how long | ||
to wait between bits of response data). If a tuple, the values will | ||
be used as the connect and read timeouts, respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m a little worried this is too narrowly designed around how requests does it. It's probably not a huge deal, though, and keeps the timeout interface the same-ish as it is today.
assert response.ok | ||
assert response.is_success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is a bit arbitrary, but I was inspired by httpx's reasoning for the same change: https://www.python-httpx.org/compatibility/#checking-for-success-and-failure-responses
Considering whether we should also do this in the Memento
class (maybe just a deprecation of .ok
there, though).
class WaybackHttpResponse: | ||
""" | ||
Represents an HTTP response from a server. This might be included as an | ||
attribute of an exception, but should otherwise not be exposed to user | ||
code in normal circumstances. It's meant to wrap to provide a standard, | ||
thread-safe interface to response objects from whatever underlying HTTP | ||
tool is being used (e.g. requests, httpx, etc.). | ||
""" | ||
status_code: int | ||
headers: dict | ||
encoding: Optional[str] = None | ||
url: str | ||
links: dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if this class should move to _models
, especially if we make customizing WaybackHttpAdapter
a thing. (Since you’ll need to make your own subclass of this to return from an adapter that doesn’t use Requests.)
Sessions currently handle rate limiting and retries (and *need* to handle redirects so those get rate limited; it's a bug that they do not) and their implementation is a bit complicated. This separates out each of those jobs into a separate method to make management of those separate (but related) tasks a bit clearer. This should make the code a little easier to change, which will probably happen next because I think I want to remove `WaybackSession` entirely.
I think we actually broke this functionality somewhere, so it needs a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I went ahead and moved the retry/rate limiting functionality down to the adapter layer, which effectively makes WaybackHttpAdapter
a replacement for WaybackSession
. We could keep the old name, but I think the new name is good and helps create distance from requests.Session
(and also makes it clear that all the other Session functionality from requests
that someone could have relied on is no longer there).
I also tried to cover my discomfort with all the retry/rate limit stuff being complex to customize an adapter around by making it into a sort of mixin:
class RequestsAdapter(WaybackHttpAdapter):
# Implements HTTP requests via the "requests" library.
...
class RetryAndRateLimitAdapter(WaybackHttpAdapter):
# Mixin class that implements retrying and rate limiting
# around the `adapter.request()` method.
...
class WaybackRequestsAdapter(RetryAndRateLimitAdapter, DisableAfterCloseAdapter, RequestsAdapter):
# The actual adapter we want to use.
...
That’s a little complicated! But at least it means someone can leverage the existing retry and rate limiting functionality but plug in a different HTTP implementation. I think that’s good?
Anyway, there’s a lot of little cleanup stuff that has to happen here (also docs!!!!). But I think the overall shape of things is more-or-less what we want now.
@property | ||
def is_memento(self) -> bool: | ||
return 'Memento-Datetime' in self.headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This property is not as generic as everything else in this class (which sticks to “HTTP” as the level of abstraction), but it was very helpfully convenient to add here. If we wanted to be more strict we could pull it back out into a utility function.
class RequestsAdapter(WaybackHttpAdapter): | ||
""" | ||
Wrap the requests library for making HTTP requests. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have two similarly named classes, which might be kind of dangerous:
-
RequestsAdapter
is private and just implements HTTP requests with the requests library. I wonder if it should have an underscore prefix or be renamed in some bigger way for clarity. (I do think keeping it separate is good, since it will get way more complex when we add thread safety in the next PR.) -
WaybackRequestsAdapter
adds retry, rate limiting, etc. to the above class. It is the actual adapter that gets used. I’m also wondering if this should be renamed to something more generic or have a generic alias likeWaybackDefaultAdapter
. That way the public name is something people can rely on, so they don’t break if/when we swap out requests for raw urllib3 or httpx.
# XXX: Some of these are requests-specific and should move WaybackRequestsAdapter. | ||
retryable_errors = (ConnectTimeoutError, MaxRetryError, ReadTimeoutError, | ||
ProxyError, RetryError, Timeout) | ||
# Handleable errors *may* be retryable, but need additional logic beyond | ||
# just the error type. See `should_retry_error()`. | ||
handleable_errors = (ConnectionError,) + retryable_errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These almost certainly need to move to either RequestsAdapter
or WaybackRequestsAdapter
in order for this to stay generic.
# The implementation of different features is split up by method here, so | ||
# `request()` calls down through a stack of overrides: | ||
# | ||
# request (ensure valid types/values/etc.) | ||
# └> _request_redirectable (handle redirects) | ||
# └> _request_retryable (handle retries for errors) | ||
# └> _request_rate_limited (handle rate limiting/throttling) | ||
# └> request_raw (handle actual HTTP) | ||
def request( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks the class up and makes its definition longer, but hopefully it makes it a little more manageable than the way things were done in WaybackSession
.
These could also be implemented as totally separate adapters, which might be cleaner? But it felt simpler to just keep them together for now.
This was also partly influenced by my realization that we have a bug where we don’t implement our own redirect following, which means redirects don’t get rate limited (sigh). It this gives us a good place to slide that complexity in later (_request_redirectable()
is a pass-through in this PR).
src/wayback/_http.py
Outdated
class WaybackRequestsAdapter(RetryAndRateLimitAdapter, DisableAfterCloseAdapter, RequestsAdapter): | ||
def __init__( | ||
self, | ||
retries: int = 6, | ||
backoff: Real = 2, | ||
timeout: Union[Real, Tuple[Real, Real]] = 60, | ||
user_agent: str = None, | ||
memento_rate_limit: Union[RateLimit, Real] = DEFAULT_MEMENTO_RATE_LIMIT, | ||
search_rate_limit: Union[RateLimit, Real] = DEFAULT_CDX_RATE_LIMIT, | ||
timemap_rate_limit: Union[RateLimit, Real] = DEFAULT_TIMEMAP_RATE_LIMIT, | ||
) -> None: | ||
super().__init__( | ||
retries=retries, | ||
backoff=backoff, | ||
rate_limits={ | ||
'/web/timemap': RateLimit.make_limit(timemap_rate_limit), | ||
'/cdx': RateLimit.make_limit(search_rate_limit), | ||
# The memento limit is actually a generic Wayback limit. | ||
'/': RateLimit.make_limit(memento_rate_limit), | ||
} | ||
) | ||
self.timeout = timeout | ||
self.headers = { | ||
'User-Agent': (user_agent or | ||
f'wayback/{__version__} (+https://github.com/edgi-govdata-archiving/wayback)'), | ||
'Accept-Encoding': 'gzip, deflate' | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I’m restructuring things, I wonder a bit whether the defaults for user_agent
and the rate limits belong here, or in WaybackClient
. As it is, someone wanting to provide a custom adapter needs to make sure they set the right defaults, which are very much tied to “correct” functioning of the overall library.
OTOH, keeping them here keeps configuration simple — you don’t really provide options to WaybackClient
other than an instance of WaybackHttpAdapter
, and all the various config options live on the adapter.
src/wayback/_http.py
Outdated
) -> WaybackHttpResponse: | ||
timeout = self.timeout if timeout is -1 else timeout | ||
headers = dict(headers or {}, **self.headers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if these somehow belong on the base WaybackHttpAdapter
rather than here. 🤷
@@ -322,26 +320,25 @@ def __exit_all__(self, type, value, traceback): | |||
pass | |||
|
|||
|
|||
class DisableAfterCloseSession(requests.Session): | |||
class DisableAfterCloseAdapter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wound up keeping this but changing the implementation slightly (so it mixes in kind of like RetryAndRateLimitAdapter
does). It should probably move to _http.py
.
class SessionClosedError(Exception): | ||
class AlreadyClosedError(Exception): | ||
""" | ||
Raised when a Wayback session is used to make a request after it has been | ||
Raised when a Wayback client is used to make a request after it has been |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn’t make sense to keep “session” in the name, but instead of replacing it with “adapter” I decided to go with making it a little more generic.
@mock.patch('requests.Session.send', side_effect=return_user_agent) | ||
def test_user_agent(self, mock_class): | ||
adapter = WaybackRequestsAdapter() | ||
agent = adapter.request('GET', 'http://test.com').text | ||
assert agent.startswith('wayback/') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel lucky I even noticed we needed this! We probably should have had a test before — the previous implementation was subtle because it used built-in functionality of requests.Session
that changed when we stopped subclassing that.
WaybackSession
was originally a subclass ofrequests.Session
, and that turns out to have been a poor decision (see #152 (comment)). We wanted to make customizing HTTP behavior relatively straightforward, but using a subclass-able version ofrequests.Session
wound up baking problematic parts of requests into our API. In particular, requests is not thread-safe and this setup makes it hard to wrap with thread-safe code. We also unintentionally broke the promise of easy subclassing when we added complex retry and rate limiting behavior.This is a major, breaking change designed to largely remove requests from our public API:
WaybackSession
is now its own, standalone class instead of a subclass ofrequests.Session
. It handles stuff like rate limiting, retries, etc.WaybackHttpAdapter
handles making a single HTTP request and is the interface through which we use the requests library.It has a very small API (2 methods:
.request()
sends an HTTP request;.close()
closes all its network connections).The
.request()
method returns aWaybackHttpResponse
object, which models just the parts of the response we need. TheWaybackHttpResponse
class is an abstract base class that represents the API that the rest of Wayback needs, and internally we have aWaybackRequestsResponse
subclass that wraps a response object from Requests in a thread-safe way.Ideally, the small API will make it easy for us to build an alternate implementation with urllib3, httpx, etc, or just to add thread-safe handling of requests. It might also be a better way to let users customize HTTP behavior than the old approach, although you can’t yet supply a custom subclass to
WaybackClient
orWaybackSession
.❗ Open question: I don’t think that
WaybackSession
should continue to exist. It makes the API complex and it lets the details of requests leak out a little bit through exception handling. Any feedback on this is welcome (cc @danielballan). Some ideas:Move its functionality directly into
WaybackHttpAdapter
.WaybackHttpAdapter
gets more complex because you have to work around all that machinery. It’s not as straightforward as just overriding the.request()
method. Some ways to address that:WaybackHttpAdapter._request()
method (not the underscore) that you can override if you just want to customize the sending of a single HTTP request.WaybackHttpAdapter
subclass. I think this might be a lot more work than it's worth, though.Move its functionality into
WaybackClient
.I think I’m leaning towards some version of (1) here.
This paves the way towards fixing #58, but does not actually solve the issue (
WaybackHttpAdapter
is not yet thread-safe, and needs to use some of the tricks from #23).