refactor: Make SimpleRetriever thread-safe so that different partitions can share the same SimpleRetriever #185
What
Right now, because the SimpleRetriever manages and mutates its own internal state, multiple partitions running at the same time cannot share the same SimpleRetriever. Otherwise we might lose data, because that state (page number, etc.) is modified by different partitions concurrently. We've seen this problem arise in some connections (link issue), and we temporarily solved it by having every Partition instantiate its own `SimpleRetriever`. This, however, is very inefficient for a couple of reasons, such as needing to perform auth for every partition (link issue). It is also a hard blocker for AsyncRetriever, which must be shared across partitions in order to manage the shared job repository.
This PR replaces the internal state of `SimpleRetriever` and all of its dependencies, which is managed via a `token` field, and makes all methods stateless by passing the `next_page_token` as a parameter instead.
How
Refactor retrievers, paginators, and pagination strategy to be stateless
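To illustrate the idea, here is a minimal sketch of stateless pagination. It is not the actual CDK code: `StatelessPaginator`, `StatelessRetriever`, `FAKE_API`, and `read_all` are hypothetical names made up for this example. The point it shows is that the page token lives in a local variable of each `read_records` call and is passed to the paginator as an argument, so one retriever instance can be shared by several partitions running in different threads.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional

PAGE_SIZE = 2

# Hypothetical in-memory "API": partition name -> all records for it.
FAKE_API: Dict[str, List[int]] = {
    "a": [1, 2, 3, 4, 5],
    "b": [10, 20, 30],
}


@dataclass(frozen=True)
class StatelessPaginator:
    """Hypothetical paginator that keeps no per-request state."""

    page_size: int

    def next_page_token(
        self, last_page_size: int, last_token: Optional[int]
    ) -> Optional[int]:
        # A short page means there is nothing left to fetch.
        if last_page_size < self.page_size:
            return None
        # The next token is derived from the arguments, not from self.
        return (last_token or 0) + 1


@dataclass(frozen=True)
class StatelessRetriever:
    """Hypothetical retriever that can be shared across partitions."""

    paginator: StatelessPaginator

    def fetch_page(self, partition: str, token: Optional[int]) -> List[int]:
        start = (token or 0) * self.paginator.page_size
        return FAKE_API[partition][start : start + self.paginator.page_size]

    def read_records(self, partition: str) -> List[int]:
        records: List[int] = []
        token: Optional[int] = None  # local to this call, never stored on self
        while True:
            page = self.fetch_page(partition, token)
            records.extend(page)
            token = self.paginator.next_page_token(len(page), token)
            if token is None:
                return records


def read_all(partitions: List[str]) -> Dict[str, List[int]]:
    # A single retriever instance is shared by every partition thread.
    retriever = StatelessRetriever(StatelessPaginator(PAGE_SIZE))
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(pool.map(retriever.read_records, partitions))
    return dict(zip(partitions, results))


if __name__ == "__main__":
    print(read_all(["a", "b"]))
```

Because both classes are frozen dataclasses and the token is only ever a local variable, no partition can overwrite another partition's page position, which is the property the refactor is after.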