⚡️ Indexd bulk fetching #448
Conversation
Force-pushed from f2da078 to 895e10d
dataservice/api/common/pagination.py
Outdated
dids = [gf[0] for gf in gfs]
indexd.prefetch(dids)

prefetch_indexd()
This looks like it always prefetches twice without needing to. I think you only need the one down inside of while (pager.total > 0 and refresh)?
Yeah, I think this is unnecessarily fetching twice, though the first still has to be there so that it fetches before pulling in from the database when the Pagination object is created.
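For context, the ordering constraint described in this reply looks roughly like the following sketch; Pagination's merge-on-construction behavior is inferred from the thread, not quoted from the diff:

# Sketch: why the first prefetch must precede page construction
prefetch_indexd(after)               # one bulk request warms the client cache
pager = Pagination(q, after, limit)  # construction merges indexd data per item
# Inside construction, each indexd-backed item effectively does:
#   item.merge_indexd()              # cache hit -> no per-item GET to indexd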
dataservice/api/common/pagination.py
Outdated
@@ -112,6 +124,8 @@ def indexd_pagination(q, after, limit):
        next_after = keep[-1].created_at if len(keep) > 0 else after
        # Number of results needed to fulfill the original limit
        remain = limit - len(keep)

        prefetch_indexd()
It looks like this might not be prefetching the right dids here, because prefetch_indexd absorbs the value of after given to indexd_pagination and doesn't track next_after.
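The suggested fix would prefetch from the advancing cursor instead. A hedged sketch of the loop shape, paraphrased from the diff hunks in this thread (the refresh flag follows the while (pager.total > 0 and refresh) guard quoted earlier; this is not the PR's exact code):

prefetch_indexd(after)                # warm the cache for the first page
pager = Pagination(q, after, limit)
keep = []
refresh = True
while pager.total > 0 and refresh:
    refresh = False
    # Move the cursor ahead to the last valid file
    next_after = keep[-1].created_at if len(keep) > 0 else after
    # Number of results needed to fulfill the original limit
    remain = limit - len(keep)
    prefetch_indexd(next_after)       # the fix: advance past the original after
    pager = Pagination(q, next_after, remain)
    for st in pager.items:
        if hasattr(st, 'was_deleted') and st.was_deleted:
            refresh = True            # a deleted doc forces another pass
        else:
            keep.append(st)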
if self.url is None or self.bulk_url is None:
    return
print(self.url, self.bulk_url)
self.page_cache = {}
Clearing the cache here means that the looped calls to indexd.prefetch inside of indexd_pagination will step on each other. Is that a problem? It seems like it would be better to clear the cache at the beginning of indexd_pagination instead.
Yeah, that does seem to be an issue. I feel a little uneasy about clearing the cache outside of the indexd client though.
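One way to keep cache management inside the client while still resetting it per pagination call, roughly what the later revision appears to do with indexd.clear_cache() (the method body here is an assumption):

class Indexd:
    """Minimal stand-in for the indexd client discussed above."""
    def __init__(self):
        self.page_cache = {}

    def clear_cache(self):
        # Drop docs bulk-fetched for a previous page so stale entries
        # cannot leak across pagination calls
        self.page_cache = {}

indexd = Indexd()

def indexd_pagination(q, after, limit):
    indexd.clear_cache()   # reset once, at the top of each pagination call
    ...                    # then prefetch the first page and paginate as above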
# If running in dev mode, don't call indexd
if self.url is None or self.bulk_url is None:
    return
print(self.url, self.bulk_url)
print?
Force-pushed from cbd9207 to d010289
indexd.prefetch(dids)

indexd.clear_cache()
prefetch_indexd(after)
If the indexd bulk fetch were to happen after the Pagination object is initialized instead of before, then you could also speed up empty returns by not checking indexd this first time when total is 0. Is that doable?
Yeah, there are probably a couple of optimizations to be made around reducing the number of db queries here. It would involve adding some sort of branching into the Pagination object, though, and I'd prefer to keep it simple as it's generalized to all entities at the moment.
Right now, if the Pagination object were constructed first, it would result in the old behavior of constructing each object with its own request to indexd. Note that if total is 0, indexd won't actually be called, as the query for gfs will return empty.
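On that last point: since the dids come from the same filtered query as the page, an empty page yields an empty did list and the prefetch can return before any request goes out. A sketch under that assumption (the early-return guard and the latest_did column are illustrative, not quoted from the PR):

def prefetch_indexd(after):
    # Same filtered, limited query the page will use, projected to dids only
    gfs = (q.filter(model.created_at > after)
            .limit(limit)
            .with_entities(model.latest_did)
            .all())
    dids = [gf[0] for gf in gfs]
    if not dids:
        return             # total == 0: nothing to ask indexd for
    indexd.prefetch(dids)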
got it. All clear from me, then. I bet the timing numbers are even better now.
@@ -101,25 +102,41 @@ def indexd_pagination(q, after, limit):

    :returns: A Pagination object
    """
    def prefetch_indexd(after):
Hmm, indexd_pagination is confusing to me now. What I would think you want is:
1. Execute the query to get all of the objs for the page like you normally would with Pagination, without executing the request to indexd per instantiation of an indexd model.
2. Collect the dids from the objs in the page of results.
3. Send the bulk load request out to indexd to get the page of indexd docs.
4. Iterate over the indexd docs and merge with the query results from step 1.
5. Return the results.
I realize you'd have to refactor (maybe remove merge_indexd from the constructor) and decouple things quite a bit (separate the indexd get request from the actual merging of an indexd doc with an indexd model instance). So maybe that's why you didn't do it this way.
But if you were able to implement the above, then you could probably also get rid of the entire while loop that checks for deleted files, right? You would just be able to return the page right away.
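A sketch of the flow proposed above, under stated assumptions: a bulk_get client call, a latest_did column, and a merge_indexd that accepts an already-fetched doc are all illustrative, not the PR's actual API.

def indexd_pagination(q, after, limit):
    # 1. Page the query with no per-item indexd requests
    pager = Pagination(q, after, limit)
    # 2. Collect the dids for the page
    dids = [obj.latest_did for obj in pager.items]
    # 3. One bulk request to indexd for the whole page
    docs = indexd.bulk_get(dids)
    # 4. Merge each doc back into its model instance
    by_did = {doc['did']: doc for doc in docs}
    for obj in pager.items:
        obj.merge_indexd(by_did.get(obj.latest_did))
    # 5. Return the page directly
    return pager

Deleted files would presumably surface as dids missing from by_did, which the merge step would still have to handle, so the refresh loop might not disappear entirely.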
That's true, but it would require decoupling the indexd property loading on construction of the object. Perhaps you could load the indexd properties only when you attempt to access them, but I'm not sure if that is possible.
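The lazy alternative mentioned here could look something like the following in principle (illustrative names, not code from the PR):

class IndexdFile:
    """Model whose indexd-backed fields load on first access."""
    _doc = None

    def _indexd_doc(self):
        if self._doc is None:
            # Deferred: the request to indexd only happens when an
            # indexd-backed property is actually read
            self._doc = indexd.get(self.latest_did)
        return self._doc

    @property
    def size(self):
        return self._indexd_doc().get('size')

    @property
    def hashes(self):
        return self._indexd_doc().get('hashes')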
# Number of results needed to fulfill the original limit
remain = limit - len(keep)
pager = Pagination(q, next_after, remain)

for st in pager.items:
    if hasattr(st, 'was_deleted') and st.was_deleted:
Isn't this never going to happen now? Before the loop, you first populate the indexd page cache. Then you construct the Pagination obj, which results in a bunch of calls to indexd.get() which results in a bunch of indexd page cache lookups which means the requests to indexd never go out. So then you won't know if a file was deleted?
I believe this will still happen because merge_indexd() is still called for every object in the page; it just reads from the cache when available. If a did is not in the cache, it will attempt to get it from indexd, in which case indexd will return a 404 and the object will be marked as was_deleted.
Will look more closely at this to confirm.
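The behavior described would read roughly like this (a sketch; the cache shape and indexd.get returning None on a 404 are assumptions inferred from the thread):

def merge_indexd(self):
    # Called for every object in the page
    doc = indexd.page_cache.get(self.latest_did)   # bulk-prefetched hit?
    if doc is None:
        doc = indexd.get(self.latest_did)          # falls back to a GET
    if doc is None:                                # indexd answered 404
        self.was_deleted = True
        return self
    for key, value in doc.items():                 # copy doc fields onto model
        setattr(self, key, value)
    return self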
Is this going to be finalized?
Force-pushed from d010289 to 33be6ba
I think it could be refactored, but it doesn't need to be. I think the last comment I made still needs to be addressed, though. @dankolbman was going to confirm.
Moving to backlog.
Need to release the latest gen3 into production before this release.
Using local indexd:
- Single request per entity in result
- Using page prefetching
Fixes #442