
⚡️ Indexd bulk fetching #448

Closed · wants to merge 2 commits (from the indexd-batch branch)

Conversation

@dankolbman (Contributor) commented on Sep 21, 2018:

The latest gen3 needs to be released into production before this is released.

Using local indexd (single request per entity in result):

Getting 15719 results from /genomic-files in size of 100
Expected requests: 157
150/157
Total Requests: 158
Total Elapsed time: 232.80341982841492
Average Resp time: 1.4707968987341766

Using page prefetching:

Getting 15719 results from /genomic-files in size of 100
Expected requests: 157
150/157
Total Requests: 158
Total Elapsed time: 32.46433401107788
Average Resp time: 0.2029419873417721

Fixes #442
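For readers outside the diff: the speedup comes from replacing one indexd request per entity with a single bulk request per page, cached on the client. A minimal sketch of that idea, assuming a hypothetical bulk endpoint that accepts a JSON list of dids; url, bulk_url, and page_cache mirror names that appear in the diff hunks below, but the request and response shapes are guesses:

    import requests

    class IndexdClient:
        """Sketch: serve doc lookups from a per-page prefetch cache."""

        def __init__(self, url=None, bulk_url=None):
            self.url = url            # base URL for single-doc GETs
            self.bulk_url = bulk_url  # hypothetical bulk-fetch endpoint
            self.page_cache = {}

        def clear_cache(self):
            self.page_cache = {}

        def prefetch(self, dids):
            # Load a whole page of indexd docs with one bulk request
            if self.url is None or self.bulk_url is None:
                return  # dev mode: indexd not configured
            resp = requests.post(self.bulk_url, json=dids)
            for doc in resp.json():
                self.page_cache[doc['did']] = doc

        def get(self, did):
            # Serve from the prefetch cache; fall back to one GET per did
            if did in self.page_cache:
                return self.page_cache[did]
            return requests.get(self.url + did).json()

With ~100 entities per page, each page then costs one bulk POST instead of ~100 GETs, which is consistent with the roughly 7x drop in elapsed time above.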

@dankolbman added the feature (New functionality) and refactor (Something needs to be done better) labels on Sep 21, 2018
@dankolbman self-assigned this on Sep 21, 2018
@dankolbman force-pushed the indexd-batch branch 6 times, most recently from f2da078 to 895e10d, on September 24, 2018
@dankolbman requested a review from @fiendish on October 10, 2018
dids = [gf[0] for gf in gfs]
indexd.prefetch(dids)

prefetch_indexd()
@fiendish (Contributor) commented on Oct 10, 2018:
This looks like it always prefetches twice without needing to. I think you only need the one down inside of while (pager.total > 0 and refresh)?

@dankolbman (Contributor, Author) replied:
Yeah, I think this is unnecessarily fetching twice, though the first call still has to be there so that it fetches before pulling in from the database when the Pagination object is created.
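For context, a rough reconstruction of the control flow being discussed, pieced together from the diff hunks quoted in this review; the refresh/keep bookkeeping and the import path are assumptions, while the hunk lines quoted in this PR are kept verbatim:

    from dataservice.api.common.pagination import Pagination  # assumed path

    def indexd_pagination(q, after, limit):
        def prefetch_indexd(after):
            ...  # bulk-load the indexd docs for the upcoming page

        prefetch_indexd(after)               # first prefetch: must run before
        pager = Pagination(q, after, limit)  # Pagination merges indexd docs
        keep = [st for st in pager.items
                if not (hasattr(st, 'was_deleted') and st.was_deleted)]
        refresh = len(keep) < limit          # page short due to deleted files

        while pager.total > 0 and refresh:
            next_after = keep[-1].created_at if len(keep) > 0 else after
            remain = limit - len(keep)       # results needed to fill the limit
            prefetch_indexd(next_after)      # the per-iteration prefetch
            pager = Pagination(q, next_after, remain)
            for st in pager.items:
                if hasattr(st, 'was_deleted') and st.was_deleted:
                    continue
                keep.append(st)
            refresh = len(keep) < limit

        pager.items = keep[:limit]
        return pager

The first prefetch_indexd(after) is the call dankolbman refers to: it has to run before the Pagination constructor so the page cache is warm when each model merges its indexd doc.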

@@ -112,6 +124,8 @@ def indexd_pagination(q, after, limit):
next_after = keep[-1].created_at if len(keep) > 0 else after
# Number of results needed to fulfill the original limit
remain = limit - len(keep)

prefetch_indexd()
@fiendish (Contributor) commented:
It looks like this might not be prefetching the right dids here because prefetch_indexd absorbs the value of after given to indexd_pagination and doesn't track next_after
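The later hunks show the fix: thread the cursor through as a parameter and call prefetch_indexd(next_after) inside the loop. A sketch of the parameterized helper, closing over q and limit from indexd_pagination; the GenomicFile model, its latest_did column, the module-level indexd client, and the exact query shape are all assumptions:

    def prefetch_indexd(after):
        # Bulk-load indexd docs for the page that starts after `after`
        gfs = (q.filter(GenomicFile.created_at > after)
                .with_entities(GenomicFile.latest_did)
                .limit(limit)
                .all())
        dids = [gf[0] for gf in gfs]  # with_entities rows are one-column tuples
        indexd.prefetch(dids)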

if self.url is None or self.bulk_url is None:
    return
print(self.url, self.bulk_url)
self.page_cache = {}
@fiendish (Contributor) commented:
Clearing the cache here means that the looped calls to indexd.prefetch inside of indexd_pagination will step on each other. Is that a problem? It seems like it would be better to clear the cache at the beginning of indexd_pagination instead.

@dankolbman (Contributor, Author) replied:
Yeah, that does seem to be an issue. I feel a little uneasy about clearing the cache outside of the indexd client though.

# If running in dev mode, don't call indexd
if self.url is None or self.bulk_url is None:
    return
print(self.url, self.bulk_url)
@fiendish (Contributor) commented on Oct 10, 2018:
print?

@dankolbman force-pushed the indexd-batch branch 2 times, most recently from cbd9207 to d010289, on October 11, 2018
indexd.prefetch(dids)

indexd.clear_cache()
prefetch_indexd(after)
@fiendish (Contributor) commented on Oct 11, 2018:
If the indexd bulk fetch were to happen after the Pagination object is initialized instead of before, then you could also speed up empty returns by not checking indexd the first time when total is 0. Is that doable?

@dankolbman (Contributor, Author) replied:

Yeah, there are probably a couple of optimizations to be made around reducing the number of db queries here. It would involve adding some sort of branching to the Pagination object, though, and I'd prefer to keep it simple since it's generalized to all entities at the moment.

Right now, if the Pagination object were constructed first, it would result in the old behavior of constructing each object with its own request to indexd. Note that if total=0, indexd won't actually be called, as the query for gfs will return empty.

@fiendish (Contributor) commented on Oct 11, 2018:
Got it. All clear from me, then.

@fiendish (Contributor) commented on Oct 11, 2018:
I bet the timing numbers are even better now.

@@ -101,25 +102,41 @@ def indexd_pagination(q, after, limit):

:returns: A Pagination object
"""
def prefetch_indexd(after):
@znatty22 (Member) commented:
Hmm, indexd_pagination is confusing to me now. What I would think you want is:

  1. Execute the query to get all of the objs for the page like you normally would with Pagination - without executing the request to indexd per instantiation of an indexd model.
  2. Collect the dids from the objs in the page of results
  3. Send the bulk load request out to indexd to get the page of indexd docs
  4. Iterate over the indexd docs and merge with the query results from step 1
  5. Return the results

I realize you'd have to refactor (maybe remove merge_indexd from the constructor) and decouple things quite a bit (separate the indexd GET request from the actual merging of an indexd doc with an indexd model instance). So maybe that's why you didn't do it this way.

But if you were able to implement the above, then you could probably also get rid of the entire while loop that checks for deleted files, right? You would just be able to return the page right away.
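A sketch of what steps 1 through 5 could look like; bulk_get, merge_doc, and latest_did are hypothetical names for illustration, not the existing client API:

    def indexd_pagination(q, after, limit):
        # 1. Run the page query without per-instance requests to indexd
        pager = Pagination(q, after, limit)
        # 2. Collect the dids from the objs in the page of results
        dids = [obj.latest_did for obj in pager.items]
        # 3. Send one bulk request to indexd for the whole page of docs
        docs = {doc['did']: doc for doc in indexd.bulk_get(dids)}
        # 4. Merge each indexd doc with its query result from step 1
        for obj in pager.items:
            merge_doc(obj, docs.get(obj.latest_did))
        # 5. Return the merged page directly (no deleted-file while loop)
        return pager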

@dankolbman (Contributor, Author) replied:
That's true, but it would require decoupling the indexd property loading on construction of the object. Perhaps you could load the indexd properties only when you attempt to access them, but I'm not sure if that is possible.
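The lazy loading floated here could look like a cached property that defers the indexd request until a doc field is first touched. A hypothetical sketch, with none of these names taken from the codebase:

    class IndexdLazyMixin:
        # Load the indexd doc on first access instead of in the constructor
        _indexd_doc = None

        @property
        def indexd_doc(self):
            if self._indexd_doc is None:
                # indexd is assumed to be a module-level client instance
                self._indexd_doc = indexd.get(self.latest_did)
            return self._indexd_doc

        @property
        def size(self):
            return self.indexd_doc.get('size')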

# Number of results needed to fulfill the original limit
remain = limit - len(keep)
pager = Pagination(q, next_after, remain)

for st in pager.items:
    if hasattr(st, 'was_deleted') and st.was_deleted:
@znatty22 (Member) commented:
Isn't this never going to happen now? Before the loop, you first populate the indexd page cache. Then you construct the Pagination obj, which results in a bunch of calls to indexd.get() which results in a bunch of indexd page cache lookups which means the requests to indexd never go out. So then you won't know if a file was deleted?

@dankolbman (Contributor, Author) replied:
I believe this will still happen because merge_indexd() is still called for every object in the page; it just reads from the cache when available. If the doc isn't in the cache, it will attempt to get it from indexd, in which case indexd will return a 404 and the object will be marked was_deleted.
Will look more closely at this to confirm.
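The behavior described would read roughly like this, sketched here as a client method; the cache lookup and 404 handling are inferred from this thread, not copied from the code:

    import requests

    class IndexdClient:
        # url and page_cache as in the client sketch near the top of this PR

        def merge_indexd(self, entity):
            # Prefer the prefetched page cache over a network round trip
            doc = self.page_cache.get(entity.latest_did)
            if doc is None:
                # Cache miss: fall back to one GET per did
                resp = requests.get(self.url + entity.latest_did)
                if resp.status_code == 404:
                    # Doc was deleted in indexd; flag it for the pager loop
                    entity.was_deleted = True
                    return entity
                doc = resp.json()
            for field in ('size', 'urls', 'hashes'):
                setattr(entity, field, doc.get(field))
            return entity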

@fiendish (Contributor) commented:
Is this going to be finalized?

@dankolbman (Contributor, Author) commented:
@fiendish this is done from my POV. Waiting for more feedback from @znatty22 on whether it needs to be refactored to be more understandable.

@znatty22 (Member) commented on Dec 3, 2018:

I think it could be refactored, but it doesn't need to be. I think the last comment I made still needs to be addressed, though. @dankolbman was going to confirm:

I believe this will still happen because merge_indexd() is still called for every object in the page; it just reads from the cache when available. If the doc isn't in the cache, it will attempt to get it from indexd, in which case indexd will return a 404 and the object will be marked was_deleted.
Will look more closely at this to confirm.

@baileyckelly (Contributor) commented:

Moving to backlog.

Labels: feature (New functionality), refactor (Something needs to be done better)
Development: successfully merging this pull request may close "Bulk request documents from indexd"
4 participants