
⚡️ Indexd bulk fetching #448

Closed · wants to merge 2 commits (from the indexd-batch branch)

Conversation

@dankolbman (Contributor) commented on Sep 21, 2018:

The latest gen3 needs to be released into production before this is released.

Using local indexd (single request per entity in result):

Getting 15719 results from /genomic-files in size of 100
Expected requests: 157
150/157
Total Requests: 158
Total Elapsed time: 232.80341982841492
Average Resp time: 1.4707968987341766

Using page prefetching:

Getting 15719 results from /genomic-files in size of 100
Expected requests: 157
150/157
Total Requests: 158
Total Elapsed time: 32.46433401107788
Average Resp time: 0.2029419873417721

Fixes #442
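For readers outside the diff: the speedup comes from replacing one indexd request per entity with a single bulk request per page, cached on the client. A minimal sketch of that idea, assuming a hypothetical bulk endpoint that accepts a JSON list of dids; url, bulk_url, and page_cache mirror names that appear in the diff hunks below, but the request and response shapes are guesses:

    import requests

    class IndexdClient:
        """Sketch: serve doc lookups from a per-page prefetch cache."""

        def __init__(self, url=None, bulk_url=None):
            self.url = url            # base URL for single-doc GETs
            self.bulk_url = bulk_url  # hypothetical bulk-fetch endpoint
            self.page_cache = {}

        def clear_cache(self):
            self.page_cache = {}

        def prefetch(self, dids):
            # Load a whole page of indexd docs with one bulk request
            if self.url is None or self.bulk_url is None:
                return  # dev mode: indexd not configured
            resp = requests.post(self.bulk_url, json=dids)
            for doc in resp.json():
                self.page_cache[doc['did']] = doc

        def get(self, did):
            # Serve from the prefetch cache; fall back to one GET per did
            if did in self.page_cache:
                return self.page_cache[did]
            return requests.get(self.url + did).json()

With ~100 entities per page, each page then costs one bulk POST instead of ~100 GETs, which is consistent with the roughly 7x drop in elapsed time above.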

@dankolbman added the feature (New functionality) and refactor (Something needs to be done better) labels on Sep 21, 2018
@dankolbman self-assigned this on Sep 21, 2018
@dankolbman force-pushed the indexd-batch branch 6 times, most recently from f2da078 to 895e10d, on September 24, 2018
@dankolbman requested a review from @fiendish on October 10, 2018
dids = [gf[0] for gf in gfs]
indexd.prefetch(dids)

prefetch_indexd()
@fiendish (Contributor) commented on Oct 10, 2018:
This looks like it always prefetches twice without needing to. I think you only need the one down inside of while (pager.total > 0 and refresh)?

@dankolbman (Contributor, Author) replied:
Yeah, I think this is unnecessarily fetching twice, though the first call still has to be there so that it fetches before pulling in from the database when the Pagination object is created.
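For context, a rough reconstruction of the control flow being discussed, pieced together from the diff hunks quoted in this review; the refresh/keep bookkeeping and the import path are assumptions, while the hunk lines quoted in this PR are kept verbatim:

    from dataservice.api.common.pagination import Pagination  # assumed path

    def indexd_pagination(q, after, limit):
        def prefetch_indexd(after):
            ...  # bulk-load the indexd docs for the upcoming page

        prefetch_indexd(after)               # first prefetch: must run before
        pager = Pagination(q, after, limit)  # Pagination merges indexd docs
        keep = [st for st in pager.items
                if not (hasattr(st, 'was_deleted') and st.was_deleted)]
        refresh = len(keep) < limit          # page short due to deleted files

        while pager.total > 0 and refresh:
            next_after = keep[-1].created_at if len(keep) > 0 else after
            remain = limit - len(keep)       # results needed to fill the limit
            prefetch_indexd(next_after)      # the per-iteration prefetch
            pager = Pagination(q, next_after, remain)
            for st in pager.items:
                if hasattr(st, 'was_deleted') and st.was_deleted:
                    continue
                keep.append(st)
            refresh = len(keep) < limit

        pager.items = keep[:limit]
        return pager

The first prefetch_indexd(after) is the call dankolbman refers to: it has to run before the Pagination constructor so the page cache is warm when each model merges its indexd doc.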

@@ -112,6 +124,8 @@ def indexd_pagination(q, after, limit):
next_after = keep[-1].created_at if len(keep) > 0 else after
# Number of results needed to fulfill the original limit
remain = limit - len(keep)

prefetch_indexd()
@fiendish (Contributor) commented:
It looks like this might not be prefetching the right dids here because prefetch_indexd absorbs the value of after given to indexd_pagination and doesn't track next_after
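The later hunks show the fix: thread the cursor through as a parameter and call prefetch_indexd(next_after) inside the loop. A sketch of the parameterized helper, closing over q and limit from indexd_pagination; the GenomicFile model, its latest_did column, the module-level indexd client, and the exact query shape are all assumptions:

    def prefetch_indexd(after):
        # Bulk-load indexd docs for the page that starts after `after`
        gfs = (q.filter(GenomicFile.created_at > after)
                .with_entities(GenomicFile.latest_did)
                .limit(limit)
                .all())
        dids = [gf[0] for gf in gfs]  # with_entities rows are one-column tuples
        indexd.prefetch(dids)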

if self.url is None or self.bulk_url is None:
    return
print(self.url, self.bulk_url)
self.page_cache = {}
@fiendish (Contributor) commented:
Clearing the cache here means that the looped calls to indexd.prefetch inside of indexd_pagination will step on each other. Is that a problem? It seems like it would be better to clear the cache at the beginning of indexd_pagination instead.

@dankolbman (Contributor, Author) replied:
Yeah, that does seem to be an issue. I feel a little uneasy about clearing the cache outside of the indexd client though.

# If running in dev mode, don't call indexd
if self.url is None or self.bulk_url is None:
    return
print(self.url, self.bulk_url)
@fiendish (Contributor) commented on Oct 10, 2018:
print?

@dankolbman force-pushed the indexd-batch branch 2 times, most recently from cbd9207 to d010289, on October 11, 2018
indexd.prefetch(dids)

indexd.clear_cache()
prefetch_indexd(after)
@fiendish (Contributor) commented on Oct 11, 2018:
If the indexd bulk fetch were to happen after the Pagination object is initialized instead of before, then you could also speed up empty returns by not checking indexd the first time when total is 0. Is that doable?

@dankolbman (Contributor, Author) replied:

Yeah, there are probably a couple of optimizations to be made around reducing the number of db queries here. It would involve adding some sort of branching to the Pagination object, though, and I'd prefer to keep it simple since it's generalized to all entities at the moment.

Right now, if the Pagination object were constructed first, it would result in the old behavior of constructing each object with its own request to indexd. Note that if total=0, indexd won't actually be called, as the query for gfs will return empty.

@fiendish (Contributor) commented on Oct 11, 2018:
Got it. All clear from me, then.

@fiendish (Contributor) commented on Oct 11, 2018:
I bet the timing numbers are even better now.

@@ -101,25 +102,41 @@ def indexd_pagination(q, after, limit):

:returns: A Pagination object
"""
def prefetch_indexd(after):
@znatty22 (Member) commented:
Hmm, indexd_pagination is confusing to me now. What I would think you want is:

  1. Execute the query to get all of the objs for the page like you normally would with Pagination - without executing the request to indexd per instantiation of an indexd model.
  2. Collect the dids from the objs in the page of results
  3. Send the bulk load request out to indexd to get the page of indexd docs
  4. Iterate over the indexd docs and merge with the query results from step 1
  5. Return the results

I realize you'd have to refactor (maybe remove merge_indexd from the constructor) and decouple things quite a bit (separate the indexd GET request from the actual merging of an indexd doc with an indexd model instance). So maybe that's why you didn't do it this way.

But if you were able to implement the above, then you could probably also get rid of the entire while loop that checks for deleted files, right? You would just be able to return the page right away.
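A sketch of what steps 1 through 5 could look like; bulk_get, merge_doc, and latest_did are hypothetical names for illustration, not the existing client API:

    def indexd_pagination(q, after, limit):
        # 1. Run the page query without per-instance requests to indexd
        pager = Pagination(q, after, limit)
        # 2. Collect the dids from the objs in the page of results
        dids = [obj.latest_did for obj in pager.items]
        # 3. Send one bulk request to indexd for the whole page of docs
        docs = {doc['did']: doc for doc in indexd.bulk_get(dids)}
        # 4. Merge each indexd doc with its query result from step 1
        for obj in pager.items:
            merge_doc(obj, docs.get(obj.latest_did))
        # 5. Return the merged page directly (no deleted-file while loop)
        return pager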

@dankolbman (Contributor, Author) replied:
That's true, but it would require decoupling the indexd property loading on construction of the object. Perhaps you could load the indexd properties only when you attempt to access them, but I'm not sure if that is possible.
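The lazy loading floated here could look like a cached property that defers the indexd request until a doc field is first touched. A hypothetical sketch, with none of these names taken from the codebase:

    class IndexdLazyMixin:
        # Load the indexd doc on first access instead of in the constructor
        _indexd_doc = None

        @property
        def indexd_doc(self):
            if self._indexd_doc is None:
                # indexd is assumed to be a module-level client instance
                self._indexd_doc = indexd.get(self.latest_did)
            return self._indexd_doc

        @property
        def size(self):
            return self.indexd_doc.get('size')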

# Number of results needed to fulfill the original limit
remain = limit - len(keep)
pager = Pagination(q, next_after, remain)

for st in pager.items:
    if hasattr(st, 'was_deleted') and st.was_deleted:
@znatty22 (Member) commented:
Isn't this never going to happen now? Before the loop, you first populate the indexd page cache. Then you construct the Pagination obj, which results in a bunch of calls to indexd.get() which results in a bunch of indexd page cache lookups which means the requests to indexd never go out. So then you won't know if a file was deleted?

@dankolbman (Contributor, Author) replied:
I believe this will still happen because merge_indexd() is still called for every object in the page; it just reads from the cache when available. If the doc isn't in the cache, it will attempt to get it from indexd, in which case indexd will return a 404 and the object will be marked was_deleted.
Will look more closely at this to confirm.
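The behavior described would read roughly like this, sketched here as a client method; the cache lookup and 404 handling are inferred from this thread, not copied from the code:

    import requests

    class IndexdClient:
        # url and page_cache as in the client sketch near the top of this PR

        def merge_indexd(self, entity):
            # Prefer the prefetched page cache over a network round trip
            doc = self.page_cache.get(entity.latest_did)
            if doc is None:
                # Cache miss: fall back to one GET per did
                resp = requests.get(self.url + entity.latest_did)
                if resp.status_code == 404:
                    # Doc was deleted in indexd; flag it for the pager loop
                    entity.was_deleted = True
                    return entity
                doc = resp.json()
            for field in ('size', 'urls', 'hashes'):
                setattr(entity, field, doc.get(field))
            return entity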

@fiendish (Contributor) commented:
Is this going to be finalized?

@dankolbman (Contributor, Author) commented:
@fiendish this is done from my POV. Waiting for more feedback from @znatty22 on whether it needs to be refactored to be more understandable.

@znatty22 (Member) commented on Dec 3, 2018:

I think it could be refactored, but it doesn't need to be. I think the last comment I made still needs to be addressed, though. @dankolbman was going to confirm:

I believe this will still happen because merge_indexd() is still called for every object in the page; it just reads from the cache when available. If the doc isn't in the cache, it will attempt to get it from indexd, in which case indexd will return a 404 and the object will be marked was_deleted.
Will look more closely at this to confirm.

@baileyckelly (Contributor) commented:

Moving to backlog.

Labels: feature (New functionality), refactor (Something needs to be done better)
Development: successfully merging this pull request may close "Bulk request documents from indexd"
4 participants