Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EXP] add a prefetch linear search function to Index #1371

Merged
merged 154 commits into from
Apr 22, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Mar 6, 2021

NOTE: PR into #1370.

This PR adds:

See comment on #1370 for motivation.

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@ctb
Copy link
Contributor Author

ctb commented Mar 6, 2021

Fun fact!! When turning sourmash gather --prefetch on by default, almost all of the tests pass!

The two that don't pass are -

  • test_gather_csv, because I don't properly track where the signature came from (ref Would a "Directory" Index be useful? #810 (comment))
  • gather_query_downsample because, unlike in normal gather, I don't downsample the query appropriately with prefetch. This is actually questionable behavior to my thinking. I added a new test that checks for what I think is the right behavior, and left this one failing on this PR for now.

@ctb ctb mentioned this pull request Mar 7, 2021
7 tasks
@ctb
Copy link
Contributor Author

ctb commented Mar 9, 2021

Legend has it that @luizirber will soon come place a comment on here about how prefetch(...) is Index.find with search_containment as the search function.

This appears to be accurate, with one exception: the find(...) function is currently not a generator, but prefetch is. I change find to a generator in 70168f1, and will refactor accordingly.

@luizirber
Copy link
Member

Legend has it that @luizirber will soon come place a comment on here about how prefetch(...) is Index.find with search_containment as the search function.

This appears to be accurate, with one exception: the find(...) function is currently not a generator, but prefetch is. I change find to a generator in 70168f1, and will refactor accordingly.

I not sure they are the same same, but it sure feels like prefetch, Index.find, Index.filter and Index.select are very similar. Can we implement them in terms of each other in the base Index class? This way it is available for each implementation to override with a more efficient version, but we avoid having to implement all of them for each index implementation.

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2021

I not sure they are the same same, but it sure feels like prefetch, Index.find, Index.filter and Index.select are very similar. Can we implement them in terms of each other in the base Index class? This way it is available for each implementation to override with a more efficient version, but we avoid having to implement all of them for each index implementation.

absolutely agree (I assume you mean prefetch/find and filter/select).

ctb and others added 4 commits April 22, 2021 02:17
* have the 'find' function for SBTs return signatures
* fix majority of tests
* split find and _find_nodes to take different kinds of functions
* redo 'find' on index
* refactor lca_db to use new find
* refactor SBT to use new find
* refactor out common code
* use 'passes' properly
* adjust tree downsampling for regular minhashes, too
* remove now-unused search functions in sbtmh
* refactor categorize to use new find
* fix jaccard calculation in sbt
* check for compatibility of search fn and query signature
* switch tests over to jaccard similarity, not containment
* remove test for unimplemented LCA_Database.find method
* document threshold change; update test
* refuse to run abund signatures
* flatten sigs internally for gather
* reinflate abundances for saving
* fix problem where sbt indices coudl be created with abund signatures
* split flat and abund search
* make ignore_abundance work again for categorize
* turn off best-only, since it triggers on self-hits.
* add test: 'sourmash index' flattens sigs
* location is now a property
* move search code into search.py
* remove redundant scaled checking code
* best-only now works properly for two tests
* 'fix' tests by removing v1 and v2 SBT compatibility
* simplify downsampling code
* require keyword args in MinHash.downsample(...)
* fix test to use proper downsampling, reverse order to match scaled
* flatten subject MinHash, too
* add IndexSearchResult namedtuple for search and gather results
* add more tests for Index classes
* add tests for subj & query num downsampling
* tests for Index.search_abund
* refactor make_jaccard_search_query; start tests
* test collect, best_only
* deal with status == None on SystemExit
* upgrade and simplify categorize
* fix abundance search in SBT for categorize
* add explicit test for incompatible num
* add simple tests for SBT load and search API
* allow arbitrary kwargs for LCA_DAtabase.find
* add testing of passthru-kwargs
* docstring updates
* better tests for gather --save-unassigned
* SBT search doesn't work on v1 and v2 SBTs b/c no min_n_below
* add intersection_and_union_size method to MinHash
* make flatten a no-op if track_abundance=False
* intersection_union_size in the FFI

Co-authored-by: Luiz Irber <[email protected]>
@ctb ctb merged commit 67e7954 into add/prefetch_cli Apr 22, 2021
@ctb ctb deleted the add/prefetch_index branch April 22, 2021 19:22
@ctb
Copy link
Contributor Author

ctb commented Apr 22, 2021

merged into #1370.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants