[MRG] re-implement the actual gather protocol with a cleaner interface. #1489

ctb · 2021-04-29T12:52:06Z

Note: This is a PR into #1370; review & merge can wait until that PR is merged.

This PR trials an actual gather protocol at the class level, as a potential alternative fix to #1263. See also comments here and above.

The core code is located here, which provides a CounterGather class that collects and prioritizes matches for gather and also supports cross-database gather.

The key piece of the refactoring is that CounterGather now provides two methods, peek(query) and consume(...). peek(query) returns the best containment result from this counter without modifying any internal state; consume(...) removes a match from the counter (the match may have come from a different CounterGather object).

Below is some code implementing multi-database gather that (when shoehorned into the current gather code :) passes all of the tests:

    def gather(self, query, threshold_bp):
        best_result = None
        best_intersect_mh = None

        # find the best score across multiple counters, without consuming
        for counter in self.counters:
            result = counter.peek(query.minhash, query.minhash.scaled,
                                  threshold_bp)
            if result:
                sr, intersect_mh = result

                if best_result is None or sr.score > best_result.score:
                    best_result = sr
                    best_intersect_mh = intersect_mh

        if best_result:
            # remove the best result from each counter
            for counter in self.counters:
                counter.consume(best_intersect_mh)

            # and done!
            return [best_result]
        return []
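
For readers who want to see the peek/consume split in isolation, here is a minimal, self-contained sketch. It is not the CounterGather code from this PR: `ToyCounterGather`, `toy_gather`, and the use of plain Python sets of integers in place of MinHash objects are all assumptions made for illustration, and thresholds and scaled handling are left out.

```python
from collections import Counter


class ToyCounterGather:
    """Toy stand-in for a gather counter: hashes are ints held in Python sets."""

    def __init__(self, query_hashes):
        self.orig_query = frozenset(query_hashes)
        self.overlaps = {}        # match name -> hashes still shared with the query
        self.counter = Counter()  # match name -> number of shared hashes left

    def add(self, name, match_hashes):
        overlap = self.orig_query & set(match_hashes)
        if overlap:
            self.overlaps[name] = set(overlap)
            self.counter[name] = len(overlap)

    def peek(self, cur_query):
        """Return (name, intersection) for the best match; changes no state."""
        cur_query = set(cur_query)
        if not cur_query <= self.orig_query:
            raise ValueError("current query must be a subset of the original")
        if not self.counter:
            return None
        name, _ = self.counter.most_common(1)[0]
        return name, self.overlaps[name] & cur_query

    def consume(self, intersect):
        """Remove the hashes in `intersect` from every stored overlap."""
        for name in list(self.overlaps):
            self.overlaps[name] -= intersect
            if self.overlaps[name]:
                self.counter[name] = len(self.overlaps[name])
            else:
                del self.overlaps[name]
                del self.counter[name]


def toy_gather(query_hashes, counters):
    """Repeatedly peek at every counter, keep the best, consume it everywhere."""
    remaining = set(query_hashes)
    results = []
    while remaining:
        best = None
        for counter in counters:
            result = counter.peek(remaining)
            # skip empty intersections so the loop always makes progress
            if result and result[1] and (best is None or len(result[1]) > len(best[1])):
                best = result
        if best is None:
            break
        name, intersect = best
        for counter in counters:
            counter.consume(intersect)
        remaining -= intersect
        results.append((name, len(intersect)))
    return results


query = set(range(1, 11))
db1 = ToyCounterGather(query)
db1.add("A", {1, 2, 3, 4, 5, 6})
db1.add("B", {5, 6, 7})
db2 = ToyCounterGather(query)
db2.add("C", {6, 7, 8, 9})
db2.add("D", {100})  # no overlap with the query, so it is never stored

matches = toy_gather(query, [db1, db2])
print(matches)
```

The point of the split is that the best match can live in any counter, while its hashes have to be removed from all of them; peek makes the comparison safe, and consume keeps the counters consistent afterwards.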

TODO:

- test flatten and downsampling
- test consume in more detail, maybe

codecov · 2021-04-29T12:57:28Z

Codecov Report

Merging #1489 (af52c41) into add/prefetch_cli (903239b) will increase coverage by 0.23%.
The diff coverage is 100.00%.

@@                 Coverage Diff                  @@
##           add/prefetch_cli    #1489      +/-   ##
====================================================
+ Coverage             94.89%   95.12%   +0.23%     
====================================================
  Files                    99       99              
  Lines                 16662    17044     +382     
  Branches               1545     1562      +17     
====================================================
+ Hits                  15811    16213     +402     
+ Misses                  616      603      -13     
+ Partials                235      228       -7

| Flag | Coverage Δ | |
|---|---|---|
| python | `95.12% <100.00%> (+0.23%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| tests/test_sourmash.py | `99.72% <ø> (-0.01%)` | ⬇️ |
| src/sourmash/commands.py | `85.21% <100.00%> (+0.12%)` | ⬆️ |
| src/sourmash/index.py | `95.22% <100.00%> (+4.43%)` | ⬆️ |
| src/sourmash/search.py | `96.42% <100.00%> (+0.48%)` | ⬆️ |
| tests/test_index.py | `100.00% <100.00%> (ø)` | |
| src/sourmash/minhash.py | `93.33% <0.00%> (+0.63%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 903239b...af52c41.

ctb · 2021-04-29T13:24:45Z

aaaaaand just because, I implemented Index.counter_gather(...) and completely swapped out the gather implementation in favor of the new approach.

As a bonus (?), I may have discovered a bug in multigather.

ctb · 2021-04-29T13:31:05Z

Note: there are some potential cleanups and refactorings in gather_databases but I'm going to rest on my laurels at this point, and await feedback before continuing.

ctb · 2021-04-29T14:39:07Z

Some benchmarking for y'all using @bluegenes' new zipfile databases -- running on farm head node 😏

These use the code above, so prefetch and CounterGather and (because zipfiles) --linear.

tl;dr:

- searching the ~35k GTDB reps database takes ~4 minutes and 85 MB.
- searching the ~300k GTDB all database takes ~24 minutes and 4 GB.
- it's almost all user-space time, not I/O (presumably zip is doing some caching here?)

against ~35k GTDB reps database

% /usr/bin/time -v sourmash gather signatures/7a65a45d2739f50e5ecf920825482629.sig gtdb-r202.genomic-reps.k31.zip

== This is sourmash version 4.0.1.dev37+gad03e1ec. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: GCF_002950215.1 Shigella flexn... (k=31, DNA)
loaded 1 databases.

Using EXPERIMENTAL feature: prefetch enabled!

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.4 Mbp      100.0%  100.0%       1.1    GCF_002950215.1 Shigella flexneri 2a ...

found 1 matches total;
the recovered matches hit 100.0% of the query

        Command being timed: "sourmash gather signatures/7a65a45d2739f50e5ecf920825482629.sig gtdb-r202.genomic-reps.k31.zip"
        User time (seconds): 230.93
        System time (seconds): 2.18
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:51.08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 85020
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 19
        Minor (reclaiming a frame) page faults: 33958
        Voluntary context switches: 1474
        Involuntary context switches: 301629
        Swaps: 0
        File system inputs: 2725792
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

against ~300k GTDB all database

% /usr/bin/time -v sourmash gather signatures/7a65a45d2739f50e5ecf920825482629.sig gtdb-r202.genomic.k31.zip

== This is sourmash version 4.0.1.dev37+gad03e1ec. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: GCF_002950215.1 Shigella flexn... (k=31, DNA)
loaded 1 databases.

Using EXPERIMENTAL feature: prefetch enabled!

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.4 Mbp      100.0%  100.0%       1.1    GCF_002950215.1 Shigella flexneri 2a ...

found 1 matches total;
the recovered matches hit 100.0% of the query

        Command being timed: "sourmash gather signatures/7a65a45d2739f50e5ecf920825482629.sig gtdb-r202.genomic.k31.zip"
        User time (seconds): 1414.67
        System time (seconds): 14.18
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 23:46.91
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4016316
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 20
        Minor (reclaiming a frame) page faults: 1611759
        Voluntary context switches: 2750
        Involuntary context switches: 546447
        Swaps: 0
        File system inputs: 16377032
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
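
The tl;dr figures can be re-derived from the two `/usr/bin/time -v` reports above. The sketch below is only arithmetic on the numbers already shown; the helper names `parse_elapsed` and `summarize` are made up for this example, and dividing kbytes by 1000 to get MB is an assumption chosen to match the rounding in the tl;dr.

```python
def parse_elapsed(text):
    """Convert an 'h:mm:ss' or 'm:ss' elapsed-time string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def summarize(user_s, system_s, elapsed, max_rss_kb):
    """Reduce one timing report to the three quantities quoted in the tl;dr."""
    return {
        "elapsed_min": parse_elapsed(elapsed) / 60,
        # share of CPU time spent in user space rather than in the kernel
        "user_fraction": user_s / (user_s + system_s),
        # assumption: 1 MB = 1000 kbytes, matching the tl;dr rounding
        "max_rss_mb": max_rss_kb / 1000,
    }


# values copied from the two reports above
reps = summarize(230.93, 2.18, "3:51.08", 85020)
full = summarize(1414.67, 14.18, "23:46.91", 4016316)

for label, run in (("reps", reps), ("all", full)):
    print(f"{label}: {run['elapsed_min']:.1f} min, "
          f"{run['user_fraction']:.1%} user time, "
          f"{run['max_rss_mb']:.0f} MB max RSS")
```

In both runs user time is more than 99% of total CPU time, which is the basis for the "almost all user space" remark.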

luizirber · 2021-04-29T16:46:50Z

Nice!

A question about the usage of gather: We always pass a query argument, but the current use case doesn't modify the query (other than removing the best match). Because of this we can "optimize" the counter, which can be calculated only once and updated (because the updated query is always a subset of the original query).

Are we trying to cover use cases where the query can change in other ways, especially when it becomes something that is not a subset of the original query anymore? Technically this is supported with the current API (even tho no one does it), but more broadly it precludes the optimization I described (because the counter might be inconsistent and need to be recalculated).

ctb · 2021-04-29T16:53:54Z

> Nice!

👍

> A question about the usage of gather: We always pass a query argument, but the current use case doesn't modify the query (other than removing the best match). Because of this we can "optimize" the counter, which can be calculated only once and updated (because the updated query is always a subset of the original query).

Yes, and I should have an assert statement in there to make sure it's a subset of the original too..

In various refactorings (there were several that you didn't see 😆 ), I made use of the query in various ways, but I think we might be back to only making use of scaled, which can change (upwards), I think? It's not entirely clear to me even after staring at the code a whole bunch over the last few days. I would want to write a bunch of tests before committing to anything!

I like your idea of optimizing the query in that way.

> Are we trying to cover use cases where the query can change in other ways, especially when it becomes something that is not a subset of the original query anymore? Technically this is supported with the current API (even tho no one does it), but more broadly it precludes the optimization I described (because the counter might be inconsistent and need to be recalculated).

Nope! I am not trying to cover use cases beyond what is already there, and definitely want to keep it as subset of original query (which I think makes perfect sense given min-set-cov understanding).

I actually want to tighten restrictions more, but this code is tricky and it took a while to get it all working right. If and when you think we should move forward with this approach, I can start writing clearer specifications and simpler tests on the Python side to tie the code down more. So, just say the word :).
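
To make the subset argument concrete, here is a small sketch of why the counter only needs to be computed once when every later query is a subset of the original. It uses plain Python sets and dicts rather than sourmash objects; `build_counts`, `update_counts`, and the toy `database` are hypothetical names introduced for this illustration, and scaled changes are not modeled.

```python
def build_counts(query, database):
    """Overlap size between the query and every entry, computed from scratch."""
    return {name: len(query & hashes) for name, hashes in database.items()}


def update_counts(counts, original_query, new_query, database):
    """Update existing counts for a shrunken query instead of recomputing them."""
    if not new_query <= original_query:
        # outside the subset case the stored counts say nothing about the
        # added hashes, so the only safe option is a full recalculation
        raise ValueError("new query is not a subset of the original")
    removed = original_query - new_query
    return {
        name: counts[name] - len(removed & hashes)
        for name, hashes in database.items()
    }


database = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {9}}
original_query = {1, 2, 3, 4, 5, 6}
counts = build_counts(original_query, database)

# after the best match ("A") is removed, the query shrinks to a subset
new_query = original_query - database["A"]
updated = update_counts(counts, original_query, new_query, database)
print(updated)
```

The update touches only the removed hashes, so its cost depends on the size of the best match rather than on the whole query, and the subset check plays the role of the assert statement mentioned above.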

ctb · 2021-04-30T14:39:40Z

ok @luizirber I think the tests are pretty thorough now, although I'd like to do a bit more white-box testing of consume if I can figure out a good way to do it.

ctb · 2021-05-01T15:37:04Z

done. planning to wait to ask for detailed review/merge until #1370 is in, however; off the top of my head, I don't see any point in making #1370 a bigger PR :).

luizirber

This looks great =]

I think it can be merged into #1370 now, even if it makes it longer to review again.

…etch` functionality. (#1370) * more refactor - filename stuff * add 'location' to SBT objects * finish removing filename * fix prefetch after merging in #1373 * implement a CounterGatherIndex * remove sort * update counter logic to remove proper intersection * make 'find' a generator * remove comment * begin refactoring 'categorize' * have the 'find' function for SBTs return signatures * fix majority of tests * comment & then fix test * torture the tests into working * split find and _find_nodes to take different kinds of functions * redo 'find' on index * refactor lca_db to use new find * refactor SBT to use new find * comment/cleanup * refactor out common code * fix up gather * use 'passes' properly * attempted cleanup * minor fixes * get a start on correct downsampling * adjust tree downsampling for regular minhashes, too * remove now-unused search functions in sbtmh * refactor categorize to use new find * cleanup and removal * remove redundant code in lca_db * remove redundant code in SBT * add notes * remove more unused code * refactor most of the test_sbt tests * fix one minor issue * fix jaccard calculation in sbt * check for compatibility of search fn and query signature * switch tests over to jaccard similarity, not containment * fix test * remove test for unimplemented LCA_Database.find method * document threshold change; update test * refuse to run abund signatures * flatten sigs internally for gather * reinflate abundances for saving * fix problem where sbt indices coudl be created with abund signatures * more * split flat and abund search * make ignore_abundance work again for categorize * turn off best-only, since it triggers on self-hits. 
* add test: 'sourmash index' flattens sigs * add note about something to test * fix typo; still broken tho * location is now a property * move search code into search.py * remove redundant scaled checking code * best-only now works properly for two tests * 'fix' tests by removing v1 and v2 SBT compatibility * simplify (?) downsampling code * require keyword args in MinHash.downsample(...) * fix bug with downsample * require keyword args in MinHash.downsample(...) * fix test to use proper downsampling, reverse order to match scaled * add test for revealed bug * remove unnecessary comment * flatten subject MinHash, too * add testme comment * clean up sbt find * clean up lca find * add IndexSearchResult namedtuple for search and gather results * add more tests for Index classes * add tests for subj & query num downsampling * tests for Index.search_abund * refactor a bit * refactor make_jaccard_search_query; start tests * even more tests * test collect, best_only * more search tests * remove unnec space * add minor comment * deal with status == None on SystemExit * upgrade and simplify categorize * restore test * merge * fix abundance search in SBT for categorize * code cleanup and refactoring; check for proper error messages * add explicit test for incompatible num * refactor MinHash.downsample * deal with status == None on SystemExit * fix test * fix comment mispelling * properly pass kwargs; fix search_sbt_index * add simple tests for SBT load and search API * allow arbitrary kwargs for LCA_DAtabase.find * add testing of passthru-kwargs * re-enable test * add notes to update docstrings * docstring updates * fix test * fix location reporting in prefetch * fix prefetch location by fixing MultiIndex * temporary prefetch_gather intervention * 'gather' only returns best match * turn prefetch on by default, for now * better tests for gather --save-unassigned * remove unused print * remove unnecessary check-me comment * clear out docstring * SBT search doesn't work on v1 
and v2 SBTs b/c no min_n_below * start adding tests * test some basic prefetch stuff * update index for prefetch * add fairly thorough tests * fix my dumb mistake with gather * simplify, refactor, fix * fix remaining tests * propogate ValueErrors better * fix tests * flatten prefetch queries * fix for genome-grist alpha test * fix threshold bugarooni * fix gather/prefetch interactions * fix sourmash prefetch return value * minor fixes * pay proper attention to threshold * cleanup and refactoring * remove unnecessary 'scaled' * minor cleanup * added LazyLinearLindex and prefetch --linear * fix abundance problem * save matches to a directory * test for saving matches to a directory * add a flexible progressive signature output class * add tests for .sig.gz and .zip outputs * update save_signatures code; add tests; use in gather and search too * update comment * cleanup and refactor of SaveSignaturesToLocation code * docstrings & cleanup * add 'run' and 'runtmp' test fixtures * remove unnecessary track_abundance fixture call * restore original; * linear and prefetch fixtures + runtmp * fix use of runtmp * copy over SaveSignaturesToLocation code from other branch * docs for sourmash prefetch * more doc * minor edits * Re-implement the actual gather protocol with a cleaner interface. 
(#1489) * initial refactor of CounterGather stuff * refactor into peek and consume * move next method over to query specific class * replace gather implementation with new CounterGather * many more tests for CounterGather * remove scaled arg from peek * open-box test for counter internal data structures * add num query & subj tests * add repr; add tests; support stdout * refactor signature saving to use new sourmash_args collection saving * specify utf-8 encoding for output * add flexible output to compute/sketch * add test to trigger rust panic * test search --save-matches * add --save-prefetch to sourmash gather * remove --no-prefetch option :) * added --save-prefetch functionality * add back a mostly-functioning --no-prefetch argument :) * add --no-prefetch back in * check for JSON in first byte of LCA DB file * start adding linear tests * use fixtures to test prefetch and linear more thoroughly * comments, etc * upgrade docs for --linear and --prefetch * 'fix' issue and test * fix a last test ;) * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <[email protected]> * Update src/sourmash/cli/sig/rename.py Co-authored-by: Tessa Pierce Ward <[email protected]> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <[email protected]> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <[email protected]> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <[email protected]> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <[email protected]> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <[email protected]> * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <[email protected]> * write tests for LazyLinearIndex * add some basic prefetch tests * properly test linear! 
* add more tests for LazyLinearIndex * test zipfile bool * remove unnecessary try/except; comment * fix signatures() call * fix --prefetch snafu; doc * do not overwrite signature even if duplicate md5sum (#1497) * try adding loc to return values from Index.find * made use of new IndexSearchResult.find throughout * adjust note * provide signatures_with_location on all Index objects * cleanup and fix * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <[email protected]> * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <[email protected]> * fix bug around --save-prefetch with multiple databases * comment/doc minor updates Co-authored-by: Luiz Irber <[email protected]> Co-authored-by: Tessa Pierce Ward <[email protected]>

…thon set API. (#1512) * move away from Python sets to MinHash objects * return intersect_mh from _find_best * put _subtract_and_downsample inline * clean up and remove old code * remove max_hash * more cleanup Co-authored-by: Luiz Irber <[email protected]> Co-authored-by: Tessa Pierce Ward <[email protected]>

* initial trial implementation of ImmutableMinHash * fix tests * provide our own pickle for ImmutableMinHash * ok, a few more plcaes to change. * rename to FrozenMinHash per luiz * finish renaming, add some tests * thanks, I hate the old behavior * copy.copy is no longer needed * docs and an explicit 'frozen' method * switch to using 'to_frozen' and 'to_mutable' Co-authored-by: Luiz Irber <[email protected]> Co-authored-by: Tessa Pierce Ward <[email protected]>

ctb added 11 commits April 28, 2021 13:31

initial refactor of CounterGather stuff

7337563

fix up code a bit

98de5e6

Merge branch 'add/prefetch_cli' into add/prefetch_cli_counter

64f2bed

cleanup and refactor

e4ff7a8

factor back in the explicit current query

125dfbd

refactor into peek and consume

f7e430e

more refactor into peek & consume

b7582d1

move next method over to query specific class

062b0ae

refactor using peek etc.

28c354b

commenting and cleanup

9c108d2

add extra if stmt

e6e0469

This was referenced Apr 29, 2021

[MRG] refactor gather functionality for speed & modularity; provide prefetch functionality. #1370

Merged

[WIP] Index.gather is not doing gather? #1263

Closed

replace gather implementation with new CounterGather

9dcf08f

comments and docstrings

b449738

ctb added 4 commits April 29, 2021 16:53

some results tests for CounterGather

1a4fcc3

many more tests for CounterGather

dce803a

tests for abund and scaled

93b9488

remove scaled arg from peek

b813758

open-box test for counter internal data structures

0f4a33a

ctb changed the title ~~[WIP] re-implement the actual gather protocol with a cleaner interface.~~ [MRG] re-implement the actual gather protocol with a cleaner interface. May 1, 2021

add num query & subj tests

af52c41

luizirber self-requested a review May 2, 2021 01:58

luizirber approved these changes May 2, 2021

luizirber merged commit 475a515 into add/prefetch_cli May 2, 2021

luizirber deleted the add/prefetch_cli_counter branch May 2, 2021 02:00

ctb mentioned this pull request Jun 20, 2021

add multipeek functionality to report equal gather matches #1615

Open

ctb mentioned this pull request Mar 12, 2022

can we optimize gather on LCA and SBT databases by saving some state? #930

Closed



ctb commented Apr 29, 2021

against ~35k GTDB reps database

against ~300k GTDB all database
