[MRG] refactor gather functionality for speed & modularity; provide prefetch functionality. #1370

Merged 258 commits on May 10, 2021.
Commits
23eea6d
more refactor - filename stuff
ctb Mar 6, 2021
d7e3064
add 'location' to SBT objects
ctb Mar 6, 2021
e7a13a3
finish removing filename
ctb Mar 6, 2021
69f019e
Merge branch 'refactor/databases' into add/prefetch_index
ctb Mar 6, 2021
11251ab
fix prefetch after merging in #1373
ctb Mar 6, 2021
9bfa690
implement a CounterGatherIndex
ctb Mar 7, 2021
033a764
remove sort
ctb Mar 7, 2021
ff78a1b
update counter logic to remove proper intersection
ctb Mar 8, 2021
6b4be47
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Mar 9, 2021
70168f1
make 'find' a generator
ctb Mar 9, 2021
fb8ae9c
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Mar 9, 2021
6e2b2cf
Merge branch 'add/prefetch_cli' into add/prefetch_index
ctb Mar 9, 2021
6f30528
remove comment
ctb Mar 9, 2021
90f769a
Merge branch 'latest' into add/prefetch_cli
ctb Mar 9, 2021
d00cb4c
Merge branch 'add/prefetch_cli' into add/prefetch_index
ctb Mar 9, 2021
4c09f5b
begin refactoring 'categorize'
ctb Mar 12, 2021
af6fd84
have the 'find' function for SBTs return signatures
ctb Mar 12, 2021
8a92936
fix majority of tests
ctb Mar 12, 2021
5138e83
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Mar 12, 2021
f3ed42f
Merge branch 'add/prefetch_cli' into add/prefetch_index
ctb Mar 12, 2021
c4adabf
Merge branch 'latest' of github.com:dib-lab/sourmash into fix/sbt_find
ctb Mar 12, 2021
cdb4159
comment & then fix test
ctb Mar 12, 2021
a414624
torture the tests into working
ctb Mar 12, 2021
6f7d368
split find and _find_nodes to take different kinds of functions
ctb Mar 13, 2021
7b2f624
Merge branch 'fix/sbt_find' into refactor/categorize
ctb Mar 13, 2021
b5ab6d7
redo 'find' on index
ctb Mar 13, 2021
ed7d52b
refactor lca_db to use new find
ctb Mar 13, 2021
aec730e
refactor SBT to use new find
ctb Mar 13, 2021
590b3d6
comment/cleanup
ctb Mar 13, 2021
eb7d661
refactor out common code
ctb Mar 13, 2021
0639c3e
fix up gather
ctb Mar 13, 2021
a65c79b
use 'passes' properly
ctb Mar 13, 2021
02794ee
attempted cleanup
ctb Mar 13, 2021
f94e909
minor fixes
ctb Mar 13, 2021
c3a65ac
get a start on correct downsampling
ctb Mar 13, 2021
9054cb8
adjust tree downsampling for regular minhashes, too
ctb Mar 13, 2021
db740ec
remove now-unused search functions in sbtmh
ctb Mar 13, 2021
03a5e60
refactor categorize to use new find
ctb Mar 13, 2021
b3718dd
cleanup and removal
ctb Mar 13, 2021
e8e4702
remove redundant code in lca_db
ctb Mar 13, 2021
b40963c
remove redundant code in SBT
ctb Mar 13, 2021
055bd60
add notes
ctb Mar 13, 2021
2329009
remove more unused code
ctb Mar 13, 2021
e6d90f6
refactor most of the test_sbt tests
ctb Mar 13, 2021
2baa8c3
fix one minor issue
ctb Mar 13, 2021
0ec99ea
fix jaccard calculation in sbt
ctb Mar 13, 2021
c583a37
check for compatibility of search fn and query signature
ctb Mar 13, 2021
d565e67
switch tests over to jaccard similarity, not containment
ctb Mar 13, 2021
8eb43f7
fix test
ctb Mar 13, 2021
5c75e39
remove test for unimplemented LCA_Database.find method
ctb Mar 13, 2021
83ee16b
document threshold change; update test
ctb Mar 14, 2021
7bfa0e1
refuse to run abund signatures
ctb Mar 14, 2021
2c28568
flatten sigs internally for gather
ctb Mar 14, 2021
9adae36
reinflate abundances for saving
ctb Mar 14, 2021
c979b17
fix problem where sbt indices coudl be created with abund signatures
ctb Mar 14, 2021
0bf34cd
more
ctb Mar 15, 2021
3844b02
split flat and abund search
ctb Mar 16, 2021
f6fe0de
make ignore_abundance work again for categorize
ctb Mar 16, 2021
863e4de
turn off best-only, since it triggers on self-hits.
ctb Mar 16, 2021
731df73
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Mar 16, 2021
21e8867
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Mar 20, 2021
80c14c2
add test: 'sourmash index' flattens sigs
ctb Mar 20, 2021
138bd16
add note about something to test
ctb Mar 20, 2021
9dcf25b
Merge branch 'latest' into add/prefetch_cli
ctb Apr 3, 2021
d438f9c
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 3, 2021
e406a99
fix typo; still broken tho
ctb Apr 3, 2021
dc91322
Merge branch 'add/prefetch_cli' of github.com:dib-lab/sourmash into a…
ctb Apr 3, 2021
182ad62
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 4, 2021
74c925d
location is now a property
ctb Apr 4, 2021
87811a4
move search code into search.py
ctb Apr 4, 2021
45b1f5e
remove redundant scaled checking code
ctb Apr 4, 2021
7b76751
best-only now works properly for two tests
ctb Apr 4, 2021
2248b06
'fix' tests by removing v1 and v2 SBT compatibility
ctb Apr 4, 2021
0aa4bd2
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 9, 2021
66dc4a7
simplify (?) downsampling code
ctb Apr 9, 2021
b7a3ba2
require keyword args in MinHash.downsample(...)
ctb Apr 9, 2021
7d3885e
fix bug with downsample
ctb Apr 9, 2021
c686662
require keyword args in MinHash.downsample(...)
ctb Apr 9, 2021
39d13cc
fix test to use proper downsampling, reverse order to match scaled
ctb Apr 9, 2021
86e1f41
add test for revealed bug
ctb Apr 9, 2021
78aa70c
remove unnecessary comment
ctb Apr 9, 2021
d4b291a
Merge branch 'fix/downsample_kwargs' into refactor/index_find
ctb Apr 9, 2021
cb712c0
flatten subject MinHash, too
ctb Apr 9, 2021
ba7352e
add testme comment
ctb Apr 9, 2021
31d08e0
clean up sbt find
ctb Apr 9, 2021
9feda90
clean up lca find
ctb Apr 9, 2021
9b9d518
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 10, 2021
36cc35e
add IndexSearchResult namedtuple for search and gather results
ctb Apr 10, 2021
a6cd259
add more tests for Index classes
ctb Apr 10, 2021
54126ae
add tests for subj & query num downsampling
ctb Apr 10, 2021
16c464e
tests for Index.search_abund
ctb Apr 10, 2021
2e0bc9d
refactor a bit
ctb Apr 10, 2021
87ffe00
refactor make_jaccard_search_query; start tests
ctb Apr 10, 2021
1a4cfd4
even more tests
ctb Apr 10, 2021
184e541
test collect, best_only
ctb Apr 10, 2021
ebd5aac
more search tests
ctb Apr 10, 2021
430cb2e
remove unnec space
ctb Apr 10, 2021
b218540
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 11, 2021
cc2ec29
add minor comment
ctb Apr 11, 2021
c2b4eda
deal with status == None on SystemExit
ctb Apr 11, 2021
1bda989
upgrade and simplify categorize
ctb Apr 11, 2021
a7f5306
restore test
ctb Apr 11, 2021
2db2586
merge
ctb Apr 11, 2021
8c84397
fix abundance search in SBT for categorize
ctb Apr 13, 2021
1c6a539
code cleanup and refactoring; check for proper error messages
ctb Apr 13, 2021
8af9187
add explicit test for incompatible num
ctb Apr 14, 2021
379743d
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 14, 2021
5b4b5ed
refactor MinHash.downsample
ctb Apr 14, 2021
1e70d07
deal with status == None on SystemExit
ctb Apr 11, 2021
495f0bf
fix test
ctb Apr 14, 2021
1660df5
fix comment mispelling
ctb Apr 14, 2021
77f6e0a
properly pass kwargs; fix search_sbt_index
ctb Apr 14, 2021
72639bd
add simple tests for SBT load and search API
ctb Apr 14, 2021
e916214
Merge branch 'refactor/minhash_downsample' into refactor/index_find
ctb Apr 14, 2021
a735445
Merge branch 'fix/sys_exit_none' into refactor/index_find
ctb Apr 14, 2021
922db44
Merge branch 'fix/search_sbt_index' into refactor/index_find
ctb Apr 14, 2021
5b8d83c
allow arbitrary kwargs for LCA_DAtabase.find
ctb Apr 14, 2021
8adc01c
add testing of passthru-kwargs
ctb Apr 15, 2021
f70af9c
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 15, 2021
b07c61d
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 15, 2021
d9c07ce
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 16, 2021
5b308bc
re-enable test
ctb Apr 16, 2021
02c04d6
add notes to update docstrings
ctb Apr 16, 2021
e4e542a
Merge branch 'refactor/index_find' into merge_find_and_prefetch
ctb Apr 16, 2021
c052319
Merge branch 'add/prefetch_index' into merge_find_and_prefetch
ctb Apr 16, 2021
db52ee7
docstring updates
ctb Apr 16, 2021
c50dcdb
fix test
ctb Apr 16, 2021
e4cfe97
Merge branch 'latest' into refactor/index_find
luizirber Apr 16, 2021
11b7486
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 16, 2021
b072090
Merge branch 'latest' of github.com:dib-lab/sourmash into merge_find_…
ctb Apr 17, 2021
9c6d368
Merge branch 'refactor/index_find' into merge_find_and_prefetch
ctb Apr 17, 2021
c067af1
fix location reporting in prefetch
ctb Apr 17, 2021
a4ed221
fix prefetch location by fixing MultiIndex
ctb Apr 17, 2021
e48588d
temporary prefetch_gather intervention
ctb Apr 17, 2021
96ca217
'gather' only returns best match
ctb Apr 17, 2021
c0b2735
turn prefetch on by default, for now
ctb Apr 17, 2021
637723b
Merge branch 'latest' into refactor/index_find
ctb Apr 17, 2021
7759314
better tests for gather --save-unassigned
ctb Apr 18, 2021
8376ce5
Merge branch 'refactor/index_find' of github.com:dib-lab/sourmash int…
ctb Apr 18, 2021
e877490
Merge branch 'refactor/index_find' into merge_find_and_prefetch
ctb Apr 18, 2021
423fff4
remove unused print
ctb Apr 18, 2021
593a907
remove unnecessary check-me comment
ctb Apr 19, 2021
4132162
clear out docstring
ctb Apr 19, 2021
23166df
SBT search doesn't work on v1 and v2 SBTs b/c no min_n_below
ctb Apr 19, 2021
c494032
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Apr 19, 2021
3cf42f0
start adding tests
ctb Apr 19, 2021
3a5901e
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…
ctb Apr 20, 2021
4c0362e
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Apr 21, 2021
18219ae
test some basic prefetch stuff
ctb Apr 21, 2021
50a94d4
Merge branch 'add/prefetch_cli' into merge_find_and_prefetch
ctb Apr 21, 2021
3ed4af0
update index for prefetch
ctb Apr 21, 2021
ba8beb6
add fairly thorough tests
ctb Apr 21, 2021
79e0166
Merge branch 'add/prefetch_cli' into merge_find_and_prefetch
ctb Apr 21, 2021
57467cd
fix my dumb mistake with gather
ctb Apr 21, 2021
06f5d03
Merge branch 'refactor/index_find' into merge_find_and_prefetch
ctb Apr 21, 2021
16f1ee2
Merge branch 'latest' of github.com:dib-lab/sourmash into merge_find_…
ctb Apr 22, 2021
98957b8
simplify, refactor, fix
ctb Apr 22, 2021
67e7954
fix remaining tests
ctb Apr 22, 2021
c9109a6
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Apr 23, 2021
3151ff5
propogate ValueErrors better
ctb Apr 23, 2021
634e84e
fix tests
ctb Apr 23, 2021
7852fa1
flatten prefetch queries
ctb Apr 24, 2021
808ae37
fix for genome-grist alpha test
ctb Apr 24, 2021
eb178bb
fix threshold bugarooni
ctb Apr 24, 2021
ee7a6c2
fix gather/prefetch interactions
ctb Apr 24, 2021
174ebbe
fix sourmash prefetch return value
ctb Apr 24, 2021
bea17b3
minor fixes
ctb Apr 24, 2021
ad03e1e
pay proper attention to threshold
ctb Apr 24, 2021
cf86954
cleanup and refactoring
ctb Apr 25, 2021
293fc43
remove unnecessary 'scaled'
ctb Apr 25, 2021
fb87777
minor cleanup
ctb Apr 25, 2021
7631157
added LazyLinearLindex and prefetch --linear
ctb Apr 25, 2021
87be7fa
fix abundance problem
ctb Apr 26, 2021
f90a21f
save matches to a directory
ctb Apr 26, 2021
18d72c4
test for saving matches to a directory
ctb Apr 26, 2021
b1d54df
add a flexible progressive signature output class
ctb Apr 27, 2021
f1556d0
add tests for .sig.gz and .zip outputs
ctb Apr 27, 2021
65b7cbe
update save_signatures code; add tests; use in gather and search too
ctb Apr 27, 2021
9680355
update comment
ctb Apr 27, 2021
f1b742c
cleanup and refactor of SaveSignaturesToLocation code
ctb Apr 28, 2021
a9e5221
docstrings & cleanup
ctb Apr 28, 2021
67e000e
add 'run' and 'runtmp' test fixtures
ctb Apr 28, 2021
ee4b7a0
remove unnecessary track_abundance fixture call
ctb Apr 28, 2021
255014e
restore original;
ctb Apr 28, 2021
c6a607c
Merge branch 'add/run_fixtures' into add/prefetch_cli
ctb Apr 28, 2021
e0ee951
linear and prefetch fixtures + runtmp
ctb Apr 28, 2021
15fb06f
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Apr 28, 2021
4f11bff
fix use of runtmp
ctb Apr 28, 2021
903239b
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb Apr 28, 2021
591f3b1
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb May 1, 2021
83742b9
copy over SaveSignaturesToLocation code from other branch
ctb May 1, 2021
36defa7
docs for sourmash prefetch
ctb May 1, 2021
10c700a
more doc
ctb May 1, 2021
941afdb
minor edits
ctb May 2, 2021
475a515
Re-implement the actual gather protocol with a cleaner interface. (#1…
ctb May 2, 2021
b196ecc
add repr; add tests; support stdout
ctb May 2, 2021
af0f49c
refactor signature saving to use new sourmash_args collection saving
ctb May 2, 2021
c613b43
specify utf-8 encoding for output
ctb May 2, 2021
e19861c
add flexible output to compute/sketch
ctb May 2, 2021
0878218
add test to trigger rust panic
ctb May 2, 2021
345513f
test search --save-matches
ctb May 2, 2021
4ce0f7b
Merge branch 'add/save_signatures_to_loc' into add/prefetch_cli
ctb May 3, 2021
7c117e5
add --save-prefetch to sourmash gather
ctb May 3, 2021
b731df3
Merge branch 'add/prefetch_cli' of github.com:dib-lab/sourmash into a…
ctb May 3, 2021
78e8ef3
remove --no-prefetch option :)
ctb May 3, 2021
d9ad9af
added --save-prefetch functionality
ctb May 3, 2021
b1f79fa
add back a mostly-functioning --no-prefetch argument :)
ctb May 3, 2021
8eeb5c1
add --no-prefetch back in
ctb May 3, 2021
f6fdee3
check for JSON in first byte of LCA DB file
ctb May 3, 2021
566a127
Merge branch 'update/lca_db_load' into add/prefetch_cli
ctb May 3, 2021
2acc218
start adding linear tests
ctb May 3, 2021
d7494a6
use fixtures to test prefetch and linear more thoroughly
ctb May 3, 2021
e64fc47
comments, etc
ctb May 3, 2021
45b36ae
upgrade docs for --linear and --prefetch
ctb May 3, 2021
b3ba89f
'fix' issue and test
ctb May 3, 2021
a17e76b
Merge branch 'add/save_signatures_to_loc' into add/prefetch_cli
ctb May 3, 2021
32fd87d
fix a last test ;)
ctb May 3, 2021
10522c1
Update doc/command-line.md
ctb May 4, 2021
a15ebb9
Update src/sourmash/cli/sig/rename.py
ctb May 4, 2021
f20c354
Update tests/test_sourmash_args.py
ctb May 4, 2021
bb3a0cd
Update tests/test_sourmash_args.py
ctb May 4, 2021
02c6fca
Update tests/test_sourmash_args.py
ctb May 4, 2021
b1f8a8e
Update tests/test_sourmash_args.py
ctb May 4, 2021
1f58564
Update tests/test_sourmash_args.py
ctb May 4, 2021
833645b
Update doc/command-line.md
ctb May 4, 2021
2019e81
Merge branch 'add/save_signatures_to_loc' of github.com:dib-lab/sourm…
ctb May 4, 2021
a4b573a
write tests for LazyLinearIndex
ctb May 4, 2021
1e0f94d
add some basic prefetch tests
ctb May 5, 2021
1b0a424
properly test linear!
ctb May 5, 2021
1135fd8
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb May 5, 2021
92ee772
add more tests for LazyLinearIndex
ctb May 5, 2021
6b2668f
test zipfile bool
ctb May 5, 2021
c100bf0
remove unnecessary try/except; comment
ctb May 5, 2021
53ec3cf
fix signatures() call
ctb May 5, 2021
8c3b67a
fix --prefetch snafu; doc
ctb May 5, 2021
b4cdbe8
do not overwrite signature even if duplicate md5sum (#1497)
ctb May 5, 2021
c158c69
Merge branch 'latest' into add/save_signatures_to_loc
ctb May 5, 2021
bc7802c
Merge branch 'add/save_signatures_to_loc' into add/prefetch_cli
ctb May 5, 2021
f9fcfb6
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb May 5, 2021
b1e82ba
try adding loc to return values from Index.find
ctb May 5, 2021
3b5be03
made use of new IndexSearchResult.find throughout
ctb May 6, 2021
1eef0f1
adjust note
ctb May 6, 2021
4d080f1
provide signatures_with_location on all Index objects
ctb May 6, 2021
028487f
cleanup and fix
ctb May 6, 2021
9a3c1fe
Update doc/command-line.md
ctb May 6, 2021
66e3b6c
Update doc/command-line.md
ctb May 6, 2021
2a33d41
fix bug around --save-prefetch with multiple databases
ctb May 7, 2021
394da46
comment/doc minor updates
ctb May 7, 2021
958d465
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb May 7, 2021
92a2511
Merge branch 'latest' of github.com:dib-lab/sourmash into add/prefetc…
ctb May 8, 2021
86 changes: 81 additions & 5 deletions doc/command-line.md
@@ -57,16 +57,17 @@ species, while the third is from a completely different genus.

To get a list of subcommands, run `sourmash` without any arguments.

-There are six main subcommands: `sketch`, `compare`, `plot`,
-`search`, `gather`, and `index`. See [the tutorial](tutorials.md) for a
-walkthrough of these commands.
+There are seven main subcommands: `sketch`, `compare`, `plot`,
+`search`, `gather`, `index`, and `prefetch`. See
+[the tutorial](tutorials.md) for a walkthrough of these commands.

* `sketch` creates signatures.
* `compare` compares signatures and builds a distance matrix.
* `plot` plots distance matrices created by `compare`.
* `search` finds matches to a query signature in a collection of signatures.
* `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures.
* `index` builds a fast index for many (thousands of) signatures.
* `prefetch` selects signatures of interest from a very large collection of signatures, for later processing.

There are also a number of commands that work with taxonomic
information; these are grouped under the `sourmash lca`
@@ -295,6 +296,29 @@ genomes with no (or incomplete) taxonomic information. Use `sourmash
lca summarize` to classify a metagenome using a collection of genomes
with taxonomic information.

### Alternative search mode for low-memory (but slow) search: `--linear`

By default, `sourmash gather` uses all information available for
faster search. In particular, for SBTs, `prefetch` will prune the search
tree. This can be slow and/or memory intensive for very large databases,
and `--linear` asks `sourmash prefetch` to instead use a linear search
across all leaf nodes in the tree.

The results are the same whether `--no-linear` or `--linear` is
used.

### Alternative search mode: `--no-prefetch`

By default, `sourmash gather` does a "prefetch" to find *all* candidate
signatures across all databases, before removing overlaps between the
candidates. In rare circumstances, depending on the databases and parameters
used, this may be slower or more memory intensive than doing iterative
overlap removal. Prefetch behavior can be turned off with `--no-prefetch`.

The results are the same whether `--prefetch` or `--no-prefetch` is
used. This option can be used with or without `--linear` (although
`--no-prefetch --linear` will generally be MUCH slower).
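
To make the "prefetch, then remove overlaps" idea concrete, here is a minimal sketch that models signatures as plain Python sets of hashes. The function name `gather_with_prefetch` and the dictionary-based database are hypothetical, and the greedy loop is an illustration under those assumptions, not sourmash's actual implementation.

```python
def gather_with_prefetch(query_hashes, database, scaled, threshold_bp):
    """Greedy gather sketch: prefetch all candidates once, then peel off overlaps.

    `database` maps a (hypothetical) signature name to its set of hashes.
    Returns a list of (name, estimated_overlap_bp) in the order they were chosen.
    """
    remaining = set(query_hashes)

    # "Prefetch" step: keep only subjects that share at least one hash with the query.
    candidates = {
        name: hashes & remaining
        for name, hashes in database.items()
        if hashes & remaining
    }

    results = []
    while candidates:
        # Pick the candidate with the largest remaining overlap
        # (ties broken alphabetically, an arbitrary choice for this sketch).
        best = max(sorted(candidates), key=lambda name: len(candidates[name]))
        overlap = candidates.pop(best)

        # Stop once the best remaining overlap falls below the bp threshold.
        if len(overlap) * scaled < threshold_bp:
            break

        results.append((best, len(overlap) * scaled))
        remaining -= overlap

        # Remove the hashes just assigned from every other candidate.
        candidates = {
            name: hashes & remaining
            for name, hashes in candidates.items()
            if hashes & remaining
        }
    return results


if __name__ == "__main__":
    db = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6, 100}}
    print(gather_with_prefetch({1, 2, 3, 4, 5, 6}, db, scaled=1000, threshold_bp=1000))
```

In this sketch the database is only consulted once, during the prefetch step; every later iteration works on the in-memory candidate set, which is the trade-off the section above describes.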

### `sourmash index` - build an SBT index of signatures

The `sourmash index` command creates a Zipped SBT database
@@ -305,11 +329,11 @@ used to create databases for e.g. subsets of GenBank.
These databases support fast search and gather on large collections
of signatures in low memory.

-SBTs can only be created on scaled signatures, and all signatures in
+All signatures in
an SBT must be of compatible types (i.e. the same k-mer size and
molecule type). You can specify the usual command line selectors
(`-k`, `--scaled`, `--dna`, `--protein`, etc.) to pick out the types
-of signatures to include.
+of signatures to include when running `index`.

Usage:
```
@@ -326,6 +350,58 @@ containing a list of file names to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.

### `sourmash prefetch` - select subsets of very large databases for more processing

The `prefetch` subcommand searches a collection of scaled signatures
for matches in a large database, using containment. It is similar to
`search --containment`, while taking a `--threshold-bp` argument like
`gather` does for thresholding matches (instead of using Jaccard
similarity or containment).
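
As a rough illustration of thresholding by estimated overlap rather than by a similarity score, the sketch below treats each signature as a plain Python set of hashes and estimates the overlap in bp as the number of shared hashes times `scaled`. The function names and the dictionary-based database are hypothetical, and the bp estimate is an assumption made for this sketch rather than a statement of sourmash's internal code.

```python
def estimated_overlap_bp(query_hashes, subject_hashes, scaled):
    """Estimate shared sequence (in bp) as shared hashes times the scaled factor."""
    return len(query_hashes & subject_hashes) * scaled


def prefetch_sketch(query_hashes, database, scaled, threshold_bp):
    """Yield (name, containment, overlap_bp) for subjects at or above threshold_bp.

    `database` maps a (hypothetical) signature name to its set of hashes.
    Containment is the fraction of the query's hashes found in the subject.
    """
    if not query_hashes:
        return
    for name, subject_hashes in database.items():
        overlap_bp = estimated_overlap_bp(query_hashes, subject_hashes, scaled)
        # Skip subjects with no shared hashes, even when threshold_bp is 0.
        if overlap_bp == 0 or overlap_bp < threshold_bp:
            continue
        containment = len(query_hashes & subject_hashes) / len(query_hashes)
        yield name, containment, overlap_bp


if __name__ == "__main__":
    db = {"a": {1, 2, 3}, "b": {4, 9}, "c": {7, 8}}
    for row in prefetch_sketch({1, 2, 3, 4}, db, scaled=1000, threshold_bp=2000):
        print(row)
```

Unlike the gather sketch, nothing is subtracted from the query between matches: every subject is compared against the full query, so overlapping matches are all reported.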

`sourmash prefetch` is intended to select a subset of a large database
for further processing. As such, it can search very large collections
of signatures (potentially millions or more), operates in very low
memory (see `--linear` option, below), and does no post-processing of signatures.

`prefetch` has four main output options, which can all be used individually
or together:
* `-o/--output` produces a CSV summary file;
* `--save-matches` saves all matching signatures;
* `--save-matching-hashes` saves a single signature containing all of the hashes that matched any signature in the database at or above the specified threshold;
* `--save-unmatched-hashes` saves a single signature containing the complement of `--save-matching-hashes`.

Other options include:
* the usual `-k/--ksize` and `--dna`/`--protein`/`--dayhoff`/`--hp` signature selectors;
* `--threshold-bp` to require a minimum estimated bp overlap for output;
* `--scaled` for downsampling;
* `--force` to continue past survivable errors.

### Alternative search mode for low-memory (but slow) search: `--linear`

By default, `sourmash prefetch` uses all information available for
faster search. In particular, for SBTs, `prefetch` will prune the search
tree. This can be slow and/or memory intensive for very large databases,
and `--linear` asks `sourmash prefetch` to instead use a linear search
across all leaf nodes in the tree.

### Caveats and comments

`sourmash prefetch` provides no guarantees on output order. It runs in
"streaming mode" on its inputs, in that each input file is loaded,
searched, and then unloaded. And `sourmash prefetch` can be run
separately on multiple databases, after which the results can be
searched in combination with `search`, `gather`, `compare`, etc.

A motivating use case for `sourmash prefetch` is to run it on multiple
large databases with a metagenome query using `--threshold-bp=0`,
`--save-matching-hashes matching-hashes.sig`, and `--save-matches
db-matches.sig`, and then run `sourmash gather matching-hashes.sig
db-matches.sig`.

This combination of commands ensures that the more time- and
memory-intensive `gather` step is run only on a small set of relevant
signatures, rather than all the signatures in the database.
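
The two-step workflow described above can be sketched as a small Python helper that only assembles the command lines; it does not run sourmash. The helper name `prefetch_then_gather_commands`, the query file name, and the database file names are hypothetical, while the flags are the ones listed in this section.

```python
import shlex


def prefetch_then_gather_commands(query, databases,
                                  matching_hashes="matching-hashes.sig",
                                  matches="db-matches.sig"):
    """Build argv lists for a prefetch run followed by a gather run.

    Returns (prefetch_argv, gather_argv); nothing is executed here.
    """
    prefetch_argv = [
        "sourmash", "prefetch", query, *databases,
        "--threshold-bp=0",
        "--save-matching-hashes", matching_hashes,
        "--save-matches", matches,
    ]
    # gather then runs only on the reduced query and the saved matches.
    gather_argv = ["sourmash", "gather", matching_hashes, matches]
    return prefetch_argv, gather_argv


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    prefetch_argv, gather_argv = prefetch_then_gather_commands(
        "metagenome.sig", ["db1.sbt.zip", "db2.sbt.zip"]
    )
    print(shlex.join(prefetch_argv))
    print(shlex.join(gather_argv))
```

The returned lists could be passed to `subprocess.run(..., check=True)` on a machine with sourmash installed.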

## `sourmash lca` subcommands for taxonomic classification

These commands use LCA databases (created with `lca index`, below, or
1 change: 1 addition & 0 deletions src/sourmash/cli/__init__.py
@@ -26,6 +26,7 @@
from . import migrate
from . import multigather
from . import plot
from . import prefetch
from . import sbt_combine
from . import search
from . import watch
24 changes: 23 additions & 1 deletion src/sourmash/cli/gather.py
@@ -27,9 +27,14 @@ def subparser(subparsers):
    )
    subparser.add_argument(
        '--save-matches', metavar='FILE',
-        help='save the matched signatures from the database to the '
+        help='save gather matched signatures from the database to the '
             'specified file'
    )
    subparser.add_argument(
        '--save-prefetch', metavar='FILE',
        help='save all prefetch-matched signatures from the databases to the '
             'specified file or directory'
    )
    subparser.add_argument(
        '--threshold-bp', metavar='REAL', type=float, default=5e4,
        help='reporting threshold (in bp) for estimated overlap with remaining query (default=50kb)'
@@ -58,6 +63,23 @@ def subparser(subparsers):
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)

    # advanced parameters
    subparser.add_argument(
        '--linear', dest="linear", action='store_true',
        help="force a low-memory but maybe slower database search",
    )
    subparser.add_argument(
        '--no-linear', dest="linear", action='store_false',
    )
    subparser.add_argument(
        '--no-prefetch', dest="prefetch", action='store_false',
        help="do not use prefetch before gather; see documentation",
    )
    subparser.add_argument(
        '--prefetch', dest="prefetch", action='store_true',
        help="use prefetch before gather; see documentation",
    )


def main(args):
    import sourmash
70 changes: 70 additions & 0 deletions src/sourmash/cli/prefetch.py
@@ -0,0 +1,70 @@
"""search a signature against dbs, find all overlaps"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args


def subparser(subparsers):
    subparser = subparsers.add_parser('prefetch')
    subparser.add_argument('query', help='query signature')
    subparser.add_argument("databases",
        nargs="*",
        help="one or more databases to search",
    )
    subparser.add_argument(
        "--db-from-file",
        default=None,
        help="list of paths containing signatures to search"
    )
    subparser.add_argument(
        "--linear", action='store_true',
        help="force linear traversal of indexes to minimize loading time and memory use"
    )
    subparser.add_argument(
        '--no-linear', dest="linear", action='store_false',
    )

    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-d', '--debug', action='store_true'
    )
    subparser.add_argument(
        '-o', '--output', metavar='FILE',
        help='output CSV containing matches to this file'
    )
    subparser.add_argument(
        '--save-matches', metavar='FILE',
        help='save all matching signatures from the databases to the '
             'specified file or directory'
    )
    subparser.add_argument(
        '--threshold-bp', metavar='REAL', type=float, default=5e4,
        help='reporting threshold (in bp) for estimated overlap with remaining query hashes (default=50kb)'
    )
    subparser.add_argument(
        '--save-unmatched-hashes', metavar='FILE',
        help='output unmatched query hashes as a signature to the '
             'specified file'
    )
    subparser.add_argument(
        '--save-matching-hashes', metavar='FILE',
        help='output matching query hashes as a signature to the '
             'specified file'
    )
    subparser.add_argument(
        '--scaled', metavar='FLOAT', type=float, default=None,
        help='downsample signatures to the specified scaled factor'
    )
    subparser.add_argument(
        '--md5', default=None,
        help='select the signature with this md5 as query'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)


def main(args):
    import sourmash
    return sourmash.commands.prefetch(args)
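
Both `gather.py` and `prefetch.py` above register paired on/off flags that write to a shared `dest`. The standalone sketch below reproduces only that flag layout (the parser name `flags-sketch` is made up, and none of the other options are included) to show how standard `argparse` resolves the defaults: the first action registered for a `dest` supplies its default, so `linear` starts as `False` and `prefetch` starts as `True`.

```python
import argparse


def build_parser():
    """Parser with the same paired flags as the gather subparser, nothing else."""
    parser = argparse.ArgumentParser(prog="flags-sketch")
    # '--linear' is registered first, so the default for 'linear' is False.
    parser.add_argument('--linear', dest='linear', action='store_true')
    parser.add_argument('--no-linear', dest='linear', action='store_false')
    # '--no-prefetch' is registered first, so the default for 'prefetch' is True.
    parser.add_argument('--no-prefetch', dest='prefetch', action='store_false')
    parser.add_argument('--prefetch', dest='prefetch', action='store_true')
    return parser


if __name__ == "__main__":
    parser = build_parser()
    print(parser.parse_args([]))
    print(parser.parse_args(['--linear', '--no-prefetch']))
```

When both flags of a pair appear on the command line, the one given last takes effect, since each simply overwrites the shared attribute.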