how to support lazy/streaming load with standard sourmash functions: standalone manifests #3023
Comments
ctb changed the title from "how to support streaming load with standard sourmash functions" to "how to support streaming load with standard sourmash functions: standalone manifests" on Feb 22, 2024
This was referenced Feb 22, 2024
ctb changed the title from "how to support streaming load with standard sourmash functions: standalone manifests" to "how to support lazy/streaming load with standard sourmash functions: standalone manifests" on Mar 5, 2024
ctb added a commit that referenced this issue on Mar 20, 2024:

This PR:
* fixes a minor nit in `sourmash sig collect` output where it said "loaded 0 signatures"
* updates a lot of the documentation around standalone manifests to encourage their use
* in tandem, modifies docs to discourage loading from pathlists/from-files and directory hierarchies

TODO:
- [x] look at TODO item re directories in sig collect
- [x] think about adding #3023 information into docs about lazy loading; maybe in the advanced databases document?
- [x] update `sig manifest` docs to point out that they do not generate standalone manifests
- [x] revisit branchwater plugin documentation to either make issues or make changes
- [x] update `sig check` and `sig collect` to tell people to expand their paths ref #3039
- [x] update docs more to recommend against pathlists and directories per #3040

Related issues:
* sourmash-bio/sourmash_plugin_branchwater#235
* Fixes #3048
* Fixes #3009 by recommending `sig collect` and `sig check` instead of `sig manifest` for making standalone manifests
* #3053
* Fixes #3023
* Fixes #3039
* Fixes #3040

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <[email protected]>
@AnneliektH asked me yesterday how best to provide a list of metagenome sketches to `mgmanysearch` (see https://github.com/sourmash-bio/sourmash_plugin_containment_search/). I realized I wasn't 100% sure of the answer, despite having written this:

(Part of my confusion was that the text above is being used through Rust functionality, not through standard Python loading functions.)

`mgmanysearch` uses standard sourmash loading functions, so I thought an investigation would be useful and lead to some additional sourmash documentation too!

tl;dr don't use pathlists, use manifests.
the script
I wrote the following Python script:
the execution
and then ran it on a pathlist containing a list of filenames:
and on a manifest generated with `sourmash sig collect $(cat pathlist.txt) -o mf.csv -F csv`
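To make concrete why a standalone manifest helps, here is a minimal sketch that reads a CSV manifest with only the Python standard library and lists where each sketch lives, without opening any sketch file. It is not sourmash's own loader: `read_manifest_rows` is a hypothetical helper, the version line and column names reflect my understanding of the manifest CSV format, and the two rows are made-up example values.

```python
import csv
import io

# Assumption: CSV manifests begin with a version comment line like this one.
MANIFEST_HEADER_PREFIX = "# SOURMASH-MANIFEST-VERSION"


def read_manifest_rows(fp):
    """Return manifest rows as dicts, without opening any sketch file.

    Hypothetical helper for illustration; not part of the sourmash API.
    """
    first_line = fp.readline()
    if not first_line.startswith(MANIFEST_HEADER_PREFIX):
        raise ValueError("missing manifest version line")
    # The remaining lines are ordinary CSV with a header row.
    return list(csv.DictReader(fp))


# Made-up manifest content (hypothetical paths, md5s, and names).
example_manifest = io.StringIO(
    "# SOURMASH-MANIFEST-VERSION: 1.0\n"
    "internal_location,md5,md5short,ksize,moltype,num,scaled,"
    "n_hashes,with_abundance,name,filename\n"
    "sketches/a.sig.gz,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,aaaaaaaa,"
    "31,DNA,0,1000,5000,0,sample A,a.fa\n"
    "sketches/b.sig.gz,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb,bbbbbbbb,"
    "31,DNA,0,1000,7000,0,sample B,b.fa\n"
)

rows = read_manifest_rows(example_manifest)

# Only metadata is in memory; the sketches themselves are still on disk.
locations = [row["internal_location"] for row in rows]
```

Because each row records a location plus metadata such as `ksize` and `scaled`, a loader can decide which sketches it needs and open them one at a time.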
results
When using pathlists, all sketches are loaded at once at the beginning, consuming All The Memory.
When using manifests, all sketches are loaded on demand, not consuming All the Memory.
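The difference between the two behaviors can be illustrated with a small, library-free sketch. This is not the script from this issue and does not use sourmash at all: `CountingLoader`, `load_eagerly`, and `load_on_demand` are hypothetical names, and the "sketches" are placeholder dictionaries. It only shows the general pattern of loading everything up front versus loading through a generator.

```python
class CountingLoader:
    """Hypothetical stand-in for reading sketches from disk; counts loads."""

    def __init__(self):
        self.loaded = 0

    def load(self, path):
        self.loaded += 1
        # Placeholder payload standing in for a sketch's hashes.
        return {"path": path, "hashes": list(range(1000))}


def load_eagerly(paths, loader):
    # Pathlist-like behavior: every sketch is in memory before any is used.
    return [loader.load(p) for p in paths]


def load_on_demand(paths, loader):
    # Manifest-like behavior: a sketch is loaded only when it is requested.
    for p in paths:
        yield loader.load(p)


paths = [f"sketch{i}.sig.gz" for i in range(5)]  # made-up filenames

eager_loader = CountingLoader()
sketches = load_eagerly(paths, eager_loader)
eager_loaded_before_use = eager_loader.loaded

lazy_loader = CountingLoader()
stream = load_on_demand(paths, lazy_loader)
lazy_loaded_before_use = lazy_loader.loaded
first = next(stream)
lazy_loaded_after_first = lazy_loader.loaded
```

In the eager case all five loads happen before any sketch is used; in the on-demand case nothing is loaded until the first `next()` call, and each sketch can be discarded once it has been processed.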
other thoughts
This is another reason to use .zip files to store sketches, instead of sig.gz files; `sig collect` will need to load the actual sketches in sig.gz files in order to build the manifest, while the manifest is already available in .zip files.

tl;dr `mgmanysearch`, `sig collect` to build a manifest across some or all of them,

TODO: verify that `sig collect` loads things on the command line progressively 😅

Related issues:
`Index` classes from command-line for lower memory #1899