MRG: modify n simultaneous downloads; update buildutils #154

Open: wants to merge 21 commits into main
43 changes: 21 additions & 22 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions Cargo.toml
@@ -9,10 +9,10 @@ name = "sourmash_plugin_directsketch"
crate-type = ["cdylib"]

[dependencies]
-pyo3 = { version = "0.23.3", features = ["extension-module", "anyhow"] }
+pyo3 = { version = "0.23.3", features = ["extension-module","anyhow"]}
rayon = "1.10.0"
serde = { version = "1.0.204", features = ["derive"] }
sourmash = { version = "0.17.2"}
sourmash = { version = "0.18.0"}
serde_json = "1.0.134"
niffler = "2.4.0"
needletail = "0.5.1"
19 changes: 16 additions & 3 deletions README.md
@@ -47,8 +47,15 @@ pip install sourmash_plugin_directsketch

## Usage Considerations

-If you're building large databases (over 20k files), we highly recommend you use batched zipfiles (v0.4+) to facilitate restart. If you encounter unexpected failures and are using a single zipfile output (default), `gbsketch`/`urlsketch` will have to re-download and re-sketch all files. If you instead set a batch size using `--batch-size`, e.g. 10000, then `gbsketch`/`urlsketch` can load any batched zips that finished writing, and avoid re-generating those signatures. For `gbsketch`, the batch size represents the number of accessions included in each zip, with all signatures associated with an accession grouped within a single `zip`. For `urlsketch`, the batch size represents the number of total signatures included in each zip. Note that batches will use the `--output` file to build batched filenames, so if you provided `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc.
+### Allowing restart with batching

+If you're building large databases, we highly recommend you use batched zipfiles (v0.4+) to facilitate restart. If you encounter unexpected failures and are using a single zipfile output (the default), `gbsketch`/`urlsketch` will have to re-download and re-sketch all files. If you instead set a batch size using `--batch-size`, then `gbsketch`/`urlsketch` can load any batched zips that finished writing and avoid re-generating those signatures. For `gbsketch`, the batch size is the number of accessions included in each zip, with all signatures associated with an accession grouped within a single zip. For `urlsketch`, the batch size is the number of signatures associated with each URL provided. Note that batches use the `--output` file to build batched filenames, so if you provided `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc. For small genomes (e.g. microbes), you can keep batch sizes quite large, e.g. in the thousands to tens of thousands. For large eukaryotic genomes, where downloads take much longer, you may want to use smaller batch sizes.
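
A minimal sketch of a batched run, assuming an accessions CSV as the positional input (the filename and batch size here are illustrative, and the plugin is invoked via `sourmash scripts`):

```
# batches land in output.1.zip, output.2.zip, ... as they finish
sourmash scripts gbsketch accessions.csv -o output.zip --batch-size 5000

# after an unexpected failure, re-running the same command loads the
# finished batches and only re-downloads/re-sketches what is missing
sourmash scripts gbsketch accessions.csv -o output.zip --batch-size 5000
```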

+To build a single database after batched sketching, you can use `sig cat` to build a single zipfile (`sourmash sig cat *.zip -o OUTPUT.zip`) or `sig collect` to collect all the zips into a standalone manifest that can be used with sourmash and branchwater commands.
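
Concretely, with the batch naming scheme above (the manifest filename is illustrative):

```
# merge batched zips into a single database
sourmash sig cat output.*.zip -o OUTPUT.zip

# or collect them behind a standalone manifest without copying signatures
sourmash sig collect output.*.zip -o all.sqlmf
```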

+### Memory Requirements

+Directsketch downloads the full file, optionally checks the `md5sum`, and then sketches it. As a result, you will need enough memory to hold up to three genomes at once. For microbial genomes, this is trivial. For large eukaryotic genomes (e.g. plants!), be sure to provide sufficient memory. You can tune the number of simultaneous downloads (and thus the number of genomes held in memory at once) with `--n-simultaneous-downloads`.
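
For example (input filenames are hypothetical), the default of one simultaneous download bounds how many genomes are held in memory, while the maximum of three trades memory for throughput:

```
# large eukaryotic genomes: one download/genome in memory at a time
sourmash scripts gbsketch plants.csv -o plants.zip -n 1

# small microbial genomes: up to three in flight at once
sourmash scripts gbsketch microbes.csv -o microbes.zip -n 3
```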

## Running the commands

@@ -125,7 +132,8 @@ options:
-f FASTAS, --fastas FASTAS
Write fastas here
--batch-size BATCH_SIZE
-Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows gbsketch to recover after unexpected failures, rather than needing to
+Write smaller zipfiles, each containing sigs associated with this number of accessions.
+This allows gbsketch to recover after unexpected failures, rather than needing to
restart sketching from scratch. Default: write all sigs to single zipfile.
-k, --keep-fasta write FASTA files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
@@ -138,6 +146,8 @@ options:
number of cores to use (default is all available)
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
+-n {1,2,3}, --n-simultaneous-downloads {1,2,3}
+number of accessions to download simultaneously (default=1)
-g, --genomes-only just download and sketch genome (DNA) files
-m, --proteomes-only just download and sketch proteome (protein) files
```
@@ -186,7 +196,8 @@ options:
-o OUTPUT, --output OUTPUT
output zip file for the signatures
--batch-size BATCH_SIZE
-Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows urlsketch to recover after unexpected failures, rather than needing to
+Write smaller zipfiles, each containing sigs associated with this number of urls.
+This allows urlsketch to recover after unexpected failures, rather than needing to
restart sketching from scratch. Default: write all sigs to single zipfile.
-f FASTAS, --fastas FASTAS
Write fastas here
@@ -202,6 +213,8 @@ options:
number of cores to use (default is all available)
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
+-n {1,2,3}, --n-simultaneous-downloads {1,2,3}
+number of simultaneous downloads (default=3)
```
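
As with `gbsketch`, the new flag composes with batching; for instance (the URL CSV name is hypothetical, and `-n` defaults to 3 for `urlsketch`):

```
# dial simultaneous downloads down for very large files
sourmash scripts urlsketch urls.csv -o sigs.zip --batch-size 1000 -n 1
```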

## Code of Conduct