upd batch info to better reflect method
bluegenes committed Dec 10, 2024
1 parent d2d90f0 commit bd24b26
Showing 2 changed files with 13 additions and 13 deletions.
15 changes: 12 additions & 3 deletions README.md
@@ -47,8 +47,15 @@ pip install sourmash_plugin_directsketch

## Usage Considerations

If you're building large databases (over 20k files), we highly recommend you use batched zipfiles (v0.4+) to facilitate restart. If you encounter unexpected failures and are using a single zipfile output (default), `gbsketch`/`urlsketch` will have to re-download and re-sketch all files. If you instead set a batch size using `--batch-size`, e.g. 10000, then `gbsketch`/`urlsketch` can load any batched zips that finished writing, and avoid re-generating those signatures. For `gbsketch`, the batch size represents the number of accessions included in each zip, with all signatures associated with an accession grouped within a single `zip`. For `urlsketch`, the batch size represents the number of total signatures included in each zip. Note that batches will use the `--output` file to build batched filenames, so if you provided `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc.
### Allowing restart with batching

If you're building large databases, we highly recommend you use batched zipfiles (v0.4+) to facilitate restart. If you encounter unexpected failures while writing a single zipfile (the default), `gbsketch`/`urlsketch` will have to re-download and re-sketch all files. If you instead set a batch size using `--batch-size`, then `gbsketch`/`urlsketch` can load any batched zips that finished writing and avoid re-generating those signatures. For `gbsketch`, the batch size is the number of accessions included in each zip, with all signatures associated with an accession grouped within a single zip. For `urlsketch`, the batch size is the number of URLs whose signatures are written to each zip. Note that batch filenames are built from the `--output` file, so if you provide `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc. For small genomes (e.g. microbes), you can keep batch sizes very large, e.g. 1000-10000. For large eukaryotic genomes, where downloads take much longer, you may want to use batch sizes in the hundreds.
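
The batch naming rule described above can be sketched in a few lines of Python. This is only an illustration: `batch_zip_names` is a hypothetical helper, not part of the plugin, and it assumes that batches are numbered from 1 and that every batch except the last is full. The plugin's actual numbering, especially after a restart, may differ.

```python
import math
from pathlib import Path


def batch_zip_names(output, n_items, batch_size):
    """Hypothetical helper: list the batch filenames expected for a run.

    `n_items` is the number of accessions (gbsketch) or URLs (urlsketch).
    Assumes batches are numbered from 1 and named <stem>.<N><suffix>.
    """
    out = Path(output)
    n_batches = math.ceil(n_items / batch_size)
    return [
        str(out.with_name(f"{out.stem}.{i}{out.suffix}"))
        for i in range(1, n_batches + 1)
    ]


# 25,000 accessions with --batch-size 10000 would need three batches.
print(batch_zip_names("output.zip", 25000, 10000))
```

A listing like this can help when checking which batches finished writing before a restart.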

To build a single database after batched sketching, you can use `sig cat` to build a single zipfile (`sourmash sig cat *.zip -o OUTPUT.zip`) or `sig collect` to collect all the zips into a standalone manifest that can be used with sourmash and branchwater commands.
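
As a rough sketch of the `sig cat` step, the snippet below builds the command from only those zips that match the batch naming pattern, in numeric order. `sig_cat_command` is a hypothetical helper, not part of sourmash or this plugin, and the `<prefix>.<N>.zip` pattern is taken from the naming described above. It only constructs the argument list; running it is left to the caller.

```python
import re
from pathlib import Path


def sig_cat_command(batch_dir, prefix, combined):
    """Build (but do not run) a `sourmash sig cat` command over batch zips.

    Hypothetical helper: selects files named <prefix>.<N>.zip in `batch_dir`
    and orders them by batch number N.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}\.(\d+)\.zip$")
    numbered = []
    for path in Path(batch_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            numbered.append((int(match.group(1)), str(path)))
    numbered.sort()
    return ["sourmash", "sig", "cat", *(p for _, p in numbered), "-o", combined]


if __name__ == "__main__":
    # Batch zips in the current directory, combined into OUTPUT.zip.
    print(sig_cat_command(".", "output", "OUTPUT.zip"))
```

Matching the batch pattern explicitly, rather than using `*.zip`, keeps a previously written `OUTPUT.zip` in the same directory from being fed back into the command. The resulting list can be passed to `subprocess.run(cmd, check=True)`.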

### Memory Requirements

Directsketch downloads the full file, optionally checking the `md5sum`, then performs the sketch. As a result, you will need enough memory to hold up to 3 genomes at once. For microbial genomes, this is trivial. For large eukaryotic genomes (e.g. plants!), be sure to provide sufficient memory. You can tune the number of simultaneous downloads (and thus the number of genomes held in memory simultaneously) with `--n-simultaneous-downloads`.

## Running the commands

@@ -125,7 +132,8 @@ options:
-f FASTAS, --fastas FASTAS
Write fastas here
--batch-size BATCH_SIZE
Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows gbsketch to recover after unexpected failures, rather than needing to
Write smaller zipfiles, each containing sigs associated with this number of accessions.
This allows gbsketch to recover after unexpected failures, rather than needing to
restart sketching from scratch. Default: write all sigs to single zipfile.
-k, --keep-fasta write FASTA files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
@@ -186,7 +194,8 @@ options:
-o OUTPUT, --output OUTPUT
output zip file for the signatures
--batch-size BATCH_SIZE
Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows urlsketch to recover after unexpected failures, rather than needing to
Write smaller zipfiles, each containing sigs associated with this number of urls.
This allows urlsketch to recover after unexpected failures, rather than needing to
restart sketching from scratch. Default: write all sigs to single zipfile.
-f FASTAS, --fastas FASTAS
Write fastas here
11 changes: 1 addition & 10 deletions tests/test_gbsketch.py
@@ -276,21 +276,13 @@ def test_gbsketch_save_fastas(runtmp):
else:
assert sig.md5sum() == ss3.md5sum()


def test_gbsketch_download_only(runtmp, capfd):
    acc_csv = get_test_data('acc.csv')
    output = runtmp.output('simple.zip')
    failed = runtmp.output('failed.csv')
    out_dir = runtmp.output('out_fastas')
    ch_fail = runtmp.output('checksum_dl_failed.csv')


    sig1 = get_test_data('GCA_000175535.1.sig.gz')
    sig2 = get_test_data('GCA_000961135.2.sig.gz')
    sig3 = get_test_data('GCA_000961135.2.protein.sig.gz')
    ss1 = sourmash.load_one_signature(sig1, ksize=31)
    ss2 = sourmash.load_one_signature(sig2, ksize=31)
    ss3 = sourmash.load_one_signature(sig3, ksize=30, select_moltype='protein')

    runtmp.sourmash('scripts', 'gbsketch', acc_csv, '--download-only',
                    '--failed', failed, '-r', '1', '--fastas', out_dir, '--keep-fasta',
                    '--checksum-fail', ch_fail,
@@ -643,7 +635,6 @@ def test_gbsketch_simple_batched_multiple(runtmp, capfd):
    sig3 = get_test_data('GCA_000961135.2.protein.sig.gz')
    ss1 = sourmash.load_one_signature(sig1, ksize=31)
    ss2 = sourmash.load_one_signature(sig2, ksize=31)
    # why does this need ksize =30 and not ksize = 10!???
    ss3 = sourmash.load_one_signature(sig3, ksize=30, select_moltype='protein')

    runtmp.sourmash('scripts', 'gbsketch', acc_csv, '-o', output,