MRG: enable multithreaded sketching to zip file (manysketch) #88

Merged: 24 commits, Sep 7, 2023
32 changes: 32 additions & 0 deletions Cargo.lock

2 changes: 2 additions & 0 deletions Cargo.toml
@@ -21,6 +21,8 @@ simple-error = "0.3.0"
anyhow = "1.0.75"
zip = "0.6"
tempfile = "3.8"
needletail = "0.5.1"
csv = "1.2.2"

[dev-dependencies]
assert_cmd = "2.0.4"
31 changes: 27 additions & 4 deletions doc/README.md
@@ -1,17 +1,16 @@
# fastgather, fastmultigather, and manysearch - an introduction

This repository implements four sourmash plugins, `fastgather`, `fastmultigather`, `multisearch`, and `manysearch`. These plugins make use of multithreading in Rust to provide very fast implementations of `search` and `gather`. With large databases, these commands can be hundreds to thousands of times faster, and 10-50x lower memory.
This repository implements five sourmash plugins: `manysketch`, `fastgather`, `fastmultigather`, `multisearch`, and `manysearch`. These plugins use multithreading in Rust to provide very fast implementations of `sketch`, `search`, and `gather`. With large databases, these commands can be hundreds to thousands of times faster and use 10-50x less memory.

The main *drawback* to these plugin commands is that their inputs and outputs are not as rich as the native sourmash commands. In particular, this means that input databases need to be prepared differently, and the output may be most useful as a prefilter in conjunction with regular sourmash commands.

## Preparing the database

All four commands use
_text files containing lists of signature files_, or "fromfiles", for the search database, and `multisearch`, `manysearch` and `fastmultigather` use "fromfiles" for queries, too.
`manysketch` requires a `fromfile` csv with columns `name,genome_filename,protein_filename`. If you don't have protein files, leave `protein_filename` empty but keep the trailing comma so the csv reader can process the file correctly. All four search commands use _text files containing lists of signature files_, or "fromfiles", for the search database. `multisearch`, `manysearch` and `fastmultigather` use "fromfiles" for queries, too.
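
A minimal sketch of how such a csv could be generated with Python's standard `csv` module, so that rows without a protein file still end in a trailing comma. The genome names and paths are hypothetical, and writing the column names as a header row is an assumption based on the description above.

```python
import csv
import io

def write_fromfile(rows, handle):
    """Write a manysketch-style fromfile csv to an open text handle.

    rows: iterable of (name, genome_filename, protein_filename) tuples.
    Pass "" (or None) for a missing protein file; the empty field is
    written as nothing, which leaves the trailing comma in place.
    """
    writer = csv.writer(handle, lineterminator="\n")
    # Assumption: the column names are written as a header row.
    writer.writerow(["name", "genome_filename", "protein_filename"])
    for name, genome, protein in rows:
        writer.writerow([name, genome, protein or ""])

# Hypothetical names and paths, for illustration only.
rows = [
    ("GCA_000001 example genome", "genomes/GCA_000001.fna.gz", ""),
    ("GCA_000002 example genome", "genomes/GCA_000002.fna.gz",
     "proteins/GCA_000002.faa.gz"),
]
buffer = io.StringIO()
write_fromfile(rows, buffer)
fromfile_text = buffer.getvalue()
print(fromfile_text)
```

To write to disk instead, open the target with `open("fa.csv", "w", newline="")` and pass that handle to `write_fromfile`.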

(Yes, this plugin will eventually be upgraded to support zip files; keep an eye on [sourmash#2230](https://github.com/sourmash-bio/sourmash/pull/2230).)

To prepare a fromfile from a database, first you need to split the database into individual files:
To prepare a **signature** fromfile from a database, first you need to split the database into individual files:
```
mkdir gtdb-reps-rs214-k21/
cd gtdb-reps-rs214-k21/
@@ -26,6 +25,28 @@ find gtdb-reps-rs214-k21/ -name "*.sig.gz" -type f > list.gtdb-reps-rs214-k21.tx

## Running the commands

### Running `manysketch`

The `manysketch` command sketches one or more FASTA files into a zipped sourmash signature collection (`zip`).

To run `manysketch`, you need to build a fromfile csv that lists your FASTA files, one per row, using the `name,genome_filename,protein_filename` columns (`fa.csv`, below). You can then run:

```
sourmash scripts manysketch fa.csv -o fa.zip
```
The output will be written to `fa.zip`.

You can check whether all signatures were written properly with:
```
sourmash sig summarize fa.zip
```
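
As a quick complementary check, the archive can also be inspected with Python's standard `zipfile` module. This is only a sketch: it assumes signatures are stored as members whose names end in `.sig.gz` alongside a manifest file, and the in-memory archive below is a stand-in with placeholder contents, not real `manysketch` output.

```python
import io
import zipfile

def count_signature_entries(zip_source, suffix=".sig.gz"):
    """Return how many archive members end with the given suffix."""
    with zipfile.ZipFile(zip_source) as archive:
        return sum(1 for name in archive.namelist() if name.endswith(suffix))

# Build a tiny stand-in archive in memory so the example runs without sourmash.
# The member names are assumptions about the zip layout.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("SOURMASH-MANIFEST.csv", "placeholder manifest\n")
    archive.writestr("signatures/aaa.sig.gz", b"placeholder")
    archive.writestr("signatures/bbb.sig.gz", b"placeholder")
demo.seek(0)

n_signatures = count_signature_entries(demo)
print(n_signatures)
```

For a real run, pass the output path (for example `"fa.zip"`) instead of the in-memory buffer and compare the count with the number of sketches you expected.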

To modify sketching parameters, use `--param-str` or `-p` and provide one or more valid param strings:
```
sourmash scripts manysketch fa.csv -o fa.zip -p k=21,k=31,k=51,scaled=1000,abund -p protein,k=10,scaled=200
```
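
To make the structure of a param string concrete, here is an illustrative parser for the two strings used above. It is not the plugin's own parser: the function name, the returned dictionary, and the defaults (`dna`, `scaled=1000`) are assumptions made for this sketch.

```python
def parse_param_str(param_str):
    """Split one -p value into moltype, k-sizes, scaled, and abundance flag.

    Illustrative only: the defaults below are assumptions, not necessarily
    the values manysketch applies.
    """
    params = {"moltype": "dna", "ksizes": [], "scaled": 1000, "abund": False}
    for item in param_str.split(","):
        item = item.strip()
        if item.startswith("k="):
            params["ksizes"].append(int(item[len("k="):]))
        elif item.startswith("scaled="):
            params["scaled"] = int(item[len("scaled="):])
        elif item == "abund":
            params["abund"] = True
        elif item in ("dna", "protein", "dayhoff", "hp"):
            params["moltype"] = item
        else:
            raise ValueError(f"unrecognized param: {item!r}")
    return params

dna = parse_param_str("k=21,k=31,k=51,scaled=1000,abund")
protein = parse_param_str("protein,k=10,scaled=200")
print(dna)
print(protein)
```

Each `-p` value describes one group of sketches: the first string above requests three k-sizes with abundance tracking, and the second requests a single protein k-size at a different `scaled` value.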


### Running `multisearch`

The `multisearch` command compares one or more query genomes, and one or more subject genomes. It differs from `manysearch` by loading all genomes into memory.
@@ -127,6 +148,8 @@ Each command does things slightly differently, with implications for CPU and dis

(The below info is for fromfile lists. If you are using mastiff indexes, very different performance parameters apply. We will update here as we benchmark and improve!)

`manysketch` loads one fasta file from disk per thread and sketches it using all signature params simultaneously.
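
The work pattern described for `manysketch` (one input per worker, all parameter sets applied in a single pass) can be sketched in Python as follows. This is a structural illustration only: it counts distinct k-mers in toy in-memory strings rather than building MinHash sketches from FASTA files, every name in it is hypothetical, and Python threads will not show the speedups of the Rust implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def count_kmers_all_ksizes(sequence, ksizes):
    """Collect distinct k-mers for every k during one walk over the sequence."""
    kmers = {k: set() for k in ksizes}
    for start in range(len(sequence)):
        for k in ksizes:
            if start + k <= len(sequence):
                kmers[k].add(sequence[start:start + k])
    return {k: len(found) for k, found in kmers.items()}

def sketch_many(sequences, ksizes, threads=4):
    """Process one input per worker; each worker applies all k-sizes."""
    names = list(sequences)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = pool.map(
            lambda name: count_kmers_all_ksizes(sequences[name], ksizes), names
        )
        return dict(zip(names, results))

# Toy in-memory "files"; a real run would read FASTA records from disk.
toy = {"a.fa": "ACGTACGT", "b.fa": "AAAAAA"}
summary = sketch_many(toy, ksizes=[2, 3])
print(summary)
```

Because each input is read once and every parameter set is updated during that single pass, adding more parameter sets adds compute per file without adding disk reads.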

`manysearch` loads all the queries at the beginning, and then loads one database sketch from disk per thread. The compute-per-database-sketch is dominated by I/O. So your number of threads should be chosen with care for disk load. We typically limit it to `-c 32` for shared disks.

`multisearch` loads all the queries and database sketches once, at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine!
1 change: 1 addition & 0 deletions pyproject.toml
@@ -21,6 +21,7 @@ fastgather = "pyo3_branchwater:Branchwater_Fastgather"
fastmultigather = "pyo3_branchwater:Branchwater_Fastmultigather"
index = "pyo3_branchwater:Branchwater_Index"
check = "pyo3_branchwater:Branchwater_Check"
manysketch = "pyo3_branchwater:Branchwater_Manysketch"


[tool.maturin]