add some usage docs
bluegenes committed Sep 5, 2023
1 parent ed93fb1 commit 2d9d37f
Showing 1 changed file with 29 additions and 4 deletions.
33 changes: 29 additions & 4 deletions doc/README.md
@@ -1,17 +1,18 @@
# fastgather, fastmultigather, and manysearch - an introduction

This repository implements four sourmash plugins, `fastgather`, `fastmultigather`, `multisearch`, and `manysearch`. These plugins make use of multithreading in Rust to provide very fast implementations of `search` and `gather`. With large databases, these commands can be hundreds to thousands of times faster, and 10-50x lower memory.
This repository implements five sourmash plugins, `manysketch`, `fastgather`, `fastmultigather`, `multisearch`, and `manysearch`. These plugins make use of multithreading in Rust to provide very fast implementations of `sketch`, `search`, and `gather`. With large databases, these commands can be hundreds to thousands of times faster and use 10-50x less memory.

The main *drawback* to these plugin commands is that their inputs and outputs are not as rich as the native sourmash commands. In particular, this means that input databases need to be prepared differently, and the output may be most useful as a prefilter in conjunction with regular sourmash commands.

## Preparing the database

All four commands use
_text files containing lists of signature files_, or "fromfiles", for the search database, and `multisearch`, `manysearch` and `fastmultigather` use "fromfiles" for queries, too.
All five commands use _text files containing lists of files_, or "fromfiles":
- `manysketch` requires a list of **fasta** files.
- the remaining commands require **signature** files for the search database. `multisearch`, `manysearch` and `fastmultigather` use "fromfiles" for queries, too.

(Yes, this plugin will eventually be upgraded to support zip files; keep an eye on [sourmash#2230](https://github.com/sourmash-bio/sourmash/pull/2230).)

To prepare a fromfile from a database, first you need to split the database into individual files:
To prepare a **signature** fromfile from a database, first you need to split the database into individual files:
```
mkdir gtdb-reps-rs214-k21/
cd gtdb-reps-rs214-k21/
@@ -26,6 +27,28 @@ find gtdb-reps-rs214-k21/ -name "*.sig.gz" -type f > list.gtdb-reps-rs214-k21.tx

## Running the commands

### Running `manysketch`

The `manysketch` command sketches one or more fasta files into a zipped sourmash signature collection (`zip`).

To run `manysketch`, you need to build a text file list of fasta files, with one on each line (`fa.txt`, below). You can then run:

```
sourmash scripts manysketch fa.txt -o fa.zip
```
The output will be written to `fa.zip`.
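
A minimal sketch of one way to build the fasta list, mirroring the `find` approach used for signature fromfiles. The `genomes/` directory, the `*.fa` extension, and the two tiny placeholder sequences are assumptions for illustration only; substitute your own paths.

```shell
# Hypothetical layout: fasta files live under genomes/ (assumed name).
mkdir -p genomes
printf '>seq1\nACGTACGT\n' > genomes/a.fa
printf '>seq2\nTTGACCAA\n' > genomes/b.fa

# Write one fasta path per line, the format manysketch reads.
find genomes/ -name "*.fa" -type f | sort > fa.txt

# Sanity check: report any listed path that is missing or empty.
while IFS= read -r path; do
    [ -s "$path" ] || echo "missing or empty: $path" >&2
done < fa.txt
```

Sorting the list is optional; it only makes the file reproducible between runs.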

You can check whether all signatures were written properly with:
```
sourmash sig summarize fa.zip
```

To modify sketching parameters, use `--param-str` or `-p` and provide one or more valid parameter strings:
```
sourmash scripts manysketch fa.txt -o fa.zip -p k=21,k=31,k=51,scaled=1000,abund -p protein,k=10,scaled=200
```


### Running `multisearch`

The `multisearch` command compares one or more query genomes against one or more subject genomes. It differs from `manysearch` by loading everything into memory.
@@ -127,6 +150,8 @@ Each command does things slightly differently, with implications for CPU and dis

(The below info is for fromfile lists. If you are using mastiff indexes, very different performance parameters apply. We will update here as we benchmark and improve!)

`manysketch` loads one fasta file from disk per thread and sketches it into all signature types simultaneously.

`manysearch` loads all the queries at the beginning, and then loads one database sketch from disk per thread. The work per database sketch is dominated by I/O, so choose the number of threads with disk load in mind. We typically limit it to `-c 32` for shared disks.

`multisearch` loads all the queries and database sketches once, at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine!