command | functionality | docs |
---|---|---|
`manysketch` | Rapidly build sketches for many input files | link |
`fastgather` | Multithreaded gather of one metagenome against a database | link |
`fastmultigather` | Multithreaded gather of multiple metagenomes against a database | link |
`manysearch` | Multithreaded containment search for many queries in many large metagenomes | link |
`multisearch` | Multithreaded comparison of multiple sketches, in memory | link |
`pairwise` | Multithreaded pairwise comparison of multiple sketches, in memory | link |
`cluster` | Cluster sequences based on similarity data from `pairwise` or `multisearch` | link |
This repository implements multithreaded plugins for sourmash that provide very fast implementations of `sketch`, `search`, and `gather`. These commands are typically hundreds to thousands of times faster, and 10-50x lower memory, than the current sourmash code. For example, a `gather` of SRR606249 with sourmash v4.8.6 against GTDB rs214 takes 40 minutes and 14 GB of RAM, while `fastgather` with 64 cores takes only 2 minutes and 2 GB of RAM.
The main drawback to these plugin commands is that their inputs and outputs differ from (and are sometimes not as rich as) those of the native sourmash commands. In particular, this means that your input files may need to be prepared differently, and the output may need to be processed differently.
Note: As of v0.9.5, the outputs of `fastgather` and `fastmultigather` almost completely match the output of `sourmash gather`; see below for details.
sourmash supports a variety of different storage formats for sketches (see sourmash docs), and the branchwater plugin works with some (but not all) of them. Branchwater also supports an additional database type, a RocksDB-based inverted index, that is not yet supported by sourmash (through v4.8.9).
command | query input | database format |
---|---|---|
`manysketch` | CSV with input fasta/fastq paths (details below) | produces Zip database |
`fastgather` | Single metagenome in sig, zip, or fromfile | Zip or fromfile |
`fastmultigather` | Multiple metagenomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index |
`manysearch` | Multiple genomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index |
`multisearch` | Multiple sketches in sig, zip, or fromfile | Multiple sketches in sig, zip, or fromfile |
`pairwise` | Multiple sketches in sig, zip, or fromfile | N/A |
`cluster` | Output from `pairwise` or `multisearch` | N/A |
When working with large collections of small sketches such as genomes, we suggest using zipfiles as produced by sourmash (e.g. using `sourmash sig cat` or `manysketch`). Zip files have a few nice features:
- sketches are compressed in zip files;
- zip files can contain many sketches, including incompatible types (e.g. multiple k-mer sizes);
- zip files contain "manifests" listing their contents;
- subsets of zip files can be efficiently selected and loaded depending on what is needed;
- in particular, single sketches can be loaded on demand, supporting lower memory requirements for certain kinds of searches.
For all these reasons, zip files are the most efficient and effective basic storage type for sketches in sourmash, and as of the branchwater plugin v0.9.0, they are fully supported!
You can create zipfiles with sourmash like so:
sourmash sig cat <list of sketches> -o sigs.zip
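You can then check the contents of the resulting zipfile with:
sourmash sig summarize sigs.zip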
The branchwater plugin also supports a database type that is not yet supported by sourmash: inverted indexes stored in a RocksDB database. These indexes provide fast and low-memory lookups when searching very large datasets, and are used for the branchwater petabase scale search hosted at branchwater.sourmash.bio.
Some commands - `fastmultigather` and `manysearch` - support using these RocksDB-based inverted indexes. They can be created by running `sourmash scripts index`.
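For example, to build a RocksDB index from an existing zip collection (a sketch of typical usage; the filenames are placeholders, and `sourmash scripts index --help` lists the full set of options):
sourmash scripts index database.zip -o database.rocksdb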
You can make a fromfile by listing a collection of .sig.gz files like so:
find /path/to/directory/ -name "*.sig.gz" -type f > directory.txt
When using a fromfile for search, we load all signatures into memory at the start in order to generate a manifest. To avoid memory issues, the signatures are not kept in memory, but instead re-loaded as described below for each command (see: Notes on concurrency and efficiency). This makes using fromfiles less efficient than zip files (as of v0.9.0).
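As a sketch of how a fromfile is used (filenames are placeholders), the resulting text file can be passed anywhere a database collection is accepted, e.g.:
sourmash scripts fastgather query.sig.gz directory.txt -o results.csv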
The `manysketch` command sketches one or more FASTA/FASTQ files into a zipped sourmash signature collection (`zip`). `manysketch` uses one thread per input file, so it can (very) efficiently sketch many files at once; and, because sequence file parsing is entirely implemented in Rust, it is much, much faster than `sourmash sketch` for large FASTQ files. However, it does not currently support translation, i.e. protein signature generation from DNA FASTA.
To run `manysketch`, you need to build a text file list of FASTA/FASTQ files (see `manysketch.csv` example, below).
The following formats are accepted:
- 3 columns: `name,genome_filename,protein_filename`. `genome_filename` entries are considered DNA FASTA; `protein_filename` entries are considered protein FASTA.
- 3 columns: `name,read1,read2`. All entries are considered DNA FASTA, and both `read1` and `read2` files are used as input for a single sketch with name `name` (see the example CSV after this list).
- 4 columns: `name,input_moltype,prefix,exclude`. This filetype uses `glob` to find files that match `prefix` but do not match `exclude`. As such, `*` is ok in the `prefix` and `exclude` columns. Since we are dealing with "prefixes" here, we automatically search with `*` on the end of the `prefix` entry.
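For illustration, a minimal `manysketch.csv` using the paired-read format might look like this (the sample names and filenames are hypothetical):
name,read1,read2
sample1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
sample2,sample2_R1.fastq.gz,sample2_R2.fastq.gz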
A simple way to build a manysketch input file for a directory is this command snippet:
echo name,genome_filename,protein_filename > manysketch.csv
for i in *.fa.gz
do
echo "$i,$i,"
done >> manysketch.csv
You can then run:
sourmash scripts manysketch manysketch.csv -o fa.zip
The output will be written to `fa.zip`.
You can check if all signatures were written properly with:
sourmash sig summarize fa.zip
To modify sketching parameters, use `--param-str` or `-p` and provide valid param string(s):
sourmash scripts manysketch fa.csv -o fa.zip -p k=21,k=31,k=51,scaled=1000,abund -p protein,k=10,scaled=200
See the sourmash sketch docs for more information on param strings.
`manysketch` also supports building independent sketches for each record in a FASTA file (`--singleton`).
You can run:
sourmash scripts manysketch manysketch.csv -o fa.zip --singleton
The output will be written to `fa.zip`.
You can check if all signatures were written properly with:
sourmash sig summarize fa.zip
The number of sketches per parameter combination should equal the total number of records in all input FASTA files. The `name` column will not be used; instead, each sketch will be named from the FASTA record name.
The `multisearch` command compares one or more query genomes, and one or more subject genomes. It differs from `manysearch` by loading all genomes into memory.
`multisearch` takes two input collections and outputs a CSV:
sourmash scripts multisearch query.sig.gz database.zip -o results.csv
The results file `results.csv` will have 8 columns: `query` and `query_md5`, `match` and `match_md5`, and `containment`, `jaccard`, `max_containment`, and `intersect_hashes`.
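To get a quick look at the strongest matches, you can sort the output on the containment column with standard shell tools. This assumes the column order listed above, with `containment` as the fifth column; adjust `-k5,5` if your output differs:
(head -1 results.csv; tail -n +2 results.csv | sort -t, -k5,5gr) | head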
The `pairwise` command does the same comparisons as `multisearch` but takes only a single collection of sketches, for which it calculates all the pairwise comparisons. Since the comparisons are symmetric, it is approximately twice as fast as `multisearch`.
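For example, a sketch of typical usage (the filenames are placeholders; `pairwise` takes a single collection and writes its results with `-o`):
sourmash scripts pairwise database.zip -o pairwise.csv --cores 4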
The `fastgather` command is a much faster version of `sourmash gather`.
`fastgather` takes a single query metagenome and a database, and outputs a CSV:
sourmash scripts fastgather query.sig.gz database.zip -o results.csv --cores 4
As of v0.9.5, `fastgather` outputs the same columns as `sourmash gather`, with only a few exceptions:
- `match_name` is output instead of `name`;
- `match_md5` is output instead of `md5`;
- `match_filename` is output instead of `filename`, and the value is different;
- `potential_false_negative` is not present in `fastgather` output.
`fastmultigather` takes a collection of query metagenomes and a collection of sketches as a database, and outputs many CSVs:
sourmash scripts fastmultigather queries.manifest.csv database.zip --cores 4
The main advantage that `fastmultigather` has over running `fastgather` on multiple queries is that `fastmultigather` only needs to load the database once for all queries, unlike `fastgather`; this can be a significant time savings for large databases!
On a database of sketches (but not on RocksDB indexes), `fastmultigather` will output two CSV files for each query: a `prefetch` file containing all overlapping matches between that query and the database, and a `gather` file containing the minimum metagenome cover for that query in the database.
The prefetch CSV will be named `{signame}.prefetch.csv`, and the gather CSV will be named `{signame}.gather.csv`. Here, `{signame}` is the name of your sourmash signature.
When searching against a RocksDB index, `fastmultigather` will output a single file containing all gather results, specified with `-o/--output`. No prefetch results will be output.
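For example, a sketch of a RocksDB-based run (the index filename is a placeholder; the index itself would be built with `sourmash scripts index`, described below):
sourmash scripts fastmultigather queries.manifest.csv database.rocksdb -o results.csv --cores 4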
`fastmultigather` gather CSVs provide the same columns as `fastgather`, above.
Warning: At the moment, if two different queries have the same `{signame}`, the CSVs for one of the queries will be overwritten by the other query. The behavior here is undefined in practice because of multithreading: we don't know which queries will be executed when, or which files will be written first.
The `manysearch` command compares one or more collections of query sketches, and one or more collections of subject sketches. It is the core command we use for searching petabase-scale databases of metagenomes for contained genomes.
`manysearch` takes two collections as input, and outputs a CSV:
sourmash scripts manysearch queries.zip metagenomes.manifest.csv -o results.csv
The results file here, `results.csv`, will have 8 columns: `query` and `query_md5`, `match` and `match_md5`, and `containment`, `jaccard`, `max_containment`, and `intersect_hashes`.
The `cluster` command conducts graph-based clustering via the sequence similarity measures in `pairwise` or `multisearch` outputs. It is a new command and we are exploring its utility.
`cluster` takes the CSV output of `pairwise` or `multisearch` as input, and outputs two CSVs:
- `-o`, `--output` will contain the names of the clusters and the `ident` of each sequence included in the cluster (e.g. `Component_1, name1;name2`):

cluster,nodes
Component_1,name1;name2;name3
Component_2,name4
- `--cluster-sizes` will contain information on cluster size, with a count of the number of clusters of each size. For the two clusters above, the counts would look like this:
cluster_size,count
3,1
1,1
`cluster` takes a `--similarity_column` argument to specify which similarity column to use, with the following choices: `containment`, `max_containment`, `jaccard`, `average_containment_ani`, and `maximum_containment_ani`. All values should be input as fractions (e.g. 0.9 for 90%).
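Putting this together, a sketch of a typical `cluster` invocation might look like this (filenames are placeholders; a similarity threshold option may also be needed, so check `sourmash scripts cluster --help` for the exact options):
sourmash scripts cluster pairwise.csv -o clusters.csv --cluster-sizes cluster_sizes.csv --similarity_column average_containment_ani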
Each command does things slightly differently, with implications for CPU and disk load. You can measure threading efficiency with `/usr/bin/time -v` on Linux systems, and disk load by the number of complaints received when running.
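For example, to measure peak memory and elapsed time for a `fastgather` run (reusing the command from above):
/usr/bin/time -v sourmash scripts fastgather query.sig.gz database.zip -o results.csv --cores 4
The "Maximum resident set size" and "Elapsed (wall clock) time" lines in the resulting report are the most useful numbers here.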
`manysketch` loads one sequence file from disk per thread and sketches it using all signature params simultaneously.
`manysearch` loads all the queries at the beginning, and then loads one database sketch from disk per thread. The compute-per-database-sketch is dominated by I/O, so your number of threads should be chosen with care for disk load. We typically limit it to `-c 32` for shared disks.
`multisearch` loads all the queries and database sketches once, at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores, so 128 threads or more should work fine! Zipfiles should work well.
`pairwise` acts just like `multisearch`, but only loads one file (and then does all comparisons between all pairs within that file).
Like `multisearch` and `pairwise`, `fastgather` loads everything at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores, so 128 threads or more should work fine! We suggest using zipfiles for the database.
`fastmultigather` loads the entire database once, and then loads one query from disk per thread. The compute-per-query can be significant, though, so multithreading efficiency here is less dependent on I/O and the disk is less likely to be saturated with many threads. We suggest limiting threads to between 32 and 64 to decrease shared disk load.
`cluster` loads the entire file multithreaded, and then populates the graph sequentially.
The command `sourmash scripts index` makes an on-disk inverted index for low-memory, fast search. Indexing takes a while, but then search takes fewer resources.
Currently only `fastmultigather` and `manysearch` can use this kind of index. `fastmultigather` with this index produces a complete set of `sourmash gather` columns.
We suggest using the extension `.rocksdb` for these databases, as we use RocksDB for the underlying database storage mechanism.
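For example, once an index has been built (the filenames here are placeholders), `manysearch` can be pointed directly at it:
sourmash scripts manysearch queries.zip metagenomes.rocksdb -o results.csv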