The branchwater plugin for sourmash

| command | functionality | docs |
| ------- | ------------- | ---- |
| manysketch | Rapidly build sketches for many input files | link |
| fastgather | Multithreaded gather of one metagenome against a database | link |
| fastmultigather | Multithreaded gather of multiple metagenomes against a database | link |
| manysearch | Multithreaded containment search for many queries in many large metagenomes | link |
| multisearch | Multithreaded comparison of multiple sketches, in memory | link |
| pairwise | Multithreaded pairwise comparison of multiple sketches, in memory | link |
| cluster | Cluster sequences based on similarity data from pairwise or multisearch | link |

This repository implements multithreaded plugins for sourmash that provide very fast implementations of sketch, search, and gather. These commands are typically hundreds to thousands of times faster than the current sourmash code, and use 10-50x less memory. For example, a gather of SRR606249 with sourmash v4.8.6 against GTDB rs214 takes 40 minutes and 14 GB of RAM, while fastgather with 64 cores takes only 2 minutes and 2 GB of RAM.

The main drawback to these plugin commands is that their inputs and outputs are different from (and sometimes not as rich as) those of the native sourmash commands. In particular, your input files may need to be prepared differently, and the output may need to be processed differently.

Note: As of v0.9.5, the outputs of fastgather and fastmultigather almost completely match the output of sourmash gather; see below for details.

Input file formats

sourmash supports a variety of different storage formats for sketches (see the sourmash docs), and the branchwater plugin works with some (but not all) of them. Branchwater also supports an additional database type, a RocksDB-based inverted index, which is not yet supported by sourmash (through v4.8.9).

| command | query input | database format |
| ------- | ----------- | --------------- |
| manysketch | CSV with input fasta/fastq paths (details below) | produces Zip database |
| fastgather | Single metagenome in sig, zip, or fromfile | Zip or fromfile |
| fastmultigather | Multiple metagenomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index |
| manysearch | Multiple genomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index |
| multisearch | Multiple sketches in sig, zip, or fromfile | Multiple sketches in sig, zip, or fromfile |
| pairwise | Multiple sketches in sig, zip, or fromfile | N/A |
| cluster | Output from pairwise or multisearch | N/A |

Using zipfiles

When working with large collections of small sketches such as genomes, we suggest using zipfiles as produced by sourmash (e.g. using sourmash sig cat or manysketch). Zip files have a few nice features:

  • sketches are compressed in zip files;
  • zip files can contain many sketches, including incompatible types (e.g. multiple k-mer sizes);
  • zip files contain "manifests" listing their contents;
  • subsets of zip files can be efficiently selected and loaded depending on what is needed;
  • in particular, single sketches can be loaded on demand, supporting lower memory requirements for certain kinds of searches.

For all these reasons, zip files are the most efficient and effective basic storage type for sketches in sourmash, and as of the branchwater plugin v0.9.0, they are fully supported!

You can create zipfiles with sourmash like so:

sourmash sig cat <list of sketches> -o sigs.zip

Using RocksDB inverted indexes

The branchwater plugin also supports a database type that is not yet supported by sourmash: inverted indexes stored in a RocksDB database. These indexes provide fast and low-memory lookups when searching very large datasets, and are used for the branchwater petabase scale search hosted at branchwater.sourmash.bio.

Some commands - fastmultigather and manysearch - support using these RocksDB-based inverted indexes. The indexes can be created by running sourmash scripts index (see Appendix 1, below).

Using "fromfiles"

You can make a fromfile by listing a collection of .sig.gz files like so:

find /path/to/directory/ -name "*.sig.gz" -type f > directory.txt

When using a fromfile for search, we load all signatures into memory at the start in order to generate a manifest. To avoid memory issues, the signatures are not kept in memory, but instead re-loaded as described below for each command (see: Notes on concurrency and efficiency). This makes using fromfiles less efficient than zip files (as of v0.9.0).
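You can then pass the fromfile anywhere a database is accepted. As a minimal sketch (assuming the directory.txt built above and a hypothetical query sketch named query.sig.gz):

# directory.txt is the fromfile built above; query.sig.gz is a hypothetical query sketch
sourmash scripts fastgather query.sig.gz directory.txt -o results.csv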

Running the commands

Running manysketch

The manysketch command sketches one or more FASTA/FASTQ files into a zipped sourmash signature collection (zip). manysketch uses one thread per input file, so it can (very) efficiently sketch many files at once; and, because sequence file parsing is entirely implemented in Rust, it is much, much faster than sourmash sketch for large FASTQ files. However, it does not currently support translation, i.e. protein signature generation from DNA FASTA.

specifying input FASTA

To run manysketch, you need to build a CSV file listing your FASTA/FASTQ files (see the manysketch.csv examples below).

The following formats are accepted:

  • 3 columns: name,genome_filename,protein_filename

    genome_filename entries are considered DNA FASTA, protein_filename entries are considered protein FASTA.

  • 3 columns: name,read1,read2

    All entries are considered DNA FASTA, and both the read1 and read2 files are used as input for a single sketch named according to the name column.

  • 4 columns: name,input_moltype,prefix,exclude

    This format uses globbing to find files that match prefix but do not match exclude, so wildcards (*) are allowed in both the prefix and exclude columns. Since these are prefixes, a * is automatically appended to the end of each prefix entry when searching. See the example just below this list.
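For illustration, a hypothetical 4-column manysketch CSV might look like this (the sample names and paths are made up, and input_moltype is assumed to take values such as DNA or protein):

name,input_moltype,prefix,exclude
sample1,DNA,/data/sample1,/data/sample1_raw
sample2,DNA,/data/sample2,/data/sample2_raw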

A simple way to build a manysketch input file for a directory is this command snippet:

echo name,genome_filename,protein_filename > manysketch.csv
for i in *.fa.gz
do
    echo "$i,$i,"
done >> manysketch.csv

You can then run:

sourmash scripts manysketch manysketch.csv -o fa.zip

The output will be written to fa.zip.

You can check that all signatures were written properly with:

sourmash sig summarize fa.zip

To modify sketching parameters, use --param-str or -p and provide valid param string(s):

sourmash scripts manysketch fa.csv -o fa.zip -p k=21,k=31,k=51,scaled=1000,abund -p protein,k=10,scaled=200

See the sourmash sketch docs for more information on param strings.

singleton sketching

manysketch also supports building independent sketches for each record in a FASTA file (--singleton).

You can run:

sourmash scripts manysketch manysketch.csv -o fa.zip --singleton

The output will be written to fa.zip.

You can check that all signatures were written properly with:

sourmash sig summarize fa.zip

The number of sketches per parameter combination should equal the total number of records in all input FASTA files. The name column will not be used; instead, each sketch will be named from its FASTA record name.

Running multisearch and pairwise

The multisearch command compares one or more query genomes against one or more subject genomes. It differs from manysearch in that it loads all genomes into memory.

multisearch takes two input collections and outputs a CSV:

sourmash scripts multisearch query.sig.gz database.zip -o results.csv

The results file, results.csv, will have 8 columns: query and query_md5, match and match_md5, and containment, jaccard, max_containment, and intersect_hashes.

The pairwise command does the same comparisons as multisearch but takes only a single collection of sketches, for which it calculates all the pairwise comparisons. Since the comparisons are symmetric, it is approximately twice as fast as multisearch.
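As a minimal sketch (assuming a single, hypothetical collection of sketches named sketches.zip):

# sketches.zip is a hypothetical collection of sketches
sourmash scripts pairwise sketches.zip -o pairwise.csv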

Running fastgather

The fastgather command is a much faster version of sourmash gather.

fastgather takes a single query metagenome and a database, and outputs a CSV:

sourmash scripts fastgather query.sig.gz database.zip -o results.csv --cores 4

As of v0.9.5, fastgather outputs the same columns as sourmash gather, with only a few exceptions:

  • match_name is output instead of name;
  • match_md5 is output instead of md5;
  • match_filename is output instead of filename, and the value is different;
  • potential_false_negative is not present in fastgather output.

Running fastmultigather

fastmultigather takes a collection of query metagenomes and a collection of sketches as a database, and outputs many CSVs:

sourmash scripts fastmultigather queries.manifest.csv database.zip --cores 4

The main advantage that fastmultigather has over running fastgather on multiple queries is that fastmultigather only needs to load the database once for all queries, unlike with fastgather; this can be a significant time savings for large databases!

Output files for fastmultigather

On a database of sketches (but not on RocksDB indexes), fastmultigather will output two CSV files for each query: a prefetch file containing all overlapping matches between that query and the database, and a gather file containing the minimum metagenome cover for that query in the database.

The prefetch CSV will be named {signame}.prefetch.csv, and the gather CSV will be named {signame}.gather.csv. Here, {signame} is the name of your sourmash signature.

When searching against a RocksDB index, fastmultigather will output a single file containing all gather results, specified with -o/--output. No prefetch results will be output.
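As a minimal sketch of a RocksDB-based run (assuming a hypothetical index named database.rocksdb built with sourmash scripts index):

# database.rocksdb is a hypothetical RocksDB inverted index
sourmash scripts fastmultigather queries.zip database.rocksdb -o results.csv --cores 4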

fastmultigather gather CSVs provide the same columns as fastgather, above.

Warning: At the moment, if two different queries have the same {signame}, the CSVs for one of the queries will be overwritten by the other. The behavior here is undefined in practice because of multithreading: we don't know which query will be executed when, or which files will be written first.

Running manysearch

The manysearch command compares one or more collections of query sketches, and one or more collections of subject sketches. It is the core command we use for searching petabase-scale databases of metagenomes for contained genomes.

manysearch takes two collections as input, and outputs a CSV:

sourmash scripts manysearch queries.zip metagenomes.manifest.csv -o results.csv

The results file here, results.csv, will have 8 columns: query and query_md5, match and match_md5, and containment, jaccard, max_containment, and intersect_hashes.

Running cluster

The cluster command conducts graph-based clustering via the sequence similarity measures in pairwise or multisearch outputs. It is a new command and we are exploring its utility.

cluster takes the CSV output of pairwise or multisearch as input, and outputs two CSVs:

  1. -o, --output will contain the names of the clusters and the ident of each sequence included in each cluster (e.g. Component_1, name1;name2):

cluster,nodes
Component_1,name1;name2;name3
Component_2,name4

  2. --cluster-sizes will contain information on cluster sizes, with a count of the number of clusters of each size. For the two clusters above, the counts would look like this:

cluster_size,count
3,1
1,1

cluster takes a --similarity_column argument to specify which similarity column to use, with the following choices: containment, max_containment, jaccard, average_containment_ani, and maximum_containment_ani. All values should be input as fractions (e.g. 0.9 for 90%).
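As a minimal sketch (assuming hypothetical pairwise output in pairwise.csv; the option spellings here follow the descriptions above):

# pairwise.csv is hypothetical output from the pairwise command
sourmash scripts cluster pairwise.csv -o clusters.csv --cluster-sizes cluster_sizes.csv --similarity_column average_containment_ani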

Notes on concurrency and efficiency

Each command does things slightly differently, with implications for CPU and disk load. You can measure threading efficiency with /usr/bin/time -v on Linux systems, and disk load by the number of complaints received while running.
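For example, to time a manysearch run (a minimal sketch; any of the other commands can be timed the same way):

# same manysearch invocation as below, wrapped in GNU time
/usr/bin/time -v sourmash scripts manysearch queries.zip metagenomes.manifest.csv -o results.csv -c 32

In the /usr/bin/time -v output, "Percent of CPU this job got" reflects threading efficiency and "Maximum resident set size" reflects peak memory use.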

manysketch loads one sequence file from disk per thread and sketches it using all signature params simultaneously.

manysearch loads all the queries at the beginning, and then loads one database sketch from disk per thread. The compute per database sketch is dominated by I/O, so choose the number of threads with disk load in mind. We typically limit it to -c 32 for shared disks.

multisearch loads all the queries and database sketches once, at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine! Zipfiles should work well.

pairwise acts just like multisearch, but only loads one file (and then does all comparisons between all pairs within that file).

Like multisearch and pairwise, fastgather loads everything at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine! We suggest using zipfiles for the database.

fastmultigather loads the entire database once, and then loads one query from disk per thread. The compute-per-query can be significant, though, so multithreading efficiency here is less dependent on I/O and the disk is less likely to be saturated with many threads. We suggest limiting threads to between 32 and 64 to decrease shared disk load.

cluster loads the entire file multithreaded, and then populates the graph sequentially.

Appendix 1 - using sourmash scripts index to create a low-memory index

The command sourmash scripts index builds an on-disk inverted index for low-memory, fast search. Indexing takes a while, but subsequent searches use fewer resources.

Currently only fastmultigather and manysearch can use this kind of index.

fastmultigather with this index produces a complete set of sourmash gather columns.

We suggest using the extension .rocksdb for these databases, as we use RocksDB for the underlying database storage mechanism.
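As a minimal sketch of index creation (assuming the command takes a sketch collection and an -o output path):

# database.zip is a hypothetical sketch collection; the index is written to database.rocksdb
sourmash scripts index database.zip -o database.rocksdb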