From 42b2aae9c7f0481b5ead88348936e54401f193ee Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 12 Aug 2024 15:56:53 -0700 Subject: [PATCH] MRG: update documentation for RocksDB `index` and internal/external storage, + miscellaneous improvements (#416) * buncha doc updates * moar * upd manysketch * update with pretty-print * minor updates * fix typo --- doc/README.md | 233 +++++++++++++++--- .../sourmash_plugin_branchwater/__init__.py | 2 +- 2 files changed, 198 insertions(+), 37 deletions(-) diff --git a/doc/README.md b/doc/README.md index 27f226c4..d81f6448 100644 --- a/doc/README.md +++ b/doc/README.md @@ -9,28 +9,43 @@ | `multisearch` | Multithreaded comparison of multiple sketches, in memory | [link](#Running-multisearch-and-pairwise) | `pairwise` | Multithreaded pairwise comparison of multiple sketches, in memory | [link](#Running-multisearch-and-pairwise) | `cluster` | cluster sequences based on similarity data from `pairwise` or `multisearch` | [link](#Running-cluster) - -This repository implements multithreaded plugins for [sourmash](https://sourmash.readthedocs.io/) that provide very fast implementations of `sketch`, `search`, and `gather`. These commands are typically hundreds to thousands of times faster, and 10-50x lower memory, than the current sourmash code. For example, a `gather` of SRR606249 with sourmash v4.8.6 against GTDB rs214 takes 40 minutes and 14 GB of RAM, while `fastgather` with 64 cores takes only 2 minutes and 2 GB of RAM. - -The main *drawback* to these plugin commands is that their inputs and outputs are different (and sometimes not as rich as) the native sourmash commands. In particular, this means that your input files may need to be prepared differently, and the output may need to be processed differently. 
+| `index` | build a RocksDB inverted index for efficient containment queries | [link](#Running-index) + +This repository implements multithreaded plugins for +[sourmash](https://sourmash.readthedocs.io/) that provide very fast +implementations of `sketch`, `search`, and `gather`. These commands +are typically hundreds to thousands of times faster, and 10-50x lower +memory, than the current sourmash code. For example, a `gather` of +SRR606249 with sourmash v4.8.6 against GTDB rs214 takes 40 minutes and +14 GB of RAM, while `fastgather` with 64 cores takes only 2 minutes +and 2 GB of RAM. + +The main *drawback* to these plugin commands is that their inputs are +more restricted than the native sourmash commands, and in some cases +their outputs are in a different format. This means that your input +files may need to be prepared differently, and the output may need to +be processed differently. The plugin commands are also a bit less +user friendly, because (for now) we're more focused on speed than +polish and user experience. **Note:** As of v0.9.5, the outputs of `fastgather` and `fastmultigather` almost completely match the output of `sourmash gather`; see below for details. ## Input file formats -sourmash supports a variety of different storage formats for sketches (see [sourmash docs](https://sourmash.readthedocs.io/en/latest/command-line.html#choosing-signature-output-formats)), and the branchwater plugin works with some (but not all) of them. Branchwater _also_ supports an additional database type, a RocksDB-based inverted index, that is not yet supported by sourmash (through v4.8.9). +sourmash supports a variety of different storage formats for sketches (see [sourmash docs](https://sourmash.readthedocs.io/en/latest/command-line.html#choosing-signature-output-formats)), and the branchwater plugin works with some (but not all) of them. 
Branchwater _also_ supports an additional database type, a RocksDB-based inverted index, that is not (yet) supported natively by sourmash (through v4.8.11). -| command | query input | database format | +| command | command input | database format | | -------- | -------- | -------- | | `manysketch` | CSV with input fasta/fastq paths (details below) | _produces_ Zip database | -| `gather` | Single metagenome in sig, zip, or fromfile | Zip or fromfile | -| `fastmultigather` | Multiple metagenomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index | -| `manysearch` | Multiple genomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index | -| `multisearch` | Multiple sketches in sig, zip, or fromfile | Multiple sketches in sig, zip, or fromfile | -| `pairwise` | Multiple sketches in sig, zip, or fromfile | N/A | +| `gather` | Single metagenome in sig, zip, or pathlist | Zip or pathlist | +| `fastmultigather` | Multiple metagenomes in sig, zip, or pathlist | Zip, pathlist, or rocksdb index | +| `manysearch` | Multiple genomes in sig, zip, or pathlist | Zip, pathlist, or rocksdb index | +| `multisearch` | Multiple sketches in sig, zip, or pathlist | Multiple sketches in sig, zip, or pathlist | +| `pairwise` | Multiple sketches in sig, zip, or pathlist | N/A | | `cluster`| Output from `pairwise` or `multisearch`| N/A | +| `index` | Multiple sketches in sig, zip, or pathlist | N/A | ### Using zipfiles @@ -70,7 +85,7 @@ And the answer is: manifests. Manifests are a sourmash filetype that contains in The branchwater plugin supports manifest CSVs. These can be created from lists of sketches by using `sourmash sig collect` or `sourmash sig manifest`; for example, ``` -sourmash sig manifest -o manifest.csv +sourmash sig manifest -o manifest.csv ``` will create a manifest CSV from a list of sketches. --> @@ -79,18 +94,18 @@ will create a manifest CSV from a list of sketches. 
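As a concrete illustration of building a zipfile collection of sketches, one option (a sketch, assuming sourmash v4.x is installed; the directory and file names here are placeholders) is `sourmash sig cat`:

```
# combine individual .sig.gz sketches into one zip collection;
# sourmash infers the zip format from the -o extension
sourmash sig cat sketches/*.sig.gz -o collection.zip
```

The resulting `collection.zip` can then be used as a query or database collection with the plugin commands.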
The branchwater plugin also supports a database type that is not yet supported by sourmash: inverted indexes stored in a RocksDB database. These indexes provide fast and low-memory lookups when searching very large datasets, and are used for the branchwater petabase scale search hosted at [branchwater.sourmash.bio](https://branchwater.sourmash.bio). -Some commands - `fastmultigather` and `manysearch` - support using these RocksDB-based inverted indexes. They can be created by running `sourmash scripts index`. +Some commands - `fastmultigather` and `manysearch` - support using these RocksDB-based inverted indexes. They can be created by running `sourmash scripts index`. See [the `index` documentation, below](#Running-index). -### Using "fromfiles" +### Using "pathlists" - + -You can make a fromfile by listing a collection of .sig.gz files like so: +You can make a pathlist by listing a collection of .sig.gz files like so: ``` find /path/to/directory/ -name "*.sig.gz" -type f > directory.txt ``` -When using a fromfile for search, we load all signatures into memory at the start in order to generate a manifest. To avoid memory issues, the signatures are not kept in memory, but instead re-loaded as described below for each command (see: Notes on concurrency and efficiency). This makes using fromfiles less efficient than `zip` files (as of v0.9.0). +When using a pathlist for search, we load all signatures into memory at the start in order to generate a manifest. To avoid memory issues, the signatures are not kept in memory, but instead re-loaded as described below for each command (see: Notes on concurrency and efficiency). This makes using pathlists less efficient than `zip` files (as of v0.9.0). @@ -99,9 +114,19 @@ When using a fromfile for search, we load all signatures into memory at the star ### Running `manysketch` -The `manysketch` command sketches one or more FASTA/FASTQ files into a zipped sourmash signature collection (`zip`). 
`manysketch` uses one thread per input file, so it can (very) efficiently sketch many files at once; and, because sequence file parsing is entirely implemented in Rust, it is much, _much_ faster than `sourmash sketch` for large FASTQ files. However, it does not currently support translation, i.e. protein signature generation from DNA FASTA. +The `manysketch` command sketches a list of FASTA/FASTQ files into a +zipped sourmash signature collection (`zip`). `manysketch` uses one +thread per input file, so it can (very) efficiently sketch many files +at once; and, because sequence file parsing is entirely implemented in +Rust, it is much, _much_ faster than `sourmash sketch` for large FASTQ +files. However, it does not currently support translation, +i.e. protein signature generation from DNA FASTA. -#### specifying input FASTA +`manysketch` can be used to sketch collections of genomes, pairs of R1/R2 +files, and more flexible collections based on prefix - which approach is +used depends on the columns provided to `manysketch` in the input CSV file. + +#### Specifying input FASTA To run `manysketch`, you need to build a text file list of FASTA/FASTQ files (see `manysketch.csv` example, below). @@ -174,6 +199,10 @@ The `pairwise` command does the same comparisons as `multisearch` but takes only a single collection of sketches, for which it calculates all the pairwise comparisons. Since the comparisons are symmetric, it is approximately twice as fast as `multisearch`. +The `-t/--threshold` for `multisearch` and `pairwise` applies to the +containment of query-in-target and defaults to 0.01. To report +_any_ overlap between two sketches, set the threshold to 0. + ### Running `fastgather` The `fastgather` command is a much faster version of `sourmash gather`. 
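A minimal `fastgather` invocation might look like the following (file names are placeholders; this assumes the plugin is installed so that `sourmash scripts fastgather` is available):

```
# decompose one metagenome sketch against a zip database of genome
# sketches, writing gather-style results to CSV; -c sets the thread count
sourmash scripts fastgather query-metagenome.sig.gz gtdb-rs214-k31.zip \
    -o gather-results.csv -c 4
```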
@@ -221,10 +250,33 @@ sourmash scripts manysearch queries.zip metagenomes.manifest.csv -o results.csv ``` -The results file here, `query.x.gtdb-reps.csv`, will have the following columns: `query`, `query_md5`, `match_name`, `match_md5`, `containment`, `jaccard`, `max_containment`, `intersect_hashes`, `query_containment_ani`. +The results file here, `query.x.gtdb-reps.csv`, will have the +following columns: `query`, `query_md5`, `match_name`, `match_md5`, +`containment`, `jaccard`, `max_containment`, `intersect_hashes`, +`query_containment_ani`. -If you run `manysearch` _without_ using a rocksdb database, the results file will also have the following columns: `average_abund`, `median_abund`, `std_abund`, `match_containment_ani`, `average_containment_ani`, `max_containment_ani`. If the query sketches were not built with abundance tracking enabled, `average_abund` and `median_abund` will default to `1.0`; `std_abund` will default to `0.0`. +If you run `manysearch` _without_ using a RocksDB database (that is, +against regular sketches), the results file will also have the +following columns: `match_containment_ani`, +`average_containment_ani`, and `max_containment_ani`. +Finally, if using sketches that have abundance information, the +results file will also contain the following columns: `average_abund`, +`median_abund`, `std_abund`, `n_weighted_found`, and `total_weighted_hashes`. + +See +[the prefetch CSV output column documentation](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html#appendix-e-prefetch-csv-output-columns) +for information on these various columns. + +The `-t/--threshold` for `manysearch` applies to the containment of +query-in-database (e.g. genome in metagenome) and defaults to 0.01. +To report _any_ overlap between two sketches, set the threshold to 0. +(This will produce many, many results when searching a collection of +metagenomes!)
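To illustrate the threshold behavior described above, a hedged example (placeholder file names) that reports any overlap at all:

```
# -t 0 reports every query/database pair with nonzero overlap;
# expect very large output when the database is a metagenome collection
sourmash scripts manysearch queries.zip metagenomes.zip -o results.csv -t 0
```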
+ +By default, `manysearch` will display the contents of the CSV file in a +human-readable format. This can be disabled with `-N/--no-pretty-print` +when executing large searches. ### Running `cluster` @@ -250,23 +302,132 @@ cluster_size,count `cluster` takes a `--similarity_column` argument to specify which of the similarity columns, with the following choices: `containment`, `max_containment`, `jaccard`, `average_containment_ani`, `maximum_containment_ani`. All values should be input as fractions (e.g. 0.9 for 90%) -## Notes on concurrency and efficiency +### Running `index` + +The `index` subcommand creates a RocksDB inverted index that can be +used as a database for `manysearch` (containment queries into +mixtures) and `fastmultigather` (mixture decomposition against a +database of genomes). + +RocksDB inverted indexes support fast, low-latency, and low-memory +queries, in exchange for a time-intensive indexing step and extra disk +space. They are also used for the +[branchwater.sourmash.bio](https://branchwater.sourmash.bio/) +real-time SRA metagenome query. + +Note that RocksDB indexes do not support abundance estimation for +containment queries. + +To run `index`, provide it with multiple sketches in sig, zip, or +pathlist format, and specify the desired output directory; we suggest +using the `.rocksdb` extension for RocksDB databases, e.g. `-o +gtdb-rs214-k31.rocksdb`. + +By default, as of v0.9.7, `index` will store a copy of the sketches +along with the inverted index. This will substantially increase the +disk space required for large databases. You can provide an optional +`--no-internal-storage` to `index` to store them externally, which +reduces the disk space needed for the index. Read below for technical +details! + +#### Internal vs external storage of sketches in a RocksDB index + +(The below applies to v0.9.7 and later of the plugin; for v0.9.6 and +before, only external storage was implemented.) 
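For example, building an index with the default internal storage might look like this (file names are placeholders):

```
# build a RocksDB inverted index from a zip collection of sketches;
# with v0.9.7+ the sketches are copied into the .rocksdb directory
sourmash scripts index gtdb-rs214-k31.zip -o gtdb-rs214-k31.rocksdb
```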
+ +RocksDB indexes support containment queries (a la the +[branchwater application](https://github.com/sourmash-bio/branchwater)), +as well as `gather`-style mixture decomposition (see +[Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2)). +For this plugin, the `manysearch` command supports a RocksDB index for +the database for containment queries, and `fastmultigather` can use a +RocksDB index for the database of genomes. + +RocksDB indexes contain references to the sketches used to construct +the index. If `--internal-storage` is set (which is the default), a +copy of the sketches is stored within the RocksDB database directory; +if `--no-internal-storage` is provided, then the references point to +the original source sketches used to construct the database, wherever +they reside on your disk. + +The sketches *are not used* by `manysearch`, but *are used* by +`fastmultigather`: with v0.9.6 and later, you'll get an error if you +run `fastmultigather` against a RocksDB index where the sketches +cannot be loaded. + +What this means is therefore a bit complicated, but boils down to +the following two approaches: + +1. The safest thing to do is build a RocksDB index and use internal +storage (the default). This will consume more disk space but your +RocksDB database will always be usable for both `manysearch` and +`fastmultigather`, as well as the branchwater app. +2. If you want to avoid storing duplicate copies of your sketches, +then specify `--no-internal-storage` and provide a stable absolute +path to the source sketches. This will again support both +`manysearch` and `fastmultigather`, as well as the branchwater app. +If the source sketches later become unavailable, `fastmultigather` +will stop working (although `manysearch` and the branchwater app +should be fine). + +You should (for the moment) avoid specifying relative paths to the +sketches when running `index`. 
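As a sketch of the external-storage approach (all paths are placeholders; note the stable absolute path to the source sketches):

```
# build an index without internal storage; the index will refer to the
# source sketches at their absolute path, so keep that path stable
sourmash scripts index /data/sketches/gtdb-rs214-k31.zip \
    --no-internal-storage -o /data/indexes/gtdb-rs214-k31.rocksdb
```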
Follow +[sourmash_plugin_branchwater#415](https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/415) +if better support for relative paths is of interest! + +#### Links and more materials + +Note that RocksDB indexes are implemented in the core +[sourmash Rust library](https://crates.io/crates/sourmash), and +used in downstream software packages (this plugin, and +[the branchwater application code](https://github.com/sourmash-bio/branchwater)). The above documentation applies to sourmash core v0.15.0. -Each command does things slightly differently, with implications for CPU and disk load. You can measure threading efficiency with `/usr/bin/time -v` on Linux systems, and disk load by number of complaints received when running. - -`manysketch` loads one sequence file from disk per thread and sketches it using all signature params simultaneously. - -`manysearch` loads all the queries at the beginning, and then loads one database sketch from disk per thread. The compute-per-database-sketch is dominated by I/O. So your number of threads should be chosen with care for disk load. We typically limit it to `-c 32` for shared disks. - -`multisearch` loads all the queries and database sketches once, at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine! Zipfiles should work well. - -`pairwise` acts just like `multisearch`, but only loads one file (and then does all comparisons between all pairs within that file). - -Like `multisearch` and `pairwise`, `fastgather` loads everything at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine! We suggest using zipfiles for the database.
- -`fastmultigather` loads the entire database once, and then loads one query from disk per thread. The compute-per-query can be significant, though, so multithreading efficiency here is less dependent on I/O and the disk is less likely to be saturated with many threads. We suggest limiting threads to between 32 and 64 to decrease shared disk load. +## Notes on concurrency and efficiency -`cluster` loads the entire file multithreaded, and then populates the graph sequentially. +Each command does things somewhat differently, with implications for +CPU and disk load; moreover, the performance will vary depending on +the database format. You can measure threading efficiency with +`/usr/bin/time -v` on Linux systems, and disk load by number of +complaints received when running. Your best bet is to +[just ask the team for suggestions](https://github.com/dib-lab/sourmash/issues), +but you can lightly skim the docs below and play with some small data +sets, too! + +--- + +`manysketch` loads one sequence file from disk per thread and sketches +it using all signature params simultaneously. + +`manysearch` loads all the queries at the beginning, and then loads +one database sketch from disk per thread. The +compute-per-database-sketch is dominated by I/O. So your number of +threads should be chosen with care for disk load. We typically limit +it to `-c 32` for shared disks. + +`multisearch` loads all the queries and database sketches once, at the +beginning, and then uses multithreading to search across all matching +sequences. For large databases it is extremely efficient at using all +available cores. So 128 threads or more should work fine! +Zipfiles should work well. + +`pairwise` acts just like `multisearch`, but only loads one file (and +then does all comparisons between all pairs within that file). + +Like `multisearch` and `pairwise`, `fastgather` loads everything at +the beginning, and then uses multithreading to search across all +matching sequences. 
For large databases it is extremely efficient at +using all available cores. So 128 threads or more should work fine! We +suggest using zipfiles for the database. + +`fastmultigather` loads the entire database once, and then loads one +query from disk per thread. The compute-per-query can be significant, +though, so multithreading efficiency here is less dependent on I/O and +the disk is less likely to be saturated with many threads. We suggest +limiting threads to between 32 and 64 to decrease shared disk load. + +`cluster` loads the entire file multithreaded, and then populates the +graph sequentially. ## Appendix 1 - `index` to create a low-memory index diff --git a/src/python/sourmash_plugin_branchwater/__init__.py b/src/python/sourmash_plugin_branchwater/__init__.py index a9e821a4..4c923e34 100755 --- a/src/python/sourmash_plugin_branchwater/__init__.py +++ b/src/python/sourmash_plugin_branchwater/__init__.py @@ -259,7 +259,7 @@ def __init__(self, p): p.add_argument('-o', '--output', required=True, help='CSV output file for matches') p.add_argument('-t', '--threshold', default=0.01, type=float, - help='containment threshold for reporting matches') + help='containment threshold for reporting matches (default: 0.01)') p.add_argument('-k', '--ksize', default=31, type=int, help='k-mer size at which to select sketches') p.add_argument('-s', '--scaled', default=1000, type=int,