MRG: update documentation for RocksDB index and internal/external storage, + miscellaneous improvements #416

Merged: 8 commits, Aug 12, 2024
233 changes: 197 additions & 36 deletions doc/README.md
@@ -9,28 +9,43 @@
| `multisearch` | Multithreaded comparison of multiple sketches, in memory | [link](#Running-multisearch-and-pairwise)
| `pairwise` | Multithreaded pairwise comparison of multiple sketches, in memory | [link](#Running-multisearch-and-pairwise)
| `cluster` | cluster sequences based on similarity data from `pairwise` or `multisearch` | [link](#Running-cluster)

This repository implements multithreaded plugins for [sourmash](https://sourmash.readthedocs.io/) that provide very fast implementations of `sketch`, `search`, and `gather`. These commands are typically hundreds to thousands of times faster, and 10-50x lower memory, than the current sourmash code. For example, a `gather` of SRR606249 with sourmash v4.8.6 against GTDB rs214 takes 40 minutes and 14 GB of RAM, while `fastgather` with 64 cores takes only 2 minutes and 2 GB of RAM.

The main *drawback* to these plugin commands is that their inputs and outputs are different (and sometimes not as rich as) the native sourmash commands. In particular, this means that your input files may need to be prepared differently, and the output may need to be processed differently.
| `index` | build a RocksDB inverted index for efficient containment queries | [link](#Running-index)

This repository implements multithreaded plugins for
[sourmash](https://sourmash.readthedocs.io/) that provide very fast
implementations of `sketch`, `search`, and `gather`. These commands
are typically hundreds to thousands of times faster, and 10-50x lower
memory, than the current sourmash code. For example, a `gather` of
SRR606249 with sourmash v4.8.6 against GTDB rs214 takes 40 minutes and
14 GB of RAM, while `fastgather` with 64 cores takes only 2 minutes
and 2 GB of RAM.

The main *drawback* to these plugin commands is that their inputs are
more restricted than the native sourmash commands, and in some cases
their outputs are in a different format. This means that your input
files may need to be prepared differently, and the output may need to
be processed differently. The plugin commands are also a bit less
user friendly, because (for now) we're more focused on speed than
polish and user experience.

**Note:** As of v0.9.5, the outputs of `fastgather` and `fastmultigather` almost completely match the output of `sourmash gather`; see below for details.

## Input file formats

sourmash supports a variety of different storage formats for sketches (see [sourmash docs](https://sourmash.readthedocs.io/en/latest/command-line.html#choosing-signature-output-formats)), and the branchwater plugin works with some (but not all) of them. Branchwater _also_ supports an additional database type, a RocksDB-based inverted index, that is not yet supported by sourmash (through v4.8.9).
sourmash supports a variety of different storage formats for sketches (see [sourmash docs](https://sourmash.readthedocs.io/en/latest/command-line.html#choosing-signature-output-formats)), and the branchwater plugin works with some (but not all) of them. Branchwater _also_ supports an additional database type, a RocksDB-based inverted index, that is not (yet) supported natively by sourmash (through v4.8.11).

<!-- **As of v0.9.0, we recommend using zip files or manifest CSVs whenever you need to provide multiple sketches.** CTB -->

| command | query input | database format |
| command | command input | database format |
| -------- | -------- | -------- |
| `manysketch` | CSV with input fasta/fastq paths (details below) | _produces_ Zip database |
| `gather` | Single metagenome in sig, zip, or fromfile | Zip or fromfile |
| `fastmultigather` | Multiple metagenomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index |
| `manysearch` | Multiple genomes in sig, zip, or fromfile | Zip, fromfile, or rocksdb index |
| `multisearch` | Multiple sketches in sig, zip, or fromfile | Multiple sketches in sig, zip, or fromfile |
| `pairwise` | Multiple sketches in sig, zip, or fromfile | N/A |
| `gather` | Single metagenome in sig, zip, or pathlist | Zip or pathlist |
| `fastmultigather` | Multiple metagenomes in sig, zip, or pathlist | Zip, pathlist, or rocksdb index |
| `manysearch` | Multiple genomes in sig, zip, or pathlist | Zip, pathlist, or rocksdb index |
| `multisearch` | Multiple sketches in sig, zip, or pathlist | Multiple sketches in sig, zip, or pathlist |
| `pairwise` | Multiple sketches in sig, zip, or pathlist | N/A |
| `cluster`| Output from `pairwise` or `multisearch`| N/A |
| `index` | Multiple sketches in sig, zip, or pathlist | N/A |

### Using zipfiles

@@ -70,7 +85,7 @@ And the answer is: manifests. Manifests are a sourmash filetype that contains in

The branchwater plugin supports manifest CSVs. These can be created from lists of sketches by using `sourmash sig collect` or `sourmash sig manifest`; for example,
```
sourmash sig manifest <from file> -o manifest.csv
sourmash sig manifest <pathlist> -o manifest.csv
```
will create a manifest CSV from a list of sketches.
-->
@@ -79,18 +94,18 @@

The branchwater plugin also supports a database type that is not yet supported by sourmash: inverted indexes stored in a RocksDB database. These indexes provide fast and low-memory lookups when searching very large datasets, and are used for the branchwater petabase scale search hosted at [branchwater.sourmash.bio](https://branchwater.sourmash.bio).

Some commands - `fastmultigather` and `manysearch` - support using these RocksDB-based inverted indexes. They can be created by running `sourmash scripts index`.
Some commands - `fastmultigather` and `manysearch` - support using these RocksDB-based inverted indexes. They can be created by running `sourmash scripts index`. See [the `index` documentation, below](#Running-index).

### Using "fromfiles"
### Using "pathlists"

<!-- **Note: We no longer recommend using "fromfiles". Use zip files or manifests instead.** CTB -->
<!-- **Note: We no longer recommend using "pathlists". Use zip files or manifests instead.** CTB -->

You can make a fromfile by listing a collection of .sig.gz files like so:
You can make a pathlist by listing a collection of .sig.gz files like so:
```
find /path/to/directory/ -name "*.sig.gz" -type f > directory.txt
```

When using a fromfile for search, we load all signatures into memory at the start in order to generate a manifest. To avoid memory issues, the signatures are not kept in memory, but instead re-loaded as described below for each command (see: Notes on concurrency and efficiency). This makes using fromfiles less efficient than `zip` files (as of v0.9.0).
When using a pathlist for search, we load all signatures into memory at the start in order to generate a manifest. To avoid memory issues, the signatures are not kept in memory, but instead re-loaded as described below for each command (see: Notes on concurrency and efficiency). This makes using pathlists less efficient than `zip` files (as of v0.9.0).


<!-- or manifests... CTB -->
Expand All @@ -99,9 +114,19 @@ When using a fromfile for search, we load all signatures into memory at the star

### Running `manysketch`

The `manysketch` command sketches one or more FASTA/FASTQ files into a zipped sourmash signature collection (`zip`). `manysketch` uses one thread per input file, so it can (very) efficiently sketch many files at once; and, because sequence file parsing is entirely implemented in Rust, it is much, _much_ faster than `sourmash sketch` for large FASTQ files. However, it does not currently support translation, i.e. protein signature generation from DNA FASTA.
The `manysketch` command sketches a list of FASTA/FASTQ files into a
zipped sourmash signature collection (`zip`). `manysketch` uses one
thread per input file, so it can (very) efficiently sketch many files
at once; and, because sequence file parsing is entirely implemented in
Rust, it is much, _much_ faster than `sourmash sketch` for large FASTQ
files. However, it does not currently support translation,
i.e. protein signature generation from DNA FASTA.

#### specifying input FASTA
`manysketch` can be used to sketch collections of genomes, pairs of R1/R2
files, and more flexible collections based on prefix; which approach is
used depends on the columns provided to `manysketch` in the input CSV file.

#### Specifying input FASTA

To run `manysketch`, you need to build a CSV file listing your FASTA/FASTQ files (see the `manysketch.csv` example, below).
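As a quick illustration, here is a minimal Python sketch that writes such a CSV. The `name,genome_filename,protein_filename` column layout and the file names here are assumptions for illustration; check the `manysketch.csv` example below for the authoritative format.

```python
import csv

# Hypothetical three-column layout; verify against the plugin's
# manysketch.csv example before using for real.
rows = [
    {"name": "genomeA", "genome_filename": "genomeA.fa.gz", "protein_filename": ""},
    {"name": "genomeB", "genome_filename": "genomeB.fa.gz", "protein_filename": ""},
]

with open("manysketch.csv", "w", newline="") as fp:
    writer = csv.DictWriter(
        fp, fieldnames=["name", "genome_filename", "protein_filename"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

You would then point `manysketch` at the resulting `manysketch.csv`.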

@@ -174,6 +199,10 @@ The `pairwise` command does the same comparisons as `multisearch` but takes
only a single collection of sketches, for which it calculates all the pairwise comparisons. Since the comparisons are symmetric, it is approximately
twice as fast as `multisearch`.

The `-t/--threshold` for `multisearch` and `pairwise` applies to the
containment of query-in-target and defaults to 0.01. To report
_any_ overlap between two sketches, set the threshold to 0.
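For intuition about what the threshold measures: containment is the fraction of one sketch's hashes found in the other. A toy illustration with plain Python sets (not the plugin's actual FracMinHash implementation):

```python
# Toy "sketches" represented as sets of hash values.
query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
target = {6, 7, 8, 9, 10, 11, 12, 13, 14, 15}

overlap = len(query & target)
containment = overlap / len(query)                        # query-in-target: 0.5
jaccard = overlap / len(query | target)                   # symmetric: ~0.33
max_containment = overlap / min(len(query), len(target))  # 0.5
```

With the default threshold of 0.01 this pair would be reported; with a threshold of 0, any nonzero overlap at all is reported.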

### Running `fastgather`

The `fastgather` command is a much faster version of `sourmash gather`.
@@ -221,10 +250,33 @@ sourmash scripts manysearch queries.zip metagenomes.manifest.csv -o results.csv
```
<!-- We suggest using a manifest CSV for the metagenome collection. -->

The results file here, `query.x.gtdb-reps.csv`, will have the following columns: `query`, `query_md5`, `match_name`, `match_md5`, `containment`, `jaccard`, `max_containment`, `intersect_hashes`, `query_containment_ani`.
The results file here, `query.x.gtdb-reps.csv`, will have the
following columns: `query`, `query_md5`, `match_name`, `match_md5`,
`containment`, `jaccard`, `max_containment`, `intersect_hashes`,
`query_containment_ani`.

If you run `manysearch` _without_ using a rocksdb database, the results file will also have the following columns: `average_abund`, `median_abund`, `std_abund`, `match_containment_ani`, `average_containment_ani`, `max_containment_ani`. If the query sketches were not built with abundance tracking enabled, `average_abund` and `median_abund` will default to `1.0`; `std_abund` will default to `0.0`.
If you run `manysearch` _without_ using a RocksDB database (that is,
against regular sketches), the results file will also have the
following columns: `match_containment_ani`,
`average_containment_ani`, and `max_containment_ani`.

Finally, if using sketches that have abundance information, the
results file will also contain the following columns: `average_abund`,
`median_abund`, `std_abund`, `n_weighted_found`, and `total_weighted_hashes`.

See
[the prefetch CSV output column documentation](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html#appendix-e-prefetch-csv-output-columns)
for information on these various columns.
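The `*_containment_ani` columns are point estimates computed from containment and the k-mer size; to first order, sourmash's estimate is the k-th root of the containment. A minimal sketch of that relationship (omitting sourmash's confidence-interval machinery):

```python
import math

def containment_to_ani(containment: float, ksize: int) -> float:
    """Point-estimate ANI from containment: ANI ~= containment ** (1 / ksize)."""
    if containment <= 0.0:
        return 0.0
    return math.exp(math.log(containment) / ksize)

# A 50% containment at k=31 already corresponds to ~97.8% ANI.
print(round(containment_to_ani(0.5, 31), 4))  # 0.9779
```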

The `-t/--threshold` for `manysearch` applies to the containment of
query-in-database (e.g. genome in metagenome) and defaults to 0.01.
To report _any_ overlap between two sketches, set the threshold to 0.
(This will produce many, many results when searching a collection of
metagenomes!)

By default, `manysearch` will display the contents of the CSV file in a
human-readable format. This can be disabled with `-N/--no-pretty-print`
when executing large searches.

### Running `cluster`

@@ -250,23 +302,132 @@ cluster_size,count

`cluster` takes a `--similarity_column` argument to specify which similarity column to use, with the following choices: `containment`, `max_containment`, `jaccard`, `average_containment_ani`, `maximum_containment_ani`. All values should be input as fractions (e.g. 0.9 for 90%).
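Conceptually, `cluster` does single-linkage-style clustering: each pair at or above the threshold becomes a graph edge, and clusters are the connected components. A minimal union-find sketch over hypothetical `pairwise`-style rows (an illustration, not the plugin's actual implementation):

```python
def cluster_components(edges, threshold):
    """Union-find over (a, b, similarity) rows; components at >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b, sim in edges:
        find(a), find(b)  # track singletons even if never merged
        if sim >= threshold:
            union(a, b)

    comps = {}
    for node in parent:
        comps.setdefault(find(node), set()).add(node)
    return sorted(comps.values(), key=len, reverse=True)

rows = [("s1", "s2", 0.95), ("s2", "s3", 0.92), ("s3", "s4", 0.10)]
# Two clusters at threshold 0.9: {s1, s2, s3} and the singleton {s4}.
print(cluster_components(rows, 0.9))
```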

## Notes on concurrency and efficiency
### Running `index`

The `index` subcommand creates a RocksDB inverted index that can be
used as a database for `manysearch` (containment queries into
mixtures) and `fastmultigather` (mixture decomposition against a
database of genomes).

RocksDB inverted indexes support fast, low-latency, and low-memory
queries, in exchange for a time-intensive indexing step and extra disk
space. They are also used for the
[branchwater.sourmash.bio](https://branchwater.sourmash.bio/)
real-time SRA metagenome query.

Note that RocksDB indexes do not support abundance estimation for
containment queries.

To run `index`, provide it with multiple sketches in sig, zip, or
pathlist format, and specify the desired output directory; we suggest
using the `.rocksdb` extension for RocksDB databases, e.g. `-o
gtdb-rs214-k31.rocksdb`.

By default, as of v0.9.7, `index` will store a copy of the sketches
along with the inverted index. This substantially increases the
disk space required for large databases. You can instead pass
`--no-internal-storage` to `index` to keep the sketches external, which
reduces the disk space needed for the index. Read below for technical
details!

#### Internal vs external storage of sketches in a RocksDB index

(The below applies to v0.9.7 and later of the plugin; for v0.9.6 and
before, only external storage was implemented.)

RocksDB indexes support containment queries (a la the
[branchwater application](https://github.com/sourmash-bio/branchwater)),
as well as `gather`-style mixture decomposition (see
[Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2)).
For this plugin, the `manysearch` command supports a RocksDB index for
the database for containment queries, and `fastmultigather` can use a
RocksDB index for the database of genomes.

RocksDB indexes contain references to the sketches used to construct
the index. If `--internal-storage` is set (which is the default), a
copy of the sketches is stored within the RocksDB database directory;
if `--no-internal-storage` is provided, then the references point to
the original source sketches used to construct the database, wherever
they reside on your disk.

The sketches *are not used* by `manysearch`, but *are used* by
`fastmultigather`: with v0.9.6 and later, you'll get an error if you
run `fastmultigather` against a RocksDB index where the sketches
cannot be loaded.

What this means in practice boils down to
the following two approaches:

1. The safest thing to do is build a RocksDB index and use internal
storage (the default). This will consume more disk space but your
RocksDB database will always be usable for both `manysearch` and
`fastmultigather`, as well as the branchwater app.
2. If you want to avoid storing duplicate copies of your sketches,
then specify `--no-internal-storage` and provide a stable absolute
path to the source sketches. This will again support both
`manysearch` and `fastmultigather`, as well as the branchwater app.
If the source sketches later become unavailable, `fastmultigather`
will stop working (although `manysearch` and the branchwater app
should be fine).

You should (for the moment) avoid specifying relative paths to the
sketches when running `index`. Follow
[sourmash_plugin_branchwater#415](https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/415)
if better support for relative paths is of interest!

#### Links and more materials

Note that RocksDB indexes are implemented in the core
[sourmash Rust library](https://crates.io/crates/sourmash), and
used in downstream software packages (this plugin, and
[the branchwater application code](https://github.com/sourmash-bio/branchwater)). The above documentation applies to sourmash core v0.15.0.

Each command does things slightly differently, with implications for CPU and disk load. You can measure threading efficiency with `/usr/bin/time -v` on Linux systems, and disk load by number of complaints received when running.

`manysketch` loads one sequence file from disk per thread and sketches it using all signature params simultaneously.

`manysearch` loads all the queries at the beginning, and then loads one database sketch from disk per thread. The compute-per-database-sketch is dominated by I/O. So your number of threads should be chosen with care for disk load. We typically limit it to `-c 32` for shared disks. <!-- We suggest using a manifest CSV file for the database sketches. CTB -->

`multisearch` loads all the queries and database sketches once, at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine! Zipfiles <!-- and manifests CTB --> should work well.

`pairwise` acts just like `multisearch`, but only loads one file (and then does all comparisons between all pairs within that file).

Like `multisearch` and `pairwise`, `fastgather` loads everything at the beginning, and then uses multithreading to search across all matching sequences. For large databases it is extremely efficient at using all available cores. So 128 threads or more should work fine! We suggest using zipfiles <!-- or manifests CTB --> for the database.

`fastmultigather` loads the entire database once, and then loads one query from disk per thread. The compute-per-query can be significant, though, so multithreading efficiency here is less dependent on I/O and the disk is less likely to be saturated with many threads. We suggest limiting threads to between 32 and 64 to decrease shared disk load.
## Notes on concurrency and efficiency

`cluster` loads the entire file multithreaded, and then populates the graph sequentially.
Each command does things somewhat differently, with implications for
CPU and disk load; moreover, the performance will vary depending on
the database format. You can measure threading efficiency with
`/usr/bin/time -v` on Linux systems, and disk load by number of
complaints received when running. Your best bet is to
[just ask the team for suggestions](https://github.com/dib-lab/sourmash/issues),
but you can lightly skim the docs below and play with some small data
sets, too!

---

`manysketch` loads one sequence file from disk per thread and sketches
it using all signature params simultaneously.

`manysearch` loads all the queries at the beginning, and then loads
one database sketch from disk per thread. The
compute-per-database-sketch is dominated by I/O. So your number of
threads should be chosen with care for disk load. We typically limit
it to `-c 32` for shared disks. <!-- We suggest using a manifest CSV
file for the database sketches. CTB -->

`multisearch` loads all the queries and database sketches once, at the
beginning, and then uses multithreading to search across all matching
sequences. For large databases it is extremely efficient at using all
available cores. So 128 threads or more should work fine!
Zipfiles <!-- and manifests CTB --> should work well.

`pairwise` acts just like `multisearch`, but only loads one file (and
then does all comparisons between all pairs within that file).

Like `multisearch` and `pairwise`, `fastgather` loads everything at
the beginning, and then uses multithreading to search across all
matching sequences. For large databases it is extremely efficient at
using all available cores. So 128 threads or more should work fine! We
suggest using zipfiles <!-- or manifests CTB --> for the database.

`fastmultigather` loads the entire database once, and then loads one
query from disk per thread. The compute-per-query can be significant,
though, so multithreading efficiency here is less dependent on I/O and
the disk is less likely to be saturated with many threads. We suggest
limiting threads to between 32 and 64 to decrease shared disk load.

`cluster` loads the entire file multithreaded, and then populates the
graph sequentially.

## Appendix 1 - `index` to create a low-memory index

2 changes: 1 addition & 1 deletion src/python/sourmash_plugin_branchwater/__init__.py
Expand Up @@ -259,7 +259,7 @@ def __init__(self, p):
p.add_argument('-o', '--output', required=True,
help='CSV output file for matches')
p.add_argument('-t', '--threshold', default=0.01, type=float,
help='containment threshold for reporting matches')
help='containment threshold for reporting matches (default: 0.01)')
p.add_argument('-k', '--ksize', default=31, type=int,
help='k-mer size at which to select sketches')
p.add_argument('-s', '--scaled', default=1000, type=int,