From 256a7053c65f97875f9467667d579f1f0cef9498 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 25 Sep 2023 10:25:50 -0700 Subject: [PATCH 1/5] MRG: add `__all__` to `sig/__main__.py` (#2778) This change limits what is imported with `from sourmash.sig.__main__ import *`. Fixes https://github.com/sourmash-bio/sourmash/issues/2714 --- src/sourmash/sig/__main__.py | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 48addfe1e4..809b00ab78 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1,6 +1,27 @@ """ Command-line entry point for 'python -m sourmash.sig' """ +__all__ = ["cat", + "split", + "describe", + "manifest", + "overlap", + "merge", + "intersect", + "inflate", + "subtract", + "rename", + "extract", + "filter", + "flatten", + "downsample", + "sig_import", + "export", + "kmers", + "fileinfo", + "check", + "collect"] + import sys import csv import json From 66bc9107a31f48e392c73027b550c8019eea9ae8 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 25 Sep 2023 10:54:06 -0700 Subject: [PATCH 2/5] MRG: adjust documentation to recommend `tax` over `lca` for taxonomic analysis (#2777) This PR adds cautionary notes to the command line docs, and updates the information on classifying signatures to suggest using tax instead of LCA, and even explains why :). There is more work to be done - we need to add more tutorials, and adjust the language in classifying-signatures around gather and LCA - but this is a nice standalone PR! Fixes https://github.com/sourmash-bio/sourmash/issues/2562 Fixes https://github.com/sourmash-bio/sourmash/issues/2772 Fixes https://github.com/sourmash-bio/sourmash/issues/2773 Adds information from https://github.com/sourmash-bio/sourmash/issues/2760 Addresses https://github.com/sourmash-bio/sourmash/issues/2535 --- doc/classifying-signatures.md | 108 +++++++++++++++++++++++++++++----- doc/command-line.md | 50 ++++++++++++++-- 2 files changed, 139 insertions(+), 19 deletions(-) diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index 868ef6f7c8..ba91709d3b 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -1,5 +1,9 @@ # Classifying signatures: `search`, `gather`, and `lca` methods. +```{contents} Contents +:depth: 3 +``` + sourmash provides several different techniques for doing classification and breakdown of signatures. @@ -41,6 +45,8 @@ question to ask is the reverse: **what genomes are in my metagenome?** We have implemented two approaches in sourmash to do this. + + One approach uses taxonomic information from e.g. GenBank to classify individual k-mers, and then infers taxonomic distributions of metagenome contents from the presence of these individual @@ -65,21 +71,59 @@ Some benchmarking on CAMI suggests that `gather` is a very accurate method for doing strain-level resolution of genomes. More on that as we move forward! -## To do taxonomy, or not to do taxonomy? +## Taxonomic profiling with sourmash -By default, there is no structured taxonomic information available in -sourmash signatures or SBT databases of signatures. Generally what -this means is that you will have to provide your own mapping from a -match to some taxonomic hierarchy. This is generally the case when -you are working with lots of genomes that have no taxonomic -information. +sourmash supports two basic kinds of taxonomic profiling, under the +`lca` and `tax` modules. **As of 2023, we strongly recommend the +`tax`-based profiling approach.** -The `lca` subcommands, however, work with LCA databases, which contain -taxonomic information by construction. This is one of the main -differences between the `sourmash lca` subcommands and the basic -`sourmash search` functionality. So the `lca` subcommands will generally -output structured taxonomic information, and these are what you should look -to if you are interested in doing classification. +But first, let's back up! By default, there is no structured taxonomic +information available in sourmash signatures or collections. What +this means is that you will have to provide your own mapping from a +match to some taxonomic hierarchy. Both the `lca` and `tax` modules +support identifier-based taxonomic mappings, in which identifiers +from the signature names can be linked to the standard seven NCBI/GTDB +taxonomy ranks - superkingdom, phylum, class, order, family, genus, and +species. These are typically provided in a spreadsheet _separately_ from +the signature collection. The `tax` module also supports `lins` taxonomies, +for which we have a tutorial. + +There are several advantages that this approach affords sourmash. One +is that sourmash is not tied closely to a specific taxonomy - you can +use either GTDB or NCBI as you wish. Another advantage is that you can +create your own custom taxonomic ranks and even use them with private +databases of genomes to classify your own metagenomes. + +The main disadvantage of sourmash's approach to taxonomy is that +sourmash doesn't classify individual metagenomic reads to either a genome +or a taxon. (Note that we're not sure +this can be done robustly in practice - neither short nor long reads typically +contain enough information to uniquely identify a single genome.) If you +want to do this, we suggest running `sourmash gather` first, and then +mapping the reads to the matching genomes; then you can use the mapping +to determine which read maps to which genome. This is the approach taken by +[the genome-grist pipeline](https://dib-lab.github.io/genome-grist/). + + + +### Using `sourmash tax` to do taxonomic analysis + +We recommend using the `tax` module to do taxonomic classification of +genomes (with `tax genome`) and metagenomes (with `tax metagenome`). +The `tax` module commands operate downstream of `sourmash gather`, +which builds a minimum set cover of the query against the database - +intuitively speaking, this is the shortest possible list of genomes +that the query would map to. Then, both `tax genome` and `tax +metagenome` take the CSV output of `sourmash gather` and produce +taxonomic profiles. (You can read more about minimum set covers +in +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2).) + +The `tax metagenome` approach was benchmarked in +[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., 2022](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0) +and appears to be both very accurate and very sensitive, unless you're +using Nanopore data or other data types that have a high sequencing +error rate. It's important to note that taxonomy based on k-mers is very, very specific and if you get a match, it's pretty reliable. On the @@ -88,6 +132,40 @@ to evolutionary divergence, so if you don't get a match it may only mean that the specific species or genus you're searching for isn't in the database. +### Using `sourmash lca` to do taxonomic analysis + +The `sourmash lca` module supports taxonomic classification using +single hashes, corresponding to single k-mers, in an approach inspired +by Kraken. Briefly, you first build an LCA database using `lca index`, +which takes a taxonomy spreadsheet and a collection of sketches. Then, +you can use `lca classify` to classify single-genome sketches or +`lca summarize` to classify metagenomes. + +The `lca` approach is not published anywhere, but we're happy to discuss +it in more detail; just [post to the issue tracker](https://github.com/sourmash-bio/sourmash/issues). + +While we do not recommend the `lca` approach for general taxonomic +classification purposes (see below!), it remains useful for certain +kinds of diagnostic evaluation of sequences, so we are leaving the +functionality in sourmash. + +### `sourmash tax` vs `sourmash lca` + +Why do we recommend using the `tax` module over `lca`? `sourmash lca` +was the first implementation in sourmash, and over the years we've +found that it is prone to false positives: that is, individual k-mers +are very sensitive but are often misassigned to higher taxonomic ranks +than they need to be, either because of contamination in the reference +database or because the taxonomy is not based directly on genome +similarity. Instead of using single k-mers, `sourmash gather` estimates +the best matching genome based on combinations of k-mers, which is much +more specific than the LCA approach; only then is a taxonomy assigned +using `sourmash tax`. + +The bottom line is that in our experience, `sourmash tax` is as +sensitive as `lca`, and a lot more specific. Please let us know if you +discover differently! + ## Abundance weighting By default, sourmash tracks k-mer presence, *not* their abundance. The @@ -166,9 +244,9 @@ an abundance-weighted query, and automatically apply `--ignore-abundance`. As of v4.4, `sourmash` can estimate Average Nucleotide Identity (ANI) between two FracMinHash ("scaled") sketches. `sourmash compare` can now produce a matrix of ANI values estimated from Jaccard, Containment, -or Max Containment by specifiing `--ani` (optionally along with search type, +or Max Containment by specifying `--ani` (optionally along with search type, e.g. `--containment`). `sourmash search`, `sourmash prefetch`, and -`sourmash gather` will now output ANI estimates to output csvs. +`sourmash gather` will now output ANI estimates to output CSVs. Note that while ANI can be estimated from either the Jaccard Index or the Containment Index, ANI from Containment is preferable (more accurate). diff --git a/doc/command-line.md b/doc/command-line.md index 12f6a9005c..621c20a8a1 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -97,13 +97,21 @@ information; these are grouped under the `sourmash tax` and `sourmash lca` commands: +```{attention} + +We do not recommend using the `lca` subcommands for taxonomic analysis +any more; please use `sourmash tax` instead. See +[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash) +for more information. +``` + * `lca classify` classifies many signatures against an LCA database. * `lca summarize` summarizes the content of metagenomes using an LCA database. * `lca index` creates a database for use with LCA subcommands. * `lca rankinfo` summarizes the content of a database. * `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`. -> See [the LCA tutorial](tutorials-lca.md) for a +See [the LCA tutorial](tutorials-lca.md) for a walkthrough of some of these commands. Finally, there are a number of utility and information commands: @@ -484,8 +492,14 @@ signatures, rather than all the signatures in the database. ## `sourmash tax` subcommands for integrating taxonomic information into gather results +The `sourmash tax` subcommands support taxonomic analysis of genomes +and taxonomic profiling of metagenomes. +See +[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash) +for more information. + The sourmash `tax` or `taxonomy` commands integrate taxonomic - information into the results of `sourmash gather`. All `tax` commands + information with the results of `sourmash gather`. All `tax` commands require one or more properly formatted `taxonomy` files where the identifiers correspond to those in the database(s) used for `gather`. Note that if using multiple databases, the `gather` needs @@ -1075,8 +1089,7 @@ and 120,757 within the `p__Proteobacteria`. ## `sourmash lca` subcommands for in-memory taxonomy integration These commands use LCA databases (created with `lca index`, below, or -prepared databases such as -[genbank-k31.lca.json.gz](databases.md)). +prepared databases such as [genbank-k31.lca.json.gz](databases.md)). ### `sourmash lca classify` - classify a genome using an LCA database @@ -1084,6 +1097,13 @@ prepared databases such as list of LCA DBs. It is meant for classifying metagenome-assembled genome bins (MAGs) and single-cell genomes (SAGs). +```{attention} +We no longer recommend using `sourmash lca` for taxonomic analysis; +please use `sourmash tax` instead. See +[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash) +for more information. +``` + Usage: ``` @@ -1132,6 +1152,21 @@ number of additional k-mers in the input signature were classified as taxonomic assignment below genus *Shewanella* and it would report a status of `disagree` with the genus-level assignment of *Shewanella*; species level assignments would not be reported. +Here, the assigned rank is the rank immediately *above* where there is +a taxonomic disagreement, and the taxid & lineage refer to the name at +that rank (the least-common-ancestor at which an assignment can be +made). + +For another example, if you saw this line in the CSV file: + +``` +TARA_ASW_MAG_00029,1224,disagree,phylum,Bacteria;Proteobacteria +``` + +you would know that TARA_ASW_MAG_00029 has k-mers that are shared +between different orders: 'Pseudomonadales' and +'Rhodobacterales'. Therefore, the classifier status is `disagree`, and +the classified taxid is at rank `phylum` - just above `order`. (This is the approach that Kraken and other lowest common ancestor implementations use, we believe.) @@ -1151,6 +1186,13 @@ exploring metagenomes and metagenome-assembled genome bins. that output percentages are weighted by the number of times a k-mer is seen; this can be turned off with `--ignore-abundance`. +```{attention} +We no longer recommend using `sourmash lca` for taxonomic analysis; +please use `sourmash tax` instead. See +[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash) +for more information. +``` + Usage: ``` From 35e5fb7a4b5c79787e2f65b171941ea50d62250d Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 25 Sep 2023 15:53:34 -0700 Subject: [PATCH 3/5] Bump rayon from 1.7.0 to 1.8.0 (#2782) Bumps [rayon](https://github.com/rayon-rs/rayon) from 1.7.0 to 1.8.0.
Changelog

Sourced from rayon's changelog.

Release rayon 1.8.0 / rayon-core 1.12.0 (2023-09-20)

  • The minimum supported rustc is now 1.63.
  • Added ThreadPoolBuilder::use_current_thread to use the builder thread as part of the new thread pool. That thread does not run the pool's main loop, but it may participate in work-stealing if it yields to rayon in some way.
  • Implemented FromParallelIterator<T> for Box<[T]>, Rc<[T]>, and Arc<[T]>, as well as FromParallelIterator<Box<str>> and ParallelExtend<Box<str>> for String.
  • ThreadPoolBuilder::build_scoped now uses std::thread::scope.
  • The default number of threads is now determined using std::thread::available_parallelism instead of the num_cpus crate.
  • The internal logging facility has been removed, reducing bloat for all users.
  • Many smaller performance tweaks and documentation updates.
Commits

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=rayon&package-manager=cargo&previous-version=1.7.0&new-version=1.8.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) ---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- Cargo.lock | 43 ++++++------------------------------------- src/core/Cargo.toml | 2 +- 2 files changed, 7 insertions(+), 38 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index a2d9e0b6d7..1c08fd06e6 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -327,16 +327,6 @@ dependencies = [ "itertools 0.10.3", ] -[[package]] -name = "crossbeam-channel" -version = "0.5.8" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a33c2bf77f2df06183c3aa30d1e96c0695a313d4f9c453cc3762a6db39f99200" -dependencies = [ - "cfg-if", - "crossbeam-utils", -] - [[package]] name = "crossbeam-deque" version = "0.8.1" @@ -542,15 +532,6 @@ version = "0.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "95505c38b4572b2d910cecb0281560f54b440a19336cbbcb27bf6ce6adc6f5a8" -[[package]] -name = "hermit-abi" -version = "0.1.19" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "62b467343b94ba476dcb2500d242dadbb39557df889310ac77c5d99100aaac33" -dependencies = [ - "libc", -] - [[package]] name = "hermit-abi" version = "0.3.2" @@ -587,7 +568,7 @@ version = "1.0.11" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "eae7b9aee968036d54dce06cebaefd919e4472e753296daccd6d344e3e2df0c2" dependencies = [ - "hermit-abi 0.3.2", + "hermit-abi", "libc", "windows-sys", ] @@ -598,7 +579,7 @@ version = "0.4.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "adcf93614601c8129ddf72e2d5633df827ba6551541c6d8c59520a371475be1f" dependencies = [ - "hermit-abi 0.3.2", + "hermit-abi", "io-lifetimes", "rustix 0.37.20", "windows-sys", @@ -817,16 +798,6 @@ dependencies = [ "libm", ] -[[package]] -name = "num_cpus" -version = "1.13.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "19e64526ebdee182341572e50e9ad03965aa510cd94427a4549448f285e957a1" -dependencies = [ - "hermit-abi 0.1.19", - "libc", -] - [[package]] name = "once_cell" version = "1.18.0" @@ -1029,9 +1000,9 @@ dependencies = [ [[package]] name = "rayon" -version = "1.7.0" +version = "1.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1d2df5196e37bcc87abebc0053e20787d73847bb33134a69841207dd0a47f03b" +checksum = "9c27db03db7734835b3f53954b534c91069375ce6ccaa2e065441e07d9b6cdb1" dependencies = [ "either", "rayon-core", @@ -1039,14 +1010,12 @@ dependencies = [ [[package]] name = "rayon-core" -version = "1.11.0" +version = "1.12.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4b8f95bd6966f5c87776639160a66bd8ab9895d9d4ab01ddba9fc60661aebe8d" +checksum = "5ce3fb6ad83f861aac485e76e1985cd109d9a3713802152be56c3b1f0e0658ed" dependencies = [ - "crossbeam-channel", "crossbeam-deque", "crossbeam-utils", - "num_cpus", ] [[package]] diff --git a/src/core/Cargo.toml b/src/core/Cargo.toml index fe05a69d2e..4d658491d0 100644 --- a/src/core/Cargo.toml +++ b/src/core/Cargo.toml @@ -40,7 +40,7 @@ niffler = { version = "2.3.1", default-features = false, features = [ "gz" ] } nohash-hasher = "0.2.0" num-iter = "0.1.43" once_cell = "1.18.0" -rayon = { version = "1.7.0", optional = true } +rayon = { version = "1.8.0", optional = true } serde = { version = "1.0.168", features = ["derive"] } serde_json = "1.0.107" primal-check = "0.3.1" From 7b0da53821859217f0f2a95100a0142f7755258e Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 25 Sep 2023 23:18:34 +0000 Subject: [PATCH 4/5] Bump memmap2 from 0.7.1 to 0.8.0 (#2780) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [//]: # (dependabot-start) ⚠️ **Dependabot is rebasing this PR** ⚠️ Rebasing might not happen immediately, so don't worry if this takes some time. Note: if you make any changes to this PR yourself, they will take precedence over the rebase. --- [//]: # (dependabot-end) Bumps [memmap2](https://github.com/RazrFalcon/memmap2-rs) from 0.7.1 to 0.8.0.
Changelog

Sourced from memmap2's changelog.

[0.8.0] - 2023-09-25

Changed

Fixed

  • Some of the Advise variants were unsound and now require unsafe to be constructed. @​adamreichold
Commits
  • 2cf8233 Version bump.
  • c4b497e Change madvise API to differentiate between safe and unsafe advice options.
  • 53d6e4c A few tiny clippy-inspired nits.
  • See full diff in compare view

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=memmap2&package-manager=cargo&previous-version=0.7.1&new-version=0.8.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) ---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- Cargo.lock | 4 ++-- src/core/Cargo.toml | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 1c08fd06e6..a0e291e2c9 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -704,9 +704,9 @@ dependencies = [ [[package]] name = "memmap2" -version = "0.7.1" +version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f49388d20533534cd19360ad3d6a7dadc885944aa802ba3995040c5ec11288c6" +checksum = "43a5a03cefb0d953ec0be133036f14e109412fa594edc2f77227249db66cc3ed" dependencies = [ "libc", ] diff --git a/src/core/Cargo.toml b/src/core/Cargo.toml index 4d658491d0..3eb6e9f9f1 100644 --- a/src/core/Cargo.toml +++ b/src/core/Cargo.toml @@ -49,7 +49,7 @@ typed-builder = "0.14.0" twox-hash = "1.6.0" vec-collections = "0.3.4" piz = "0.5.0" -memmap2 = "0.7.1" +memmap2 = "0.8.0" ouroboros = "0.18.0" [dev-dependencies] From 8e63153372dc76de4e11d38a725de5d2b1c9fd5c Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 26 Sep 2023 06:52:07 -0700 Subject: [PATCH 5/5] MRG: update more text to suggest `sourmash tax` / deprecate `sourmash lca` (#2784) This PR further updates the command-line docs and the classifying-signatures docs to deprecate LCA. Fixes https://github.com/sourmash-bio/sourmash/issues/2779 --- doc/classifying-signatures.md | 72 +++++++++++++++++------------------ doc/command-line.md | 7 ++-- 2 files changed, 38 insertions(+), 41 deletions(-) diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index ba91709d3b..756e5b4728 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -35,41 +35,38 @@ analysis only. See [the main sourmash tutorial](tutorial-basic.md#make-and-search-a-database-quickly) for information on using `search` with and without `--containment`. -## Breaking down metagenomic samples with `gather` and `lca` +## Analyzing metagenomic samples with `gather` Neither search option (similarity or containment) is effective when -comparing or searching with metagenomes, which typically have a +comparing or searching with metagenomes, which typically contain a mixture of many different genomes. While you might use containment to see if a query genome is present in one or more metagenomes, a common question to ask is the reverse: **what genomes are in my metagenome?** - -We have implemented two approaches in sourmash to do this. - - - -One approach uses taxonomic information from e.g. GenBank to classify -individual k-mers, and then infers taxonomic distributions of -metagenome contents from the presence of these individual -k-mers. (This is the approach pioneered by -[Kraken](https://ccb.jhu.edu/software/kraken/) and used by many other tools.) -`sourmash lca` can be used to classify individual genome bins with -`classify`, or summarize metagenome taxonomy with `summarize`. The -[sourmash lca tutorial](tutorials-lca.md) -shows how to use the `lca classify` and `lca summarize` commands, and also -provides guidance on building your own database. - -The other approach, `gather`, breaks a metagenome down into individual -genomes based on greedy partitioning. Essentially, it takes a query -metagenome and searches the database for the most highly contained -genome; it then subtracts that match from the metagenome, and repeats. -At the end it reports how much of the metagenome remains unknown. The +An alternative phrasing is this: **what reference genomes should I map +my metagenomic reads to?** + +The main approach we provide in sourmash is `sourmash gather`. This +constructs the shortest possible list of reference genomes that cover +all of the known k-mers in a metagenome. We call this a *minimum +metagenome cover*. + +From an algorithmic perspective, `gather` generates a minimum set +cover for a query metagenome, using the reference database you give +it. The minimum set cover is calculated using a greedy approximation +algorithm. Essentially, `gather` takes a query metagenome and +searches the database for the most highly contained genome; it then +subtracts that match from the metagenome, and repeats. At the end it +reports how much of the metagenome remains unknown. The [basic sourmash tutorial](tutorial-basic.md#whats-in-my-metagenome) -has some sample output from using gather with GenBank. See Appendix A at -the bottom of this page for more technical details. +has some sample output from using gather with GenBank. See Appendix A +at the bottom of this page for more technical details. -Some benchmarking on CAMI suggests that `gather` is a very accurate -method for doing strain-level resolution of genomes. More on -that as we move forward! +The `gather` method is described in +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). +Our benchmarking in that paper and also in +[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., 2022](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0) +suggests that it is a very sensitive and specific method for +analyzing metagenomes. ## Taxonomic profiling with sourmash @@ -95,13 +92,14 @@ create your own custom taxonomic ranks and even use them with private databases of genomes to classify your own metagenomes. The main disadvantage of sourmash's approach to taxonomy is that -sourmash doesn't classify individual metagenomic reads to either a genome -or a taxon. (Note that we're not sure -this can be done robustly in practice - neither short nor long reads typically -contain enough information to uniquely identify a single genome.) If you -want to do this, we suggest running `sourmash gather` first, and then -mapping the reads to the matching genomes; then you can use the mapping -to determine which read maps to which genome. This is the approach taken by +sourmash doesn't classify individual metagenomic reads to either a +genome or a taxon. (Note that we're not sure this can be done robustly +in practice - neither short nor long reads typically contain enough +information to uniquely identify a single genome, especially if there +are many genomes from the same species present in the database.) If +you want to do this, we suggest running `sourmash gather` first, and +then mapping the reads to the matching genomes; then you can determine +which read maps to which genome. This is the approach taken by [the genome-grist pipeline](https://dib-lab.github.io/genome-grist/). @@ -125,8 +123,8 @@ and appears to be both very accurate and very sensitive, unless you're using Nanopore data or other data types that have a high sequencing error rate. -It's important to note that taxonomy based on k-mers is very, very -specific and if you get a match, it's pretty reliable. On the +It's important to note that taxonomy based on multiple k-mers is very, +very specific and if you get a match, it's pretty reliable. On the converse, however, k-mer identification is very brittle with respect to evolutionary divergence, so if you don't get a match it may only mean that the specific species or genus you're searching for isn't in diff --git a/doc/command-line.md b/doc/command-line.md index 621c20a8a1..f19d2aa5be 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -373,10 +373,9 @@ collection itself. Note: -Use `sourmash gather` to classify a metagenome against a collection of -genomes with no (or incomplete) taxonomic information. Use `sourmash -lca summarize` to classify a metagenome using a collection of genomes -with taxonomic information. +Use `sourmash gather` to analyze a metagenome against a collection of +genomes. Then use `sourmash tax metagenome` to integrate that collection +of genomes with taxonomic information. #### Alternative search mode for low-memory (but slow) search: `--linear`