From e2c199f9b544ce060abd8b35f0edf49c6655168f Mon Sep 17 00:00:00 2001
From: "C. Titus Brown" <titus@idyll.org>
Date: Tue, 30 Jan 2024 11:34:51 -0800
Subject: [PATCH] MRG: add full column descriptions for `gather` and `prefetch`
 output (#2954)

This PR adds full column descriptions for `gather` and `prefetch` to
`classifying-signatures.md`. It also updates some other details in that
document, including adding a link to the published Hera et al. paper in
2023.

See [rendered
docs](https://sourmash--2954.org.readthedocs.build/en/2954/classifying-signatures.html)!

Fixes https://github.com/sourmash-bio/sourmash/issues/2812
Fixes https://github.com/sourmash-bio/sourmash/issues/2367

---------

Co-authored-by: Colton Baumler <63077899+ccbaumler@users.noreply.github.com>
---
 doc/classifying-signatures.md | 91 ++++++++++++++++++++++++++++++-----
 doc/index.md                  |  2 +-
 doc/sidebar.md                |  2 +-
 3 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md
index eb05de58ff..f9baffd165 100644
--- a/doc/classifying-signatures.md
+++ b/doc/classifying-signatures.md
@@ -1,12 +1,14 @@
 # Classifying signatures: `search`, `gather`, and `lca` methods.
 
+sourmash provides several different techniques for doing
+classification and breakdown of genomic and metagenomic signatures.
+These include taxonomic classification as well as decomposition of
+metagenomic data into constitutent genomes.
+
 ```{contents} Contents
 :depth: 3
 ```
 
-sourmash provides several different techniques for doing
-classification and breakdown of signatures.
-
 ## Searching for similar samples with `search`.
 
 The `sourmash search` command is most useful when you are looking for
@@ -234,10 +236,11 @@ metagenomics, please see the simka paper,
 Benoit et al., 2016.
 
 **Implementation note:** Angular similarity searches cannot be done on
-SBT or LCA databases currently; you have to provide lists of signature
-files to `sourmash search` and `sourmash compare`.  sourmash will
-provide a warning if you run `sourmash search` on an LCA or SBT with
-an abundance-weighted query, and automatically apply `--ignore-abundance`.
+SBT or LCA databases currently; you have to provide collections of
+signature files or zip file collections to `sourmash search` and
+`sourmash compare`.  sourmash will provide a warning if you run
+`sourmash search` on an LCA or SBT with an abundance-weighted query,
+and automatically apply `--ignore-abundance`.
 
 ### Estimating ANI from FracMinHash comparisons.
 
@@ -254,10 +257,7 @@ For `sourmash search`, `sourmash prefetch`, and `sourmash gather`, you can
 optionally return confidence intervals around containment-derived ANI estimates,
 which take into account the impact of the scaling factor (via `--estimate-ani-ci`).
 
-For details on ANI estimation, please see our preprint "Debiasing FracMinHash and
-deriving confidence intervals for mutation rates across a wide range of evolutionary
-distances," [here](https://www.biorxiv.org/content/10.1101/2022.01.11.475870v2),
-Hera et al., 2022.
+For details on ANI estimation, please see the paper ["Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash"](https://pubmed.ncbi.nlm.nih.gov/37344105/), Hera et al., 2023.
 
 ## What commands should I use?
 
@@ -535,3 +535,72 @@ figure out across all of the different use cases for gather.  Perhaps
 in the future we'll find a better way to represent all of these
 numbers in a more clear, concise, and interpretable way; in the
 meantime, we welcome your questions and comments!
+
+## Appendix D: Gather CSV output columns
+
+Note that order of columns is not guaranteed and may change between versions.
+
+| `Gather` column header                | Type          | Description |
+| :------------------------------: | :-------------: | :----------- |
+| `unique_intersect_bp`          | integer       | Size of overlap between match and _remaining_ query, estimated by multiplying the number of overlapping hashes by scaled. Rank/order dependent. Does not double count hashes. |
+| `intersect_bp`                 | integer       | Size of overlap between match and query, estimated by multiplying the number of overlapping hashes by scaled. Independent of rank order and will often double-count hashes. |
+| `f_orig_query`                 | float         | The fraction of the original query represented by this match. Approximates the fraction of metagenomic reads that will map to this genome. |
+| `f_match`                      | float         | The containment of the match in the query. |
+| `f_unique_to_query`            | float         | The fraction of matching hashes (unweighted) that are unique to this query; rank dependent. Will sum to the fraction of total k-mers (unweighted) that were identified. |
+| `f_unique_weighted`            | float         | The fraction of matching hashes (weighted by multiplicity) that are unique to this query. This will sum to the fraction of total _weighted_ k-mers that were identified. Approximates the fraction of metagenomic reads that will map to this genome _after_ all previous matches at lower (earlier) ranks are mapped. |
+| `average_abund`                | float         | Mean abundance of the weighted hashes unique to the intersection. Empty if query does not have abundance. Rank dependent, does not double count. |
+| `median_abund`                 | integer       | Median abundance of the weighted hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count. |
+| `std_abund`                    | float         | Std deviation of the abundance of the hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count. |
+| `filename`                     | string        | Filename/location of the database from which the match was loaded. |
+| `name`                         | string        | Full sketch name of the match. |
+| `md5`                          | string        | Full md5sum of the match sketch. |
+| `f_match_orig`                 | float         | The fraction of the match in the full query. Rank independent. |
+| `gather_result_rank`           | float         | Rank of this match in the results. |
+| `remaining_bp`                 | integer       | How many bp remain in the query after subtracting this match, estimated by multiplying remaining hashes by scaled. |
+| `query_filename`               | string        | The filename from which the query was loaded. |
+| `query_name`                   | string        | The query sketch name. |
+| `query_md5`                    | string        | Truncated md5sum of the query sketch. |
+| `query_bp`                     | integer       | Estimated number of bp in the query, estimated by multiplying the sketch size by scaled. |
+| `ksize`                        | integer       | K-mer size for the sketches used in the comparison. |
+| `moltype`                      | string        | Molecule type of the comparison. |
+| `scaled`                       | integer       | Scaled value of the comparison. |
+| `query_n_hashes`               | integer       | Number of hashes in the query sketch. |
+| `query_abundance`              | boolean       | True if the query has abundance information; False otherwise. |
+| `query_containment_ani`        | float         | ANI estimated from the query containment in the match. |
+| `match_containment_ani`        | float         | ANI estimated from the match containment in the query. |
+| `average_containment_ani`      | float         | ANI estimated from the average of the query and match containment. |
+| `max_containment_ani`          | float         | ANI estimated from the max of the query and match containment. |
+| `potential_false_negative`     | boolean       | True if the sketch size(s) were too small to give a reliable ANI estimate. False otherwise. |
+| `n_unique_weighted_found`      | integer       | Sum of (abundance-weighted) hashes found in this rank. |
+| `sum_weighted_found`            | integer       | Sum of the hashes x abundance found thus far, i.e., running total of `n_unique_weighted_found`. The last value divided by `total_weighted_hashes` will equal the total fraction of (weighted) k-mers identified. |
+| `total_weighted_hashes`         | integer       | Sum of hashes x abundance for the entire dataset. Constant value. |
+
+## Appendix E: Prefetch CSV output columns
+
+Note that order of columns is not guaranteed and may change between versions.
+
+| `Prefetch` column header                   | Type          | Description |
+| :----------------------------: | :-------------: | :----------- |
+| `intersect_bp`               | integer       | Size of overlap between match and original query, estimated by multiplying the number of overlapping hashes by `scaled`. |
+| `jaccard`                    | float         | Jaccard similarity of the two sketches. |
+| `max_containment`            | float         | Max of `f_query_match` and `f_match_query`. |
+| `f_query_match`              | float         | The fraction of the query contained by the match. |
+| `f_match_query`              | float         | The fraction of the match contained by the query. |
+| `match_filename`             | string        | Filename the match sketch was loaded from. |
+| `match_name`                 | string        | Full name of the match sketch. |
+| `match_md5`                  | string        | Truncated md5sum of match sketch (8 char). |
+| `match_bp`                   | integer       | Size of match, estimated by multiplying the sketch size by scaled. |
+| `query_filename`             | string        | Filename the query sketch was loaded from. |
+| `query_name`                 | string        | Full name of the query sketch. |
+| `query_md5`                  | string        | Truncated md5sum of query sketch (8 char). |
+| `query_bp`                   | integer       | Size of query, estimated by multiplying the sketch size by scaled. |
+| `ksize`                      | integer       | K-mer size for the sketches used in the comparison. |
+| `moltype`                    | string        | Molecule type of the sketches. |
+| `scaled`                     | integer       | Scaled value at which the comparison was done. |
+| `query_n_hashes`             | integer       | Number of hashes in the query. |
+| `query_abundance`            | integer       | Median hash abundance in the sketch, if available. |
+| `query_containment_ani`      | float         | ANI estimated from the query containment in the match. |
+| `match_containment_ani`      | float         | ANI estimated from the match containment in the query. |
+| `average_containment_ani`    | float         | ANI estimated from the average of the query and match containment. |
+| `max_containment_ani`        | float         | ANI estimated from the max containment between query/match. |
+| `potential_false_negative`   | boolean       | True if the sketch size(s) were too small to give a reliable ANI estimate. False if ANI estimate is reliable. |
diff --git a/doc/index.md b/doc/index.md
index 2639885489..20e5d4f9ff 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -84,7 +84,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM.
 
 ### How-To Guides
 
-* [Classifying genome sketches](classifying-signatures.md)
+* [Classifying genome and metagenome sketches](classifying-signatures.md)
 
 * [Working with private collections of genome sketches](sourmash-collections.ipynb)
 
diff --git a/doc/sidebar.md b/doc/sidebar.md
index 8fefe519a6..5e81538fba 100644
--- a/doc/sidebar.md
+++ b/doc/sidebar.md
@@ -15,7 +15,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM.
 
 ## How-To Guides
 
-* [Classifying genome sketches](classifying-signatures.md)
+* [Classifying genome and metagenome sketches](classifying-signatures.md)
 
 * [Working with private collections of genome sketches](sourmash-collections.ipynb)