Skip to content

Commit

Permalink
more FAQ
Browse files Browse the repository at this point in the history
  • Loading branch information
ctb committed Sep 26, 2023
1 parent 6576d07 commit 87ae6a2
Showing 1 changed file with 98 additions and 18 deletions.
116 changes: 98 additions & 18 deletions doc/faq.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,22 @@
# Frequently Asked Questions

```{contents} Contents
:depth: 3
```

## How is sourmash different from mash?

mash is an awesome piece of software that inspired sourmash - hence the
name we chose for sourmash! But are they the same? No, they are not!
[mash](https://mash.readthedocs.org/) is an awesome piece of software
that inspired sourmash - hence the name we chose for sourmash! But are
they the same? No, they are not!

mash is based on MinHash sketching, a technique for randomly selecting
a very small set of k-mers to represent a potentially very large
sequence. MinHash sketches can be used to estimate Jaccard similarity
(a distance metric that lets you find closely related sequences)
with very high accuracy, under one condition - that the two sequences
be of similar size.
mash is based on
[MinHash sketching](https://en.wikipedia.org/wiki/MinHash), a
technique for randomly selecting a very small set of k-mers to
represent a potentially very large sequence. MinHash sketches can be
used to estimate Jaccard similarity (a distance metric that lets you
find closely related sequences) with very high accuracy, under one
condition - that the two sequences be of similar size.

So, MinHash sketching is great when comparing bacterial genomes, which are
all around 2-10 MB in size.
Expand All @@ -20,10 +26,10 @@ genomes, because metagenomes are usually _much_ larger than genomes.

So we built sourmash around a different kind of sketch - FracMinHash -
which can estimate Jaccard similarity between sets of very different
sizes. FracMinHash sketches also support overlap and containment analysis,
and are convenient in a variety of other ways - see our paper on
FracMinHash, [Lightweight compositional analysis of metagenomes with
FracMinHash and minimum metagenome covers](@@).
sizes. FracMinHash sketches also support overlap and containment
analysis, and are convenient in a variety of other ways - see our
paper on FracMinHash,
[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2).

There are some drawbacks to FracMinHash sketches - read on!

Expand Down Expand Up @@ -60,13 +66,87 @@ that match above the genus level - family or above.

If you go higher (k=51), a higher percentage of k-mers are genome-specific.

For the core sourmash operations - search, gather, and compare -
we believe (with evidence!) that (a) the differences between k=21,
k=31, and k=51 are negligible; and that (b) k=31 works fine for
most day-to-day use of sourmash.
For the core sourmash operations - search, gather, and compare - we
believe (with evidence!) that (a) the differences between k=21, k=31,
and k=51 are negligible; and that (b) k=31 works fine for most
day-to-day use of sourmash.

We also provide [Genbank and GTDB databases](databases.md) for k=21,
k=31, and k=51.

We also provide [Genbank and GTDB databases](@@) for k=21, k=31, and
k=51, too.
For some background on k-mer specificity, we recommend this paper:
[MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation](https://journals.asm.org/doi/10.1128/msystems.00020-16),
Koslicki & Falush, 2016.

## How do k-mer-based analyses compare with read mapping?

tl;dr very well! But it's a bit one sided: if k-mers match, reads will
map, but not necessarily vice versa. So read mapping rates are almost always
higher than k-mer matching rates.

### Mapping reads to reference vs k-mer "detection" or containment

Let's start by looking at a simple example: suppose you have Illumina
shotgun sequencing of a new isolate, and you want to compare it to a
reference genome for a member of the same species. You calculate
k-mer containment (using e.g. `sourmash search --containment
ref.sig.gz isolate.sig.gz`) of the reference genome in the isolate shotgun
sequence to be 65%. You then map the reads from the isolate data to the
reference genome. What should you expect to see?

What you should see is that 65% or more of the reference genome is
covered by at least one read. This is known as the mapping-based
"detection" of the reference genome in the read data set, and k-mer
detection typically *underestimates* mapping-based detection.

(If you want to know _how many of the reads will map_, you need to use
a different number that is output by `sourmash gather` - although, for
single genomes with only a few repeats, the percentage of reads that
map should match the k-mer detection number you calculate with
containment.)

### Why do k-mers underestimate read mapping?

K-mers underestimate read mapping because k-mers rely on exact matches,
while read mapping tolerates mismatches. So unless you are looking at
entirely error free data with no strain variation, mismatches will
sometimes prevent k-mers from matching.

For some math: consider a k-mer of size 21. It will only match to a
specific 21 base sequence exactly. If you are matching two isolate genomes
that align completely but have a 95% average nucleotide identity, then
one out of every 20 bases will be different (on average), and, on average,
every 21-mer will be different! (In practice you'd expect about 1/3 of the
k-mers to match if the variation is random, since there will be many 21-base
windows that don't have any mismatches.) However, alignment-based approaches
like read mapping will happily align across single-base mismatches, and so
even at 95% ANI you would expect many reads to align.

In practice, the numbers will differ, but the intuition remains!

### How do read mapping rates for metagenomes compare with k-mer statistics?

@@

### Further reading and links on k-mers and mapping:

* The paper
[Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater](https://www.biorxiv.org/content/10.1101/2022.10.27.514113v1),
Lumian et al., 2022, shows that k-mer detection almost always
underestimates mapping, and k-mer abundance analysis is always more
conservative than mapping-based analyses.

* The paper
[Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash](https://genome.cshlp.org/content/33/7/1061),
Rahman Hera et al., 2023, shows how to translate between average
nucleotide identity (ANI) and k-mer statistics.

* The paper
[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2),
Irber et al., 2022, describes how `sourmash gather` assigns k-mers
from metagenomes to a set of reference genomes, and shows that read
mapping correlates pretty well with k-mer overlap. Note that it is
focused on *systematic decomposition of metagenomes against reference
databases*, so it tackles the question of analyzing an entire
metagenome against all available references, not just a single
matching genome.

0 comments on commit 87ae6a2

Please sign in to comment.