diff --git a/docs/choosing_parameters.rst b/docs/choosing_parameters.rst index 6296f39..96e82b2 100644 --- a/docs/choosing_parameters.rst +++ b/docs/choosing_parameters.rst @@ -1,63 +1,93 @@ -Choosing parameters +Important Concepts =================== -The values used during de-replication and genome comparison are critical to understanding what the program is actually doing to your genome set. +The parameters selected during de-replication and genome comparison are critical to understanding what dRep is actually doing to your genome set. This section of the docs will address the following key concepts: -There are two critical high-level decisions you must make before running dRep to de-replicate a genome set: +**1. Choosing an appropriate secondary ANI threshold.** This threshold dictates how similar genomes need to be for them to be considered the “same”. This section goes into how to choose an appropriate `S_ANI` threshold. -**1. How similar do genomes need to be for them to be considered the “same”?** +**2. Minimum alignment coverage.** When calculating the ANI between genomes we also determine what fraction of the genomes were compared. The value can be used to establish minimum needed level of overlap between genomes (`cov_thresh`), but there is a lot to be aware of when using this parameter. -**2. What is the minimum genome completeness allowed in analysis?** +**3. Choosing representative genomes.** During de-replication the first step is identifying groups of similar genomes, and the second step is picking a Representative Genome (RG) for each cluster. This section gets into the second step. + +**4. Using greedy algorithms.** Greedy algorithms are those that take "shortcuts" to arrive at a solution. dRep can use greedy algorithms at several stages to speed up runtime; this section gets into how to use them and what to be aware of. + +**5. Importance of genome completeness.** Due to it's reliance on Mash genomes must be somewhat complete for dRep to handle them properly. This section explains why. + +**6. Oddities of hierarchical clustering.** After all genomes are compared, dRep uses hierarchical clustering to convert pair-wise ANI values into clusters of genomes. This section gets into how this works and where it can go wrong. + +**7. Overview of genome comparison algorithms.** dRep has lots of different genome comparison algorithms to choose from. This section briefly describes how they work and the differences between them. + +**8. Comparing and dereplicating non-bacterial genomes.** Some thoughts about the appropriate parameters to use when comparing things like bacteriophages, plasmids, and eukaryotes. .. seealso:: :doc:`module_descriptions` For more general descriptions of routine parameters + :doc:`overview` + For a general overview of dRep concepts -What defines genomes as being "same"? -------------------------------------- +1. Choosing an appropriate secondary ANI threshold +--------------------------------------------------------- -There is no standard definition of the average nucleotide identity (ANI) shared between two genomes that are the "same". This is a decision that the user must make on their own, depending on their own specific application. The ANI is determined by the **secondary clustering algorithm**, the **minimum secondary ANI** is the minimum ANI between genomes to be considered the "same", and the **minimum aligned fraction** is the minimum amount of genome overlap to believe the reported ANI value. +There is no one definition of the average nucleotide identity (ANI) shared between two genomes that are the "same". This is a decision that the user must make on their own depending on their own specific application. The ANI is determined by the **secondary clustering algorithm**, the **minimum secondary ANI** is the minimum ANI between genomes to be considered the "same", and the **minimum aligned fraction** is the minimum amount of genome overlap to believe the reported ANI value. See :ref:`7. Overview of genome comparison algorithms` for descriptions of **secondary clustering algorithms** and :ref:`2. Minimum alignment coverage` for concepts about adjusting the **minimum aligned fraction**. -.. tip:: - For help choosing this threshold, see the blog post: `Are these microbes the “same”? `_. +One use-case for dereplication is to take a large group of genomes and pick out set of Species-level Representative Genomes (SRGs). This was recently done using dRep in `Almeida et. al. `_ , where the goal was to generate a comprehensive set high-quality reference genomes for microbes that live in the human gut. These is debate about the exact threshold for delineating bacteria of different species, but **most studies agree that 95% ANI is an appropriate threshold for species-level de-replication**. See `Olm et. al. 2020 `_ for context on how this threshold was chosen and some debate on whether bacterial species exist as a continuum or as discrete units in nature. +Another use-case for dereplicaiton is to generate a set of genomes that are distinct enough to avoid mis-mapping. If one were to map metagenomic reads (which are usually ~150bp) to two genomes that were 99.9% the same, most reads would map equally well to both genomes, and you wouldn't be able to tell which read truly "belongs" to one genome or the other. **If the goal of dereplication is to generate a set of genomes that are distinct when mapping short reads, 98% ANI is an appropriate threshold.** See `the following page `_ for how the 98% value was arrived at. -Secondary clustering algorithm -++++++++++++++++++++++++++++++ +The default value in dRep is 99%, and this is a limit to how precise the secondary ANI threshold can be. As a rule of thumb thresholds as high as 99.9% are probably safe, but higher than that is beyond the limit of detection. Comparing genomes to themselves will usually yield ANI values of ~99.99%, since the algorithms aren't perfect and get thrown off by genomic repeat regions and things like this. See `the publication for the program InStrain `_ for details about testing the limits of detection for dRep, and how to perform more detailed stain-level comparisons if desired. -The **secondary clustering algorithm** is the program that will calculate the accurate Average Nucleotide Identity (ANI) between genomes. The current options supported by dRep are ANIm (`Richter 2009 `_) and gANI (`Varghese 2015 `_). +For additional thoughts on this threshold, see the blog post: `Are these microbes the “same”? `_. -* **ANIm** aligns whole genome fragments and calculates the nucleotide identity of aligned regions -* **gANI** aligns ORFs called by prodigal and calculates the nucleotide identity of aligned ORFs +.. note:: + + Keep in mind that in all cases you are collapsing closely related, but probably not identical, strains / genomes. This is simply the nature of the beast. If desired, you can compare the strains by mapping the original reads back to the de-replicated genome to visualize the strain cloud (`Bendall 2016 `_, `blog post `_) using `the program inStrain `_ -Neither of these algorithms are perfect, especially in repeat-prone genomes. Regions of the genome which are not homologous can align to each other and artificially decrease ANI. In fact, when a genome is compared to itself, ANIm often reports values <100% for this reason. gANI is better about this, but seems to be more sensitive to genome subsetting. -* **ANImf** is very similar to ANIm, but filters the alignment before calculating ANI. This takes slighty more time, but is much more accurate with repeat regions +2. Minimum alignment coverage +---------------------------------- -Minimum secondary ANI -+++++++++++++++++++++ +The **minimum aligned fraction** is the minimum amount that genomes must overlap to make the reported ANI value "count". This value is reported as part of the ANIm/gANI algorithms. -The **minimum secondary ANI** is the minimum ANI between genomes for them to be considered the "same". For context, genomes of the same species typically have >=96.5% gANI (`Varghese 2015 `_). +Imagine a scenario where two distantly related genomes share a single identical transposon. When the genome comparison algorithm is run, the transposon is probably the only part of the genomes that aligns, and the alignment will have 100% ANI. This will result in a reported ANI of 100%, and reported **aligned fraction** of ~0.1%. The **minimum aligned fraction** is to handle the above scenario- anything with less than the minimum aligned fraction of genome alignment will have the ANI changed to 0. Default value is 10%. -The default value in dRep is 99%. Preliminary testing suggests that with gANI taking this up to 99.9% is probably safe, but higher than that is beyond the limit of detection. For ANIm you really can't go above 99%, as a comparison of a genome to itself can sometimes get that low. gANI is more thrown by genome incompleteness; ANIm is more thrown by repeat-regions. +The way that dRep handles **minimum aligned fraction** can account for the above scenario, but it's pretty clumsy overall. When a comparison is below the **minimum aligned fraction** dRep literally just changes the ANI from whatever the algorithm reported to a "0". This can cause problems with hierarchical clustering down the line (see :ref:`7. Overview of genome comparison algorithms`). I've tried to think about better ways of handling it, but it's just a difficult thing to account for. How do you handle a set of genomes that share 100% ANI over 20% of their genomes? **After years of using dRep, however, I've come to this conclusion that this is rarely a problem in practical reality.** In most cases the aligned fraction is either really high (>50%) or really low (>10%), so the default value of 10% works in most cases. See Figure 1 in `Olm et. al. 2020 `_ for a depiction of typical intra-species (genomes share >95% ANI) alignment coverage values, and notice that genomes with >95% ANI almost never have low alignment coverage. .. note:: - Keep in mind that in all cases you are collapsing closely related, but probably not identical, strains / genomes. This is simply the nature of the beast. If desired, you can compare the strains by mapping the original reads back to the de-replicated genome to visualize the strain cloud (`Bendall 2016 `_, `blog post `_), or by comparing genomes within a secondary cluster using other methods (like `Mauve `_) + It has been suggested that a minimum aligned fraction of 60% should be applied to species-level taxonomic definitions (`Varghese 2015 `_). However, this is probably too stringent when incomplete genomes are being used (as is often the case with genome de-replication). -Minimum aligned fraction -++++++++++++++++++++++++ +3. Choosing representative genomes +---------------------------------- -The **minimum aligned fraction** is the minimum amount that genomes must overlap to make the reported ANI value "count". This value is reported as part of the ANIm/gANI algorithms. +dRep uses a score-based system to pick representative genomes. Each genome in the cluster is assigned a score, and the genome with the highest score is chosen as the representative. This score is based on the formula: -Imagine a scenario where two genomes of a separate phyla share a single identical transposon. When the ANIm/gANI algorithm is run, the transposon is probably the only part of the genomes that aligns, and the alignment will have 100% ANI. This will result in a reported ANI of 100%, and reported **aligned fraction** of ~0.1%. The **minimum aligned fraction** is to handle the above scenario- anything with less than the minimum aligned fraction of genome alignment will have the ANI changed to 0. Default value is 10%. +.. math:: A*Completeness - B*Contamination + C*(Contamination * (strain heterogeneity/100)) + D*log(N50) + E*log(size) + F*(centrality - S_ani) + +Where A-F are command-line arguments with default values of 1, 5, 1, 0.5, 0, and 1, respectively. Adjusting A-F lets you decide how much to weight particular features when choosing representative genomes. For example, if you really care about having low contamination and high N50, you could increase B and D. + +Completeness, Contamination, and strain heterogeneity are provided by the user or calculated with checkM. N50 is a measure of how big the pieces are that make up the genome. size is the total length of the genome. Centrality is a measure of how similar a genome is to all other genomes in it's cluster. This metric helps pick genome that are similar to all other genomes, and avoid picking genomes that are relative outliers. + +Some publications have added other metrics to their scoring when picking representative genomes, such as whether or not the genome came from an isolate. See `A unified catalog of 204,938 reference genomes from the human gut microbiome `_ and `A complete domain-to-species taxonomy for Bacteria and Archaea `_ for examples. + +4. Using greedy algorithms +------------------------------------ + +In layman's terms, greedy algorithms are those that take shortcuts to run faster and arrive at solutions that may not be optimal but are "close enough". The better the greedy algorithm, the smaller the difference between the optimal solution and the greedy solution. Since pair-wise comparisons quickly scale to a level that would take decades to compute, dRep uses a number of greedy algorithms to speed things up. + +One greedy algorithm dRep uses is primary clustering. Performing this step dramatically reduces the number of genome comparisons have to be made, decreasing run-time. The cost of this is that if genomes end up in different primary clusters they will never be compared, and thus will never be in the same final clusters. That's why the section below (Importance of genome completeness) is important. .. note:: - It has been suggested that a minimum aligned fraction of 60% should be applied to species-level taxonomic definitions (`Varghese 2015 `_). However, this is probably too stringent when incomplete genomes are being used (as is the case with genome de-replication) + In 2021 (dRep v3) several additional greedy algorithms were introduced, described below. These are relatively new features, so please don't hesitate to reach out if you notice problems or have suggestions. + +`--multiround_primary_clustering` performs primary clustering in a series of groups that are than merged at the end with single linkage clustering. This dramatically decreases RAM usage and increases speed, and the cost of a minor loss in precision and the inability to plot primary_clustering_dendrograms. Especially helpful when clustering 5000+ genomes. + +`--greedy_secondary_clustering` use a heuristic to avoid pair-wise comparisons when doing secondary clustering. The way this works is that one genome is randomly chosen to represent a cluster. Then the next genome is compared to that one. If it's below ANI thresholds with that genome, it will be put in that cluster. If it's not, it will be put into a new cluster and made the representative genome of the new cluster. The 3rd genome will then be comparing to all cluster representatives, and so on. This essentially results in single linkage clustering without the need for pair-wise comparisons. Unfortunately this doesn't increase speed as much as you would expect due to `the need of FastANI to continually re-sketch genomes `_. This option only works for the fastANI `S_algorithm` at the moment. + +`--run_tertiary_clustering` is not a greedy algorithm, but is a way to handle potential inconsistencies introduced by greedy algorithms. Once clustering is complete and representative genomes are chosen, this option run an additional round of clustering on the final genome set. This is especially useful when greedy clustering is performed and/or to handle cases where similar genomes end up in different primary clusters. It's essentially a check to make sure that all genomes are as distinct from one another as they should be based on the parameters given. -What is the minimum genome completeness allowed in analysis? ------------------------------------------------------------- +5. Importance of genome completeness +---------------------------------------- This decision is much more complicated than the previous. Essentially, there exists a trade-off between computational efficiency and the minimum genome completeness. @@ -107,9 +137,47 @@ A final piece to consider is that when running dRep for real, the user doesn't a * Going below 50% completeness is not recommended. The resulting genomes will be very crappy anyways, and even the secondary algorithms break-down at this point. * Lowering the secondary ANI should result in a consummate lowering in MASH ANI. This is because you want Mash to group non-similar *and* incomplete genomes. - * To make sure clusters are not being split unnecessarily, you can run the warnings at the end. See :doc:`module_descriptions` for info -The Rest --------- +6. Oddities of hierarchical clustering +---------------------------------------------- + +The most common "bug" reported to dRep is that genome pairs with an ANI greater than the threshold end up in different clusters, or genome pairs with ANI less than the threshold end up in the same cluster. This almost always happens because of hierarchical clustering, the way that dRep transforms pair-wise ANI values into clusters. + +By default, hierarchical clustering is performed in dRep using average linkage (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) which can result in scenarios like those described above. If you want all cases where two genomes are over your thresholds to be in the same cluster, you can run it in single mode (`--clusterAlg` single). The problem is that this can create big clusters- for example if A is similar to B, and B is similar to C, but A and C are not similar, what do you do? In single mode A, B, and C will be in the same cluster, and in average mode some averaging is done to try and handle this. + +dRep can use any method of linkage listed at the following webpage by using the --clusterAlg argument: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html . + +.. note:: + + dRep also generates dendrogram figures to visualize the hierarchical clustering (though if you have too many genomes this can get too big to be visualized). `This blog post `_ was instrumental in my understanding of how hierarchical clustering works and how to implement it in python. + +7. Overview of genome comparison algorithms +---------------------------------------------- + +**Primary clustering** is always performed with `Mash `_; an extremely fast but somewhat inaccurate algorithm. + +There are several supported **secondary clustering algorithms**. These calculate the accurate Average Nucleotide Identity (ANI) between genomes that is used to cluster genomes into secondary clusters. The following algorithms are currently supported as of version 3: + +* **ANIn** (`Richter 2009 `_). This aligns whole genomes with nucmer and compares the aligned regions. +* **ANImf** (DEFAULT). This is the same as ANIn, but filters the alignments such that each region of genome 1 and only align to a single region of genome 2. This takes slightly more time, but is much more accurate on genomes with repeat regions +* **gANI** (`Varghese 2015 `_). This aligns genes (ORFs) called by Prodigal instead of aligning whole genomes. This algorithm is a bit faster than ANIm-based algorithms, but only aligns coding regions. +* **goANI**. This is my own open-source implementation of gANI, which is not open source (and for which the authors would not share the source code when asked). I wrote this algorithm so that I could calculate dN/dS between aligned genes for `this study `_ (you can too using `dnds_from_drep.py `_). Requires the program `NSimScan `_. +* **FastANI** (`Jain 2018 `_). A really fast Mash-based algorithm that can also handle incomplete genomes. Seems to be just as accurate as alignment-based algorithms. **Should probably be the default algorithm when you care about runtime.*** + +.. note:: + None of these algorithms are perfect, especially in repeat-prone genomes. Regions of the genome which are not homologous can align to each other and artificially decrease ANI. In fact, when a genome is compared to itself, the algorithms often reports values <100% for this reason. + +8. Comparing and dereplicating non-bacterial genomes +----------------------------------------------------- + +dRep was developed for the use-case of bacterial dereplication, and there are some things to be aware of when running it on non-bacterial entities. + +A major thing to be aware of is primary clustering. As described in :ref:`5. Importance of genome completeness`, genomes need to be >50% complete for primary clustering to work. If you're comparing entities in which you cannot assess completeness or in which you want to compare genomes that share only a limited number of genes (e.g. phage or plasmids), this a problem. The easiest way to handle it is to avoid primary cluster altogether with the parameter `--SkipMash`, or lower the primary clustering threshold with `-pa`. + +Also consider the effect of alignment coverage (:ref:`2. Minimum alignment coverage`) on hierarchical clustering (:ref:`6. Oddities of hierarchical clustering`). If your working with entities that are especially mosaic, like phage, this can be a bigger problem than with bacteria. + +Genome filtering and scoring is also a major factor. If your genomes can't be assessed by checkM, you can turn off quality filtering and the use of completeness and contamination when picking genomes with the flag `--ignoreGenomeQuality`. + +When considering these options for my own studies (`Olm 2019 `_ and `Olm 2020 `_), I landed on the following dRep command for clustering bacteriophages and plasmids. Please take this for what it is, one person's attempt to handle these parameters for their specific use-case, and don't be afraid to make additional adjustments as you see fit: :: -The most important and confusing parameters are described above. For information on the other parameters, see :doc:`module_descriptions` + dRep dereplicate --S_algorithm ANImf -nc .5 -l 10000 -N50W 0 -sizeW 1 --ignoreGenomeQuality --clusterAlg single \ No newline at end of file diff --git a/docs/installation.rst b/docs/installation.rst index 62650e7..b4a372e 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -20,6 +20,13 @@ That's it! Pip is a great package with many options to change the installation parameters in various ways. For details, see `pip documentation `_ +Using conda +---------------- + +To install dRep with conda, simply run :: + + conda config --add channels bioconda; conda install drep + Dependencies ------------ @@ -27,22 +34,24 @@ dRep requires other programs to run. Not all dependencies are needed for all ope To check which dependencies are installed on your system and accessible by dRep, run :: - $ dRep bonus testDir --check_dependencies + $ dRep check_dependencies **Near Essential** * `Mash `_ - Makes primary clusters (v1.1.1 confirmed works) -* `MUMmer `_ - Performs ANIm comparison method (v3.23 confirmed works) +* `MUMmer `_ - Performs default ANIm comparison method (v3.23 confirmed works) **Recommended** +* `fastANI `_ - A fast secondary clustering algorithm * `CheckM `_ - Determines contamination and completeness of genomes (v1.0.7 confirmed works) * `gANI (aka ANIcalculator) `_ - Performs gANI comparison method (v1.0 confirmed works) * `Prodigal `_ - Used be both checkM and gANI (v2.6.3 confirmed works) **Accessory** -* `Centrifuge `_ - Performs taxonomic assignment of bins (v1.0.3 confirmed works) +* `NSimScan `_ - Only needed for goANI algorithm +* `Centrifuge `_ - Deprecated; not used anymore Programs need to be installed to the system path, so that you can call them from anywhere on your computer. @@ -52,6 +61,6 @@ Programs need to be installed to the system path, so that you can call them from CheckM ------ +------- -CheckM is the program that dRep uses to calculate completeness and contamination. It's not required for all dRep commands, but if you'd like to de-dereplicate your genome set using completeness and contamination it is required. It takes a bit of work to install (including setting a root directory and downloading a reference database). Installation instructions can be found `here ` +CheckM is the program that dRep uses to calculate completeness and contamination. It's not required for all dRep commands, but if you'd like to de-dereplicate your genome set using completeness and contamination it is required. It takes a bit of work to install (including setting a root directory and downloading a reference database). Installation instructions can be found `here `_ diff --git a/docs/overview.rst b/docs/overview.rst index 20d938e..4d70672 100644 --- a/docs/overview.rst +++ b/docs/overview.rst @@ -39,3 +39,25 @@ The steps to this process are: * Bin each assembly (and co-assembly) separately using your favorite binner * Pull the bins from all assemblies together and run dRep on them * Perform downstream analysis on the de-replicated genome list + +.. note:: + + See the publication `To Dereplicate or Not To Dereplicate? `_ for an outside perspective on the pro's and con's of dereplication + +A dRep based metagenomic workflow +---------------------------------- + +One method of genome-resolved metagenomic analysis that I am fond of involves the following steps: + +1) Assemble and bin all metagenomic samples to generate a large database of metagenome-assembled genomes (MAGs) + +2) Dereplicate all of these genomes into a set of species-level representative genomes (SRGs). This is done with dRep using a 95% ANI threshold. + +3) Create a mapping database from all SRGs, and map all metagenomes to this database. Each sample will now be represented by a `.bam` file mapped against the same database. + +4) Profile all .bam files using `inStrain `_. This provides a huge number of metrics about the species-level composition of each metagenomes. + +5) Preform strain-level comparisons between metagenomes using `inStrain `_ + +Some example publications in which this workflow is used are `here `_ and `there `_. + diff --git a/docs/quick_start.rst b/docs/quick_start.rst index 9756e0e..2cb74ac 100644 --- a/docs/quick_start.rst +++ b/docs/quick_start.rst @@ -1,34 +1,30 @@ Quick Start =========== -The functionality of dRep is broken up into modules. The modules can be run separately (see :doc:`module_descriptions`), or together in workflows. To see a list of the available modules, check the help:: +The functionality of dRep is broken up into modules. To see a list of the available modules, check the help:: - $ dRep -h + $ dRep -h + ...::: dRep v3.0.0 :::... - ...::: dRep v2.4.0 :::... + Matt Olm. MIT License. Banfield Lab, UC Berkeley. 2017 (last updated 2020) - Matt Olm. MIT License. Banfield Lab, UC Berkeley. 2017 + See https://drep.readthedocs.io/en/latest/index.html for documentation + Choose one of the operations below for more detailed help. - Choose one of the operations below for more detailed help. See https://drep.readthedocs.io/en/latest/index.html for documentation - Example: dRep dereplicate -h + Example: dRep dereplicate -h - Workflows: - compare -> Perform rapid pair-wise comparison on a list of genomes - dereplicate -> De-replicate a list of genomes + Commands: + compare -> Compare and cluster a set of genomes + dereplicate -> De-replicate a set of genomes + check_dependencies -> Check which dependencies are properly installed - Single operations: - filter -> Filter a genome list based on size, completeness, and/or contamination - cluster -> Compare and cluster a genome list based on MASH and ANIn/gANI - choose -> Choose the best genome from each genome cluster - evaluate -> Evaluate genome de-replication - bonus -> Other random operations (determine taxonomy / check dependencies) -De-replication +Dereplication --------------- -De-replication is the process of identifying groups of genomes that are the "same" in a genome set, and removing all but the "best" genome from each redundant set. How similar genomes need to be to be considered "same", how the "best" genome is chosen, and other options can be adjusted (see :doc:`choosing_parameters`) +Dereplication is the process of identifying groups of genomes that are the "same" in a genome set and identifying the "best" genome from each set. How similar genomes need to be to be considered "same" and how the "best" genome is chosen are study-specific decisions that can be adjusted (see :doc:`choosing_parameters`) -To de-replicate a set of genomes, run the following command:: +To dereplicate a set of genomes, run the following command:: $ dRep dereplicate outout_directory -g path/to/genomes/*.fasta