From 9689a8310f449bb43a21b921971a67447012639b Mon Sep 17 00:00:00 2001 From: Matt Olm Date: Fri, 8 Jan 2021 15:56:56 -0800 Subject: [PATCH] docs update --- docs/advanced_use.rst | 8 +- docs/module_descriptions.rst | 626 ++++++++++++++++++----------------- 2 files changed, 323 insertions(+), 311 deletions(-) diff --git a/docs/advanced_use.rst b/docs/advanced_use.rst index 48b25fa..dca5135 100644 --- a/docs/advanced_use.rst +++ b/docs/advanced_use.rst @@ -111,9 +111,7 @@ For example, say you've already run the dereplicate_wf using gANI and want to ru API --- -See :doc:`source/drep` for the API to dRep. - -For example:: +See :doc:`source/drep` for the API to dRep. For example:: from drep.WorkDirectory import WorkDirectory @@ -122,4 +120,6 @@ For example:: Cdb = wd.get_db('Cdb') ... -This will work for all datatables +This will work for all datatables. + +Be warned that the API is not very well maintained or documented, and you'll likely have to do a bit of digging into the source code if you want to use it extensively. diff --git a/docs/module_descriptions.rst b/docs/module_descriptions.rst index 307d09b..7a3ab55 100644 --- a/docs/module_descriptions.rst +++ b/docs/module_descriptions.rst @@ -1,43 +1,333 @@ -Module Descriptions +User manual =================== -The functionality of dRep is broken up into modules. The user can run the modules separately, or together in workflows. For example, you could run:: +dRep has 3 commands: compare, dereplicate, and check dependencies. To see a list of these options check the help:: - $ dRep filter example_workD -g path/to/genomes*.fasta + $ dRep -h - $ dRep cluster example_workD + ...::: dRep v3.0.0 :::... - $ dRep analyze example_workD -pl a + Matt Olm. MIT License. Banfield Lab, UC Berkeley. 2017 (last updated 2020) -OR:: + See https://drep.readthedocs.io/en/latest/index.html for documentation + Choose one of the operations below for more detailed help. 
- $ dRep compare example_workD -g path/to/genomes*.fasta + Example: dRep dereplicate -h -There are two ways of doing the same thing. To see a list of available modules, check the help:: + Commands: + compare -> Compare and cluster a set of genomes + dereplicate -> De-replicate a set of genomes + check_dependencies -> Check which dependencies are properly installed - $ dRep -h +In previous versions of dRep (everything before v3) the user could run a number of additional modules separately, but now they can only be run as part of the larger workflows `compare` and `dereplicate`. Many of the modules are the same for `compare` and `dereplicate`, however, and where the same parameter appears in both, it functions exactly the same in each. - ...::: dRep v2.0.0 :::... +dRep has descriptions in the program help for all the adjustable parameters. If any of these are particularly confusing, don't hesitate to send an email to ask what they do. - Choose one of the operations below for more detailed help. - Example: dRep dereplicate -h - Workflows: - dereplicate -> Combine several of the operations below to de-replicate a genome list - compare -> Simply compare a list of genomes +.. 
seealso:: - Single operations: - filter -> Filter a genome list based on size, completeness, and/or contamination - cluster -> Compare and cluster a genome list based on MASH and ANIn/gANI - choose -> Choose the best genome from each genome cluster - evaluate -> Evaluate genome de-replication - bonus -> Other random operations (currently just determine taxonomy) - analyze -> Make figures related to the above operations; test alternative clustering + :doc:`Important Concepts` + for theoretical thoughts about how to choose appropriate parameters and thresholds + + :doc:`example_output` + for help interpreting the output from your run in the work directory + + :doc:`advanced_use` + for access to the raw output data and the python API + +Compare +-------- + +This workflow compares a set of genomes. For a list of all parameters, check the help:: + + $ dRep compare -h + usage: dRep compare [-p PROCESSORS] [-d] [-h] [-g [GENOMES [GENOMES ...]]] + [--S_algorithm {fastANI,gANI,goANI,ANIn,ANImf}] + [-ms MASH_SKETCH] [--SkipMash] [--SkipSecondary] + [--n_PRESET {normal,tight}] [-pa P_ANI] [-sa S_ANI] + [-nc COV_THRESH] [-cm {total,larger}] + [--clusterAlg {median,weighted,single,complete,average,ward,centroid}] + [--multiround_primary_clustering] + [--primary_chunksize PRIMARY_CHUNKSIZE] + [--greedy_secondary_clustering] + [--run_tertiary_clustering] [--warn_dist WARN_DIST] + [--warn_sim WARN_SIM] [--warn_aln WARN_ALN] + work_directory + + positional arguments: + work_directory Directory where data and output are stored + *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS *** + + SYSTEM PARAMETERS: + -p PROCESSORS, --processors PROCESSORS + threads (default: 6) + -d, --debug make extra debugging output (default: False) + -h, --help show this help message and exit + + GENOME INPUT: + -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]] + genomes to filter in .fasta format. Not necessary if + Bdb or Wdb already exist. 
Can also input a text file + with paths to genomes, which results in fewer OS + issues than wildcard expansion (default: None) + + GENOME COMPARISON OPTIONS: + --S_algorithm {fastANI,gANI,goANI,ANIn,ANImf} + Algorithm for secondary clustering comaprisons: + fastANI = Kmer-based approach; very fast + ANImf = (DEFAULT) Align whole genomes with nucmer; filter alignment; compare aligned regions + ANIn = Align whole genomes with nucmer; compare aligned regions + gANI = Identify and align ORFs; compare aligned ORFS + goANI = Open source version of gANI; requires nsmimscan + (default: ANImf) + -ms MASH_SKETCH, --MASH_sketch MASH_SKETCH + MASH sketch size (default: 1000) + --SkipMash Skip MASH clustering, just do secondary clustering on + all genomes (default: False) + --SkipSecondary Skip secondary clustering, just perform MASH + clustering (default: False) + --n_PRESET {normal,tight} + Presets to pass to nucmer + tight = only align highly conserved regions + normal = default ANIn parameters (default: normal) + + GENOME CLUSTERING OPTIONS: + -pa P_ANI, --P_ani P_ANI + ANI threshold to form primary (MASH) clusters + (default: 0.9) + -sa S_ANI, --S_ani S_ANI + ANI threshold to form secondary clusters (default: + 0.99) + -nc COV_THRESH, --cov_thresh COV_THRESH + Minmum level of overlap between genomes when doing + secondary comparisons (default: 0.1) + -cm {total,larger}, --coverage_method {total,larger} + Method to calculate coverage of an alignment + (for ANIn/ANImf only; gANI and fastANI can only do larger method) + total = 2*(aligned length) / (sum of total genome lengths) + larger = max((aligned length / genome 1), (aligned_length / genome2)) + (default: larger) + --clusterAlg {median,weighted,single,complete,average,ward,centroid} + Algorithm used to cluster genomes (passed to + scipy.cluster.hierarchy.linkage (default: average) + + GREEDY CLUSTERING OPTIONS + These decrease RAM use and runtime at the expense of a minor loss in accuracy. 
+ Recommended when clustering 5000+ genomes: + --multiround_primary_clustering + Cluster each primary clunk separately and merge at the + end with single linkage. Decreases RAM usage and + increases speed, and the cost of a minor loss in + precision and the inability to plot + primary_clustering_dendrograms. Especially helpful + when clustering 5000+ genomes. Will be done with + single linkage clustering (default: False) + --primary_chunksize PRIMARY_CHUNKSIZE + Impacts multiround_primary_clustering. If you have + more than this many genomes, process them in chunks of + this size. (default: 5000) + --greedy_secondary_clustering + Use a heuristic to avoid pair-wise comparisons when + doing secondary clustering. Will be done with single + linkage clustering. Only works for fastANI S_algorithm + option at the moment (default: False) + --run_tertiary_clustering + Run an additional round of clustering on the final + genome set. This is especially useful when greedy + clustering is performed and/or to handle cases where + similar genomes end up in different primary clusters. + Only works with dereplicate, not compare. (default: + False) + + WARNINGS: + --warn_dist WARN_DIST + How far from the threshold to throw cluster warnings + (default: 0.25) + --warn_sim WARN_SIM Similarity threshold for warnings between dereplicated + genomes (default: 0.98) + --warn_aln WARN_ALN Minimum aligned fraction for warnings between + dereplicated genomes (ANIn) (default: 0.25) + + Example: dRep compare output_dir/ -g /path/to/genomes/*.fasta + + +Dereplicate +------------ + +This workflow dereplicates a set of genomes. 
For a list of all parameters, check the help:: + + $ dRep dereplicate -h + usage: dRep dereplicate [-p PROCESSORS] [-d] [-h] [-g [GENOMES [GENOMES ...]]] + [-l LENGTH] [-comp COMPLETENESS] [-con CONTAMINATION] + [--ignoreGenomeQuality] [--genomeInfo GENOMEINFO] + [--checkM_method {taxonomy_wf,lineage_wf}] + [--set_recursion SET_RECURSION] + [--S_algorithm {goANI,ANIn,gANI,ANImf,fastANI}] + [-ms MASH_SKETCH] [--SkipMash] [--SkipSecondary] + [--n_PRESET {normal,tight}] [-pa P_ANI] [-sa S_ANI] + [-nc COV_THRESH] [-cm {total,larger}] + [--clusterAlg {single,ward,complete,weighted,centroid,median,average}] + [--multiround_primary_clustering] + [--primary_chunksize PRIMARY_CHUNKSIZE] + [--greedy_secondary_clustering] + [--run_tertiary_clustering] + [-comW COMPLETENESS_WEIGHT] + [-conW CONTAMINATION_WEIGHT] + [-strW STRAIN_HETEROGENEITY_WEIGHT] [-N50W N50_WEIGHT] + [-sizeW SIZE_WEIGHT] [-centW CENTRALITY_WEIGHT] + [--warn_dist WARN_DIST] [--warn_sim WARN_SIM] + [--warn_aln WARN_ALN] + work_directory + + positional arguments: + work_directory Directory where data and output are stored + *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS *** + + SYSTEM PARAMETERS: + -p PROCESSORS, --processors PROCESSORS + threads (default: 6) + -d, --debug make extra debugging output (default: False) + -h, --help show this help message and exit + + GENOME INPUT: + -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]] + genomes to filter in .fasta format. Not necessary if + Bdb or Wdb already exist. 
Can also input a text file + with paths to genomes, which results in fewer OS + issues than wildcard expansion (default: None) + + GENOME FILTERING OPTIONS: + -l LENGTH, --length LENGTH + Minimum genome length (default: 50000) + -comp COMPLETENESS, --completeness COMPLETENESS + Minumum genome completeness (default: 75) + -con CONTAMINATION, --contamination CONTAMINATION + Maximum genome contamination (default: 25) + + GENOME QUALITY ASSESSMENT OPTIONS: + --ignoreGenomeQuality + Don't run checkM or do any quality filtering. NOT + RECOMMENDED! This is useful for use with + bacteriophages or eukaryotes or things where checkM + scoring does not work. Will only choose genomes based + on length and N50 (default: False) + --genomeInfo GENOMEINFO + location of .csv file containing quality information + on the genomes. Must contain: ["genome"(basename of + .fasta file of that genome), "completeness"(0-100 + value for completeness of the genome), + "contamination"(0-100 value of the contamination of + the genome)] (default: None) + --checkM_method {taxonomy_wf,lineage_wf} + Either lineage_wf (more accurate) or taxonomy_wf + (faster) (default: lineage_wf) + --set_recursion SET_RECURSION + Increases the python recursion limit. NOT RECOMMENDED + unless checkM is crashing due to recursion issues. 
+ Recommended to set to 2000 if needed, but setting this + could crash python (default: 0) + + GENOME COMPARISON OPTIONS: + --S_algorithm {goANI,ANIn,gANI,ANImf,fastANI} + Algorithm for secondary clustering comaprisons: + fastANI = Kmer-based approach; very fast + ANImf = (DEFAULT) Align whole genomes with nucmer; filter alignment; compare aligned regions + ANIn = Align whole genomes with nucmer; compare aligned regions + gANI = Identify and align ORFs; compare aligned ORFS + goANI = Open source version of gANI; requires nsmimscan + (default: ANImf) + -ms MASH_SKETCH, --MASH_sketch MASH_SKETCH + MASH sketch size (default: 1000) + --SkipMash Skip MASH clustering, just do secondary clustering on + all genomes (default: False) + --SkipSecondary Skip secondary clustering, just perform MASH + clustering (default: False) + --n_PRESET {normal,tight} + Presets to pass to nucmer + tight = only align highly conserved regions + normal = default ANIn parameters (default: normal) + + GENOME CLUSTERING OPTIONS: + -pa P_ANI, --P_ani P_ANI + ANI threshold to form primary (MASH) clusters + (default: 0.9) + -sa S_ANI, --S_ani S_ANI + ANI threshold to form secondary clusters (default: + 0.99) + -nc COV_THRESH, --cov_thresh COV_THRESH + Minmum level of overlap between genomes when doing + secondary comparisons (default: 0.1) + -cm {total,larger}, --coverage_method {total,larger} + Method to calculate coverage of an alignment + (for ANIn/ANImf only; gANI and fastANI can only do larger method) + total = 2*(aligned length) / (sum of total genome lengths) + larger = max((aligned length / genome 1), (aligned_length / genome2)) + (default: larger) + --clusterAlg {single,ward,complete,weighted,centroid,median,average} + Algorithm used to cluster genomes (passed to + scipy.cluster.hierarchy.linkage (default: average) + + GREEDY CLUSTERING OPTIONS + These decrease RAM use and runtime at the expense of a minor loss in accuracy. 
+ Recommended when clustering 5000+ genomes: + --multiround_primary_clustering + Cluster each primary clunk separately and merge at the + end with single linkage. Decreases RAM usage and + increases speed, and the cost of a minor loss in + precision and the inability to plot + primary_clustering_dendrograms. Especially helpful + when clustering 5000+ genomes. Will be done with + single linkage clustering (default: False) + --primary_chunksize PRIMARY_CHUNKSIZE + Impacts multiround_primary_clustering. If you have + more than this many genomes, process them in chunks of + this size. (default: 5000) + --greedy_secondary_clustering + Use a heuristic to avoid pair-wise comparisons when + doing secondary clustering. Will be done with single + linkage clustering. Only works for fastANI S_algorithm + option at the moment (default: False) + --run_tertiary_clustering + Run an additional round of clustering on the final + genome set. This is especially useful when greedy + clustering is performed and/or to handle cases where + similar genomes end up in different primary clusters. + Only works with dereplicate, not compare. 
(default: + False) + + SCORING CRITERIA + Based off of the formula: + A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size) + F*(centrality - S_ani) + + A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight; F = cent_weight: + -comW COMPLETENESS_WEIGHT, --completeness_weight COMPLETENESS_WEIGHT + completeness weight (default: 1) + -conW CONTAMINATION_WEIGHT, --contamination_weight CONTAMINATION_WEIGHT + contamination weight (default: 5) + -strW STRAIN_HETEROGENEITY_WEIGHT, --strain_heterogeneity_weight STRAIN_HETEROGENEITY_WEIGHT + strain heterogeneity weight (default: 1) + -N50W N50_WEIGHT, --N50_weight N50_WEIGHT + weight of log(genome N50) (default: 0.5) + -sizeW SIZE_WEIGHT, --size_weight SIZE_WEIGHT + weight of log(genome size) (default: 0) + -centW CENTRALITY_WEIGHT, --centrality_weight CENTRALITY_WEIGHT + Weight of (centrality - S_ani) (default: 1) + + WARNINGS: + --warn_dist WARN_DIST + How far from the threshold to throw cluster warnings + (default: 0.25) + --warn_sim WARN_SIM Similarity threshold for warnings between dereplicated + genomes (default: 0.98) + --warn_aln WARN_ALN Minimum aligned fraction for warnings between + dereplicated genomes (ANIn) (default: 0.25) + + Example: dRep dereplicate output_dir/ -g /path/to/genomes/*.fasta Work Directory -------------- -The work directory is where all of the program's internal workings, log files, cached data, and output is stored. When running dRep modules multiple times on the same dataset, **it is essential** that you use the same work directory so the program can find the results of previous runs. +The work directory is where all of the program's internal workings, log files, cached data, and output is stored. .. 
seealso:: @@ -47,300 +337,22 @@ The work directory is where all of the program's internal workings, log files, c :doc:`advanced_use` for access to the raw internal data (which can be very useful) -Compare and Dereplicate ------- -These are higher-level operations that call the modules below in succession. - -Compare runs the modules: - -* cluster -* bonus -* evaluate -* analyze - -Dereplicate runs the modules: - -* filter -* cluster -* choose -* bonus -* evaluate -* analyze - -Filter ------- +Genome filtering +----------------- -Filter is used filter the genome set (for why this is necessary, see :doc:`choosing_parameters`). This is done using checkM. All genomes which don't pass the length threshold are filtered first to avoid running checkM unnecessarily. All genomes which don't pass checkM thresholds are filtered before comparisons are run to avoid running comparisons unnecessarily. +In the `dereplicate` module, the genome set is quality filtered first (for why this is necessary, see :doc:`choosing_parameters`). This is done using checkM. All genomes which don't pass the length threshold are filtered first to avoid running checkM unnecessarily. All genomes which don't pass checkM thresholds are filtered before comparisons are run to avoid running comparisons unnecessarily. .. warning:: All genomes must have at least one ORF called or else checkM will stall, so a length minimum of at least 10,000bp is recommended. 
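The filtering order described above can be sketched as follows. This is a minimal illustration and not dRep's actual code: the function name ``two_stage_filter`` and the ``assess_quality`` callable (a stand-in for checkM) are hypothetical, and treating the thresholds as inclusive is an assumption. The default values are the ones listed in the ``dereplicate`` help.

```python
from typing import Callable, Dict, List, Tuple


def two_stage_filter(
    lengths: Dict[str, int],
    assess_quality: Callable[[str], Tuple[float, float]],
    min_length: int = 50000,          # -l / --length default from the help
    min_completeness: float = 75.0,   # -comp default from the help
    max_contamination: float = 25.0,  # -con default from the help
) -> List[str]:
    """Return the genomes that pass both the length and the quality filter.

    assess_quality is a hypothetical stand-in for checkM: it takes a genome
    name and returns (completeness, contamination) as 0-100 values.
    """
    # Stage 1: the cheap length filter runs first, so the expensive
    # quality assessment is never run on genomes that are too short.
    passed_length = [name for name, n in lengths.items() if n >= min_length]

    # Stage 2: quality filter, applied only to genomes that survived stage 1.
    # Inclusive comparisons are an assumption made for this sketch.
    kept = []
    for name in passed_length:
        completeness, contamination = assess_quality(name)
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(name)
    return kept


# Made-up example data: b.fasta is too short, c.fasta is too incomplete.
lengths = {"a.fasta": 2_500_000, "b.fasta": 30_000, "c.fasta": 1_800_000}
scores = {"a.fasta": (95.0, 2.0), "c.fasta": (60.0, 1.0)}
assessed = []  # records which genomes the quality step was actually run on


def fake_quality(name: str) -> Tuple[float, float]:
    assessed.append(name)
    return scores[name]


kept = two_stage_filter(lengths, fake_quality)
print(kept)
print(assessed)
```

Only the genomes in ``kept`` would go on to the comparison steps. In this sketch the quality step is never called for ``b.fasta``, which is the point of filtering on length first.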
-To see the command-line options, check the help:: - - $ dRep filter -h - usage: dRep filter [-p PROCESSORS] [-d] [-h] [-l LENGTH] [-comp COMPLETENESS] - [-con CONTAMINATION] [--ignoreGenomeQuality] - [-g [GENOMES [GENOMES ...]]] [--genomeInfo GENOMEINFO] - [--checkM_method {taxonomy_wf,lineage_wf}] - [--set_recursion SET_RECURSION] - work_directory - - positional arguments: - work_directory Directory where data and output - *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS *** - - SYSTEM PARAMETERS: - -p PROCESSORS, --processors PROCESSORS - threads (default: 6) - -d, --debug make extra debugging output (default: False) - -h, --help show this help message and exit - - FILTERING OPTIONS: - -l LENGTH, --length LENGTH - Minimum genome length (default: 50000) - -comp COMPLETENESS, --completeness COMPLETENESS - Minumum genome completeness (default: 75) - -con CONTAMINATION, --contamination CONTAMINATION - Maximum genome contamination (default: 25) - --ignoreGenomeQuality - Don't run checkM or do any quality filtering. NOT - RECOMMENDED! This is useful for use with - bacteriophages or eukaryotes or things where checkM - scoring does not work. Will only choose genomes based - on length and N50 (default: False) - - I/O PARAMETERS: - -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]] - genomes to filter in .fasta format. Not necessary if - Bdb or Wdb already exist (default: None) - --genomeInfo GENOMEINFO - location of .csv file containing quality information - on the genomes. Must contain: ["genome"(basename of - .fasta file of that genome), "completeness"(0-100 - value for completeness of the genome), - "contamination"(0-100 value of the contamination of - the genome)] (default: None) - --checkM_method {taxonomy_wf,lineage_wf} - Either lineage_wf (more accurate) or taxonomy_wf - (faster) (default: lineage_wf) - --set_recursion SET_RECURSION - Increases the python recursion limit. NOT RECOMMENDED - unless checkM is crashing due to recursion issues. 
- Recommended to set to 2000 if needed, but setting this - could crash python (default: 0) - -Cluster -------- - -Cluster is the module that does the actual primary and secondary comparisons. Choosing parameters here can get a bit complicated- see :doc:`choosing_parameters` for information. - -To see the command-line options, check the help:: - - $ dRep cluster -h - usage: dRep cluster [-p PROCESSORS] [-d] [-h] [-ms MASH_SKETCH] - [--S_algorithm {ANIn,gANI,ANImf,goANI}] - [-n_PRESET {normal,tight}] [-pa P_ANI] [-sa S_ANI] - [--SkipMash] [--SkipSecondary] [-nc COV_THRESH] - [-cm {total,larger}] [--clusterAlg CLUSTERALG] - [-g [GENOMES [GENOMES ...]]] - work_directory - - positional arguments: - work_directory Directory where data and output - *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS *** - - SYSTEM PARAMETERS: - -p PROCESSORS, --processors PROCESSORS - threads (default: 6) - -d, --debug make extra debugging output (default: False) - -h, --help show this help message and exit - - GENOME COMPARISON PARAMETERS: - -ms MASH_SKETCH, --MASH_sketch MASH_SKETCH - MASH sketch size (default: 1000) - --S_algorithm {ANIn,gANI,ANImf,goANI} - Algorithm for secondary clustering comaprisons: - ANImf = (RECOMMENDED) Align whole genomes with nucmer; filter alignment; compare aligned regions - ANIn = Align whole genomes with nucmer; compare aligned regions - gANI = Identify and align ORFs; compare aligned ORFS - (default: ANImf) - -n_PRESET {normal,tight} - Presets to pass to nucmer - tight = only align highly conserved regions - normal = default ANIn parameters (default: normal) - - CLUSTERING PARAMETERS: - -pa P_ANI, --P_ani P_ANI - ANI threshold to form primary (MASH) clusters - (default: 0.9) - -sa S_ANI, --S_ani S_ANI - ANI threshold to form secondary clusters (default: - 0.99) - --SkipMash Skip MASH clustering, just do secondary clustering on - all genomes (default: False) - --SkipSecondary Skip secondary clustering, just perform MASH - clustering (default: False) - 
-nc COV_THRESH, --cov_thresh COV_THRESH - Minmum level of overlap between genomes when doing - secondary comparisons (default: 0.1) - -cm {total,larger}, --coverage_method {total,larger} - Method to calculate coverage of an alignment - (for ANIn/ANImf only; gANI can only do larger method) - total = 2*(aligned length) / (sum of total genome lengths) - larger = max((aligned length / genome 1), (aligned_length / genome2)) - (default: larger) - --clusterAlg CLUSTERALG - Algorithm used to cluster genomes (passed to - scipy.cluster.hierarchy.linkage (default: average) - - I/O PARAMETERS: - -g [GENOMES [GENOMES ...]], --genomes [GENOMES [GENOMES ...]] - genomes to cluster in .fasta format. Not necessary if - already loaded sequences with the "filter" operation - (default: None) - -Choose ------- - -Choose is the module that picks the best genome from each secondary cluster identified in **Cluster**. It does this based off of the formula: - -.. math:: score = A(completeness) – B(contamination) + C(Contamination * (strain_heterogeneity/100)) + D(log(N50)) + E(log(size)) - -Where A-E are command-line arguments, and the genome with the highest score is the "best". By default, A-E are 1,5,1,0.5,0 respectively. 
- -To see the command-line options, check the help:: - - $ dRep choose -h - usage: dRep choose [-p PROCESSORS] [-d] [-h] [-comW COMPLETENESS_WEIGHT] - [-conW CONTAMINATION_WEIGHT] - [-strW STRAIN_HETEROGENEITY_WEIGHT] [-N50W N50_WEIGHT] - [-sizeW SIZE_WEIGHT] - [--checkM_method {lineage_wf,taxonomy_wf}] - [--genomeInfo GENOMEINFO] [--ignoreGenomeQuality] - work_directory - - positional arguments: - work_directory Directory where data and output - *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS *** - - SYSTEM PARAMETERS: - -p PROCESSORS, --processors PROCESSORS - threads (default: 6) - -d, --debug make extra debugging output (default: False) - -h, --help show this help message and exit - - SCORING CRITERIA - Based off of the formula: - A*Completeness - B*Contamination + C*(Contamination * (strain_heterogeneity/100)) + D*log(N50) + E*log(size) - - A = completeness_weight; B = contamination_weight; C = strain_heterogeneity_weight; D = N50_weight; E = size_weight: - -comW COMPLETENESS_WEIGHT, --completeness_weight COMPLETENESS_WEIGHT - completeness weight (default: 1) - -conW CONTAMINATION_WEIGHT, --contamination_weight CONTAMINATION_WEIGHT - contamination weight (default: 5) - -strW STRAIN_HETEROGENEITY_WEIGHT, --strain_heterogeneity_weight STRAIN_HETEROGENEITY_WEIGHT - strain heterogeneity weight (default: 1) - -N50W N50_WEIGHT, --N50_weight N50_WEIGHT - weight of log(genome N50) (default: 0.5) - -sizeW SIZE_WEIGHT, --size_weight SIZE_WEIGHT - weight of log(genome size) (default: 0) - - OTHER: - --checkM_method {lineage_wf,taxonomy_wf} - Either lineage_wf (more accurate) or taxonomy_wf - (faster) (default: lineage_wf) - --genomeInfo GENOMEINFO - location of .csv file containing quality information - on the genomes. 
Must contain: ["genome"(basename of - .fasta file of that genome), "completeness"(0-100 - value for completeness of the genome), - "contamination"(0-100 value of the contamination of - the genome)] (default: None) - --ignoreGenomeQuality - Don't run checkM or do any quality filtering. NOT - RECOMMENDED! This is useful for use with - bacteriophages or eukaryotes or things where checkM - scoring does not work. Will only choose genomes based - on length and N50 (default: False) - -Analyze -------- - -Analyze is the module that makes all of the figures. - -To see the command-line options, check the help:: - - $ dRep analyze -h - usage: dRep analyze [-p PROCESSORS] [-d] [-h] [-pl [PLOTS [PLOTS ...]]] - work_directory - - positional arguments: - work_directory Directory where data and output - *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS *** - - SYSTEM PARAMETERS: - -p PROCESSORS, --processors PROCESSORS - threads (default: 6) - -d, --debug make extra debugging output (default: False) - -h, --help show this help message and exit - - PLOTTING: - -pl [PLOTS [PLOTS ...]], --plots [PLOTS [PLOTS ...]] - Plots. Input 'all' or 'a' to plot all - 1) Primary clustering dendrogram - 2) Secondary clustering dendrograms - 3) Secondary clustering MDS - 4) Comparison scatterplots - 5) Cluster scorring plot - 6) Winning genomes - (default: None) - -Evaluate +Warnings -------- -Evaluate performs a series of checks to alert the user to potential problems with de-replication. It has two things that it can look for: +A series of checks is performed to alert the user to potential problems with de-replication. The checks look for two things: **de-replicated genome similarity**- this is comparing all of the de-replicated genomes to each other and making sure they're not too similar. This is to try and catch cases where similar genomes were split into different primary clusters, and thus failed to be de-replicated. 
*Depending on the number of de-replicated genomes, this can take a while* **secondary clusters that were almost different**- this alerts you to cases where genomes are on the edge between being considered "same" or "different", depending on the clustering parameters you used. *This module reads the parameters you used during clustering from the work directory, so you don't need to specify them again.* -To see the command-line options, check the help:: - - $ dRep evaluate -h - usage: dRep evaluate [-p PROCESSORS] [-d] [-h] [--warn_dist WARN_DIST] - [--warn_sim WARN_SIM] [--warn_aln WARN_ALN] - [-e [EVALUATE [EVALUATE ...]]] - work_directory - - positional arguments: - work_directory Directory where data and output - *** USE THE SAME WORK DIRECTORY FOR ALL DREP OPERATIONS *** - - SYSTEM PARAMETERS: - -p PROCESSORS, --processors PROCESSORS - threads (default: 6) - -d, --debug make extra debugging output (default: False) - -h, --help show this help message and exit - - WARNINGS: - --warn_dist WARN_DIST - How far from the threshold to throw cluster warnings - (default: 0.25) - --warn_sim WARN_SIM Similarity threshold for warnings between dereplicated - genomes (default: 0.98) - --warn_aln WARN_ALN Minimum aligned fraction for warnings between - dereplicated genomes (ANIn) (default: 0.25) - - EVALUATIONS: - -e [EVALUATE [EVALUATE ...]], --evaluate [EVALUATE [EVALUATE ...]] - Things to evaluate Input 'all' or 'a' to evaluate all - 1) Evaluate de-replicated genome similarity - 2) Throw warnings for clusters that were almost different - 3) Generate a database of information on winning genomes - (default: None) - -Bonus ------ - -Bonus consists of operations that don't really fit in with the functions of dRep, but can be helpful. Currently the only thing it can do is determine taxonomy of your bins. This is done using centrifuge, similar to how `anvi'o does it `_. If you choose to use this option, the taxonomy of genome will be shown with the filename in most figures. 
+Overall these warnings are a bit half-baked, however, and I personally don't pay attention to them when running dRep myself. \ No newline at end of file