[WIP] docs2.0 #197

Open: wants to merge 28 commits into base `v2_staging`
6 changes: 3 additions & 3 deletions README.md
@@ -1,4 +1,4 @@
[![image](https://travis-ci.org/dib-lab/dammit.svg)](https://travis-ci.org/dib-lab/dammit)
![tests](https://github.com/github/docs/actions/workflows/tests.yml/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/dammit/badge/)](http://dammit.readthedocs.io/en/latest)

*"I love writing BLAST parsers!" -- no one, ever*
@@ -19,11 +19,11 @@ Install dammit with (bio)conda:

Download and install a subset of the databases:

dammit databases --install --quick
dammit run --pipeline quick databases --install

And then annotate with:

dammit annotate <transcriptome_fasta>
dammit run --pipeline quick annotate <transcriptome_fasta>

Head over to the [docs](http://dib-lab.github.io/dammit/) for much more detailed
information!
102 changes: 54 additions & 48 deletions doc/about.md
@@ -2,7 +2,7 @@

This page goes a little more in depth on the software and its goals.

## Motivations
## Background and Motivations

Several different factors motivated dammit's development. The first of
these was the sea lamprey transcriptome project, which had annotation as
@@ -24,80 +24,86 @@ Implicit to these motivations is some idea of what a good annotator
5. It should be relatively fast
6. It should try to be correct, insofar as any computational approach
can be "correct"
7. It should give the user some measure of confidence for its results.

## The Obligatory Flowchart

![The Workflow](static/workflow.svg)

## Software Used

- TransDecoder
- BUSCO
- HMMER
- Infernal
- LAST
- crb-blast (for now)
- pydoit (under the hood)

All of these are Free Software, as in freedom and beer.

## Databases
## Databases Used

- Pfam-A
- Rfam
- OrthoDB
- Swiss-Prot
- BUSCO databases
- Uniref90
- User-supplied protein databases

The last one is important, and sometimes ignored.
Dammit uses an approach similar to Conditional Reciprocal Best Blast
to map to user-supplied protein databases (details below).

## Conditional Reciprocal Best LAST
For more about the included databases,
see the [About Databases](database-about.md) section.

Building off Richard and colleagues' work on Conditional Reciprocal Best
BLAST, I've implemented a new version with Python and LAST -- CRBL.
The original lives [here](https://github.com/cboursnell/crb-blast).

Why??
## Software Used

- BLAST is too slooooooow
- Ruby is yet another dependency to have users install
- With Python and scikit learn, I have freedom to toy with models (and
learn stuff)
The specific set of software and databases used can be modified by specifying different [pipelines](pipelines.md).
The full set of software that can be run is:

And, of course, some of these databases are BIG. Doing `blastx` and
`tblastn` between a reasonably sized transcriptome and Uniref90 is not
an experience you want to have.
- TransDecoder
- BUSCO
- HMMER
- Infernal
- LAST
- shmlast (for crb-blast to user-supplied protein databases)

i.e., practical concerns.
All of these are Free Software, as in freedom and beer.

## A brief intro to CRBB
### shmlast: Conditional Reciprocal Best LAST for mapping to user databases

- Reciprocal Best Hits (RBH) is a standard method for ortholog
detection
- Transcriptomes have multiple transcript isoforms, which
confound RBH
- CRBB uses machine learning to get at this problem
Reciprocal Best Hit mapping (RBH) is a standard method for ortholog detection.
However, transcriptomes have multiple transcript isoforms, which confound RBH.

![](static/RBH.svg)

CRBB attempts to associate those isoforms with appropriate annotations
by learning an appropriate e-value cutoff for different transcript
lengths.
**Conditional Reciprocal Best Blast (CRBB)** attempts to associate those isoforms
with appropriate annotations by learning an appropriate e-value cutoff for
different transcript lengths. The original implementation of CRBB
can be found [here](https://github.com/cboursnell/crb-blast).

![CRBB](static/CRBB_decision.png)

*from
http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365#s5*

## CRBL

For CRBL, instead of fitting a linear model, we train a model.
*from [Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365)*

- SVM
- Naive bayes
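To make the cutoff-learning idea concrete, here is a toy sketch in Python. It is not dammit's actual model: real CRBB/CRBL fit a model (linear fit, SVM, naive Bayes) over all training hits, while this sketch simply bins transcripts by length and takes the weakest trusted reciprocal-best-hit score in each bin as that bin's cutoff.

```python
import math
from collections import defaultdict

def learn_cutoffs(rbh_hits, bin_width=100):
    """Learn a per-length-bin e-value cutoff from reciprocal best hits.

    rbh_hits: iterable of (transcript_length, evalue) pairs from trusted RBHs.
    Returns {length_bin: minimum acceptable -log10(evalue)}.
    """
    bins = defaultdict(list)
    for length, evalue in rbh_hits:
        # work in -log10(evalue) space; cap zero e-values at a large score
        score = 300.0 if evalue == 0 else -math.log10(evalue)
        bins[length // bin_width].append(score)
    # cutoff per bin: the weakest score seen among the trusted RBHs
    return {b: min(scores) for b, scores in bins.items()}

def passes(cutoffs, length, evalue, bin_width=100):
    """Accept a non-RBH hit if it beats its length bin's learned cutoff."""
    score = 300.0 if evalue == 0 else -math.log10(evalue)
    cutoff = cutoffs.get(length // bin_width)
    return cutoff is not None and score >= cutoff

# toy training data: longer transcripts tend to support stronger e-values
training = [(150, 1e-5), (180, 1e-6), (450, 1e-20), (480, 1e-25)]
cutoffs = learn_cutoffs(training)
print(passes(cutoffs, 160, 1e-7))   # True: beats the 100-199 bin cutoff
print(passes(cutoffs, 460, 1e-10))  # False: too weak for the 400-499 bin
```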
**shmlast** is a reimplementation of the Conditional Reciprocal Best Hits
algorithm for finding potential orthologs between a transcriptome and
a species-specific protein database. It uses the LAST aligner and the
pydata stack to achieve much better performance while staying in the
Python ecosystem.

One limitation is that LAST has no equivalent to `tblastn`. So, we find
the RBHs using the TransDecoder ORFs, and then use the model on the
translated transcriptome versus database hits.

`shmlast` is published in JOSS, doi:[10.21105/joss.00142](https://joss.theoj.org/papers/10.21105/joss.00142).
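The reciprocal-best-hit logic at the heart of CRBB can be sketched in a few lines of Python. This is a self-contained illustration of the concept, not shmlast's actual implementation:

```python
def best_hits(hits):
    """hits: iterable of (query, target, evalue); keep the best target per query."""
    best = {}
    for query, target, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue)
    return {q: t for q, (t, _) in best.items()}

def reciprocal_best_hits(fwd_hits, rev_hits):
    """Pairs (a, b) where b is a's best hit AND a is b's best hit."""
    fwd = best_hits(fwd_hits)
    rev = best_hits(rev_hits)
    return {(a, b) for a, b in fwd.items() if rev.get(b) == a}

# toy alignments: transcriptome -> proteins, and proteins -> transcriptome
fwd = [("tx1", "protA", 1e-30), ("tx1", "protB", 1e-5), ("tx2", "protB", 1e-12)]
rev = [("protA", "tx1", 1e-28), ("protB", "tx3", 1e-15), ("protB", "tx2", 1e-14)]
print(reciprocal_best_hits(fwd, rev))  # {('tx1', 'protA')}
```

Note that tx2 is dropped even though protB is its best hit, because protB's best hit points elsewhere; this is exactly the strictness that isoforms run afoul of, and that the conditional (learned-cutoff) step relaxes.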


## The Dammit Software

dammit is built on the [Snakemake](https://snakemake.readthedocs.io/en/stable/)
workflow management system. This means that the dammit pipeline enjoys all the features of any
Snakemake workflow: reproducibility, ability to resume, cluster support, and per-task environment
management. Each step in dammit's pipeline(s) is implemented as a Snakemake
[wrapper](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#wrappers);
when dammit is executed, it generates the targets for the pipeline being run as specified in its
pipelines file and passes them along to the Snakemake executable. The dammit frontend simplifies
the interface for the user and constrains the inputs and options to ensure the pipeline
will always run correctly.
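As a purely hypothetical illustration of that target-generation step, the mapping from a pipeline name to Snakemake targets might look like the following. The analysis names and the `.done`-file layout here are invented for illustration and are not dammit's actual pipelines file:

```python
# Hypothetical sketch of pipeline -> target resolution. The step names and
# output layout are illustrative assumptions, not dammit's real config.
PIPELINES = {
    "quick": ["transdecoder", "busco", "shmlast"],
    "default": ["transdecoder", "busco", "shmlast",
                "hmmer_pfam", "infernal_rfam", "last_orthodb"],
}

def targets_for(pipeline, transcriptome):
    """Expand a pipeline name into the per-step sentinel files Snakemake builds."""
    stem = transcriptome.rsplit(".", 1)[0]
    return [f"{stem}.dammit/{step}.done" for step in PIPELINES[pipeline]]

print(targets_for("quick", "TRANSCRIPTOME.fasta"))
```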

One of the essential, and most annoying, parts of annotation is the conversion and collation
of information from many different file formats. Dammit includes a suite of minimal command
line utilities implementing a number of these things, including converting several formats
to GFF3, merging GFF3 files, and filtering alignment results for best hits. More details on
these utilities can be found in the [components](dammit-components.md) section.
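To see what such a conversion involves, here is a minimal sketch of turning one alignment hit into a GFF3 feature line, i.e. assembling the nine tab-separated columns. The field choices (feature type, putting the e-value in the score column) are illustrative assumptions, not dammit's exact conventions:

```python
def hit_to_gff3(seqid, source, start, end, evalue, target):
    """Format one alignment hit as a GFF3 feature line (1-based, inclusive).

    Columns: seqid, source, type, start, end, score, strand, phase, attributes.
    """
    attrs = f"Target={target};evalue={evalue:g}"
    cols = [seqid, source, "protein_match", str(start), str(end),
            f"{evalue:g}", "+", ".", attrs]
    return "\t".join(cols)

print(hit_to_gff3("tx1", "LAST", 12, 311, 1e-30, "protA"))
```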

147 changes: 147 additions & 0 deletions doc/annotate.md
@@ -0,0 +1,147 @@
# Annotate

The dammit `annotate` component uses the installed databases for transcriptome annotation.

## Just annotate it, dammit!

After you've properly installed the [databases](database-usage.md), you can start the annotation.
To run the annotation, you only need to provide a set of transcripts to annotate.


    dammit run annotate TRANSCRIPTOME.fasta


Optionally, allow `dammit` to use additional threads with the `--n-threads` option:

    dammit run --n-threads 4 annotate TRANSCRIPTOME.fasta


If you'd like to customize the output or other parameters such as the e-value for similarity searches,
you can provide customization on the command line or in a configuration file.


## Additional Usage info

To see the general dammit usage information, run:

    dammit run --help

You should see the following:

```
Usage: dammit run [OPTIONS] COMMAND [ARGS]...

Run the annotation pipeline or install databases.

Options:
--database-dir TEXT Directory to store databases. Existing
databases will not be overwritten.

--conda-dir TEXT Directory to store snakemake-created conda
environments.

--temp-dir TEXT Directory to store dammit temp files.
--busco-group [bacteria_odb10|acidobacteria_odb10|actinobacteria_phylum_odb10|actinobacteria_class_odb10|corynebacteriales_odb10|...]
BUSCO group(s) to use/install.
--n-threads INTEGER Number of threads for overall workflow
execution

--max-threads-per-task INTEGER Max threads to use for a single step.
--busco-config-file TEXT Path to an alternative BUSCO config file;
otherwise, BUSCO will attempt to use its
default installation which will likely only
work on bioconda. Advanced use only!

--pipeline [default|quick|full|nr]
Which pipeline to use. Pipeline options:
quick: excludes: the Infernal Rfam tasks,
the HMMER Pfam tasks, and the LAST OrthoDB
and uniref90 tasks. Best for users just
looking to get basic stats and conditional
reciprocal best LAST from a protein
database. full: Run a "complete"
annotation; includes uniref90, which is left
out of the default pipeline because it is
huge and homology searches take a long time.
nr: Also include annotation to NR database,
which is left out of the default and "full"
pipelines because it is huge and homology
searches take a long time. More info at
https://dib-lab.github.io/dammit.

--help Show this message and exit.

Commands:
annotate The main annotation pipeline.
databases The database preparation pipeline.
```

The `--pipeline` option can be used to switch the set of databases being used for annotation.
See the [annotation pipelines](pipelines.md) doc for info about each specific pipeline.
Note that these pipelines all run a core set of programs. If you re-run an annotation with a
larger pipeline, dammit will not re-run analyses that have already been completed. Instead,
dammit will run any new analyses, and integrate them into the final fasta and gff3.

To see annotation-specific configuration info, run:

    dammit run annotate --help

You should see the following:

```
Usage: dammit run annotate [OPTIONS] TRANSCRIPTOME [EXTRA_SNAKEMAKE_ARGS]...

The main annotation pipeline. Calculates assembly stats; runs BUSCO; runs
LAST against OrthoDB (and optionally uniref90), HMMER against Pfam,
Infernal against Rfam, and Conditional Reciprocal Best-hit Blast against
user databases; and aggregates all results in a properly formatted GFF3
file.

Options:
-n, --base-name TEXT Base name to use for renaming the input
transcripts. The new names will be of the form
<name>_<X>. It should not have spaces, pipes,
ampersands, or other characters with special
meaning to BASH. Superseded by --regex-rename.

--regex-rename TEXT Rename transcripts using a regex pattern. The
regex should follow Python `re` format and
contain a named field keyed as `name` that
extracts the desired string. For example,
providing "(?P<name>^[a-zA-Z0-9\.]+)" will match
from the beginning of the sequence header up to
the first symbol that is not alphanumeric or a
period. Supersedes --base-name.

--rename / --no-rename If --no-rename, original transcript names are
preserved in the final annotated FASTA. --base-
name is still used in intermediate files. If
--rename (the default behavior), the renamed
transcript names are used in the final annotated
FASTA.

-e, --global-evalue FLOAT global e-value cutoff for similarity searches.
-o, --output-dir TEXT Output directory. By default this will be the
name of the transcriptome file with `.dammit`
appended

-u, --user-database TEXT Optional additional protein databases. These
will be searched with CRB-blast.

--dry-run
--help Show this message and exit.
```
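The `--regex-rename` pattern shown in the help text can be tried directly with Python's `re` module (the example headers are made up):

```python
import re

# pattern from the help text: capture everything from the start of the header
# up to the first character that is not alphanumeric or a period, into a
# named group called `name`
pattern = re.compile(r"(?P<name>^[a-zA-Z0-9\.]+)")

for header in ["Transcript.12345 len=1385", "TRINITY_DN1000_c0_g1_i1"]:
    print(pattern.match(header).group("name"))
# Transcript.12345
# TRINITY
```

Note how the underscore in the second header stops the match, so a Trinity-style name would be truncated; pick a pattern that matches your assembler's naming scheme.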

Add these options as needed. For example, annotate against a user database and specify an
output directory name like so:

    dammit run annotate --user-database DB-FILE --output-dir dammit-results TRANSCRIPTOME.fasta

General run arguments need to be added in front of `annotate`, e.g.:

    dammit run --n-threads 4 --pipeline quick annotate --user-database DB-FILE --output-dir dammit-results TRANSCRIPTOME.fasta




21 changes: 21 additions & 0 deletions doc/cluster.md
@@ -0,0 +1,21 @@
# Distributing dammit jobs across a cluster

`dammit` can run on a single compute instance, or can submit each individual job to a job scheduler,
if you provide the right submission information for your cluster. Job submission is handled
via snakemake, so please see the [snakemake cluster documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html)
for the most up-to-date version of these instructions.

## Using A Snakemake Profile for Job Submission

### Set up a snakemake profile for your cluster

We recommend using a [snakemake profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles)
to enable job submission for the job scheduler used by your cluster. You can start from the cookiecutter
profiles [here](https://github.com/snakemake-profiles/doc) or write your own.

### Direct dammit to use the snakemake profile

When you'd like dammit to submit jobs to a job scheduler, direct it to use your cluster profile by
adding `--profile <profile-folder-name>` at or near the end of your dammit command (after all dammit-specific arguments).
Again, see the [snakemake profile documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) for additional information.
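For orientation, a minimal profile `config.yaml` for a SLURM cluster might look like the following. The values are illustrative assumptions, and the `cluster` submission string must be adapted to your scheduler and site:

```yaml
# <profile-folder-name>/config.yaml -- illustrative only
cluster: "sbatch --time={resources.time} --mem={resources.mem_mb} -c {threads}"
jobs: 16
latency-wait: 60
```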

38 changes: 38 additions & 0 deletions doc/configuration.md
@@ -0,0 +1,38 @@
# Advanced Configuration

Dammit's overall memory and CPU usage can be specified at the command line.
The [annotation pipelines](pipelines.md) section contains info on the
recommended minimum resources for each pipeline.


Dammit can be configured in two ways:

- providing options on the command line
- providing options within a YAML configuration file.


## **`dammit config`**

```
Usage: dammit config [OPTIONS] COMMAND [ARGS]...

Show dammit configuration information.

Options:
--help Show this message and exit.

Commands:
busco-groups Lists the available BUSCO group databases.
clean-temp Clear out shared dammit temp files.
show-default Show the selected default configuration file.
show-directories List dammit directory locations.
```

## Tool-Specific Configuration

Tool-specific parameters can be modified via a custom configuration file.
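As a hypothetical sketch only, such a file might override per-tool parameters like this. The tool names and keys here are assumptions; run `dammit config show-default` to see the real keys and defaults for your version:

```yaml
# custom-config.yml -- illustrative structure, not dammit's actual schema
busco:
  params: "--limit 3"
hmmsearch:
  params: "--domE 1e-5"
```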




