diff --git a/README.md b/README.md index e38883c8..272e95d7 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -[![image](https://travis-ci.org/dib-lab/dammit.svg)](https://travis-ci.org/dib-lab/dammit) +![tests](https://github.com/github/docs/actions/workflows/tests.yml/badge.svg) [![Documentation Status](https://readthedocs.org/projects/dammit/badge/)](http://dammit.readthedocs.io/en/latest) *"I love writing BLAST parsers!" -- no one, ever* @@ -19,11 +19,11 @@ Install dammit with (bio)conda: Download and install a subset of the databases: - dammit databases --install --quick + dammit run --pipeline quick databases --install And the annotate with: - dammit annotate + dammit run --pipeline quick annotate Head over to the [docs](http://dib-lab.github.io/dammit/) for much more detailed information! diff --git a/doc/about.md b/doc/about.md index 7a1a1bca..1d975cba 100644 --- a/doc/about.md +++ b/doc/about.md @@ -2,7 +2,7 @@ This page goes a little more in depth on the software and its goals. -## Motivations +## Background and Motivations Several different factors motivated dammit's development. The first of these was the sea lamprey transcriptome project, which had annotation as @@ -24,80 +24,86 @@ Implicit to these motivations is some idea of what a good annotator 5. It should be relatively fast 6. It should try to be correct, insofar as any computational approach can be "correct" -7. It should give the user some measure of confidence for its results. ## The Obligatory Flowchart ![The Workflow](static/workflow.svg) -## Software Used - -- TransDecoder -- BUSCO -- HMMER -- Infernal -- LAST -- crb-blast (for now) -- pydoit (under the hood) - -All of these are Free Software, as in freedom and beer - -## Databases +## Databases Used - Pfam-A - Rfam - OrthoDB +- Swiss-Prot - BUSCO databases - Uniref90 - User-supplied protein databases -The last one is important, and sometimes ignored. +The last one is important, and sometimes ignored. +Dammit uses an approach similar to Conditional Reciprocal Best Blast +to map to user-supplied protein databases (details below). -## Conditional Reciprocal Best LAST +To see more about the included databases, +see the [About Databases](database-about.md) section. -Building off Richard and co's work on Conditional Reciprocal Best -BLAST, I've implemented a new version with Python and LAST -- CRBL. -The original lives [here](https://github.com/cboursnell/crb-blast). - -Why?? +## Software Used -- BLAST is too slooooooow -- Ruby is yet another dependency to have users install -- With Python and scikit learn, I have freedom to toy with models (and - learn stuff) +The specific set of software and databases used can be modified by specifying different [pipelines](pipelines.md). +The full set of software than can be run is: -And, of course, some of these databases are BIG. Doing `blastx` and -`tblastn` between a reasonably sized transcriptome and Uniref90 is not -an experience you want to have. +- TransDecoder +- BUSCO +- HMMER +- Infernal +- LAST +- shmlast (for crb-blast to user-supplied protein databases) -ie, practical concerns. +All of these are Free Software, as in freedom and beer. 
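+
+As a quick sketch of how these pieces fit together (see the
+[pipelines](pipelines.md) and [annotate](annotate.md) docs for full usage),
+the pipeline and any user-supplied protein database are chosen on the
+command line; the file names below are placeholders only:
+
+```
+# install only the databases needed by the quick pipeline
+dammit run --pipeline quick databases --install
+
+# annotate, additionally mapping to a user-supplied protein FASTA with shmlast
+dammit run --pipeline quick annotate transcriptome.fasta --user-database my_proteins.fasta
+```
+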
-## A brief intro to CRBB +### shmlast: Conditional Reciprocal Best LAST for mapping to user databases -- Reciprocal Best Hits (RBH) is a standard method for ortholog - detection -- Transcriptomes have multiple multiple transcript isoforms, which - confounds RBH -- CRBB uses machine learning to get at this problem +Reciprocal Best Hit mapping (RBH) is a standard method for ortholog detection. +However, transcriptomes have multiple transcript isoforms, which confound RBH. ![](static/RBH.svg) -CRBB attempts to associate those isoforms with appropriate annotations -by learning an appropriate e-value cutoff for different transcript -lengths. +**Conditional Reciprocal Best Blast (CRBB)** attempts to associate those isoforms +with appropriate annotations by learning an appropriate e-value cutoff for +different transcript lengths. The original implementation of CRBB +can be found [here](https://github.com/cboursnell/crb-blast). ![CRBB](static/CRBB_decision.png) -*from -http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365\#s5* - -## CRBL - -For CRBL, instead of fitting a linear model, we train a model. +*from [Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365)* -- SVM -- Naive bayes +**shmlast** is a reimplementation of the Conditional Reciprocal Best Hits +algorithm for finding potential orthologs between a transcriptome and +a species-specific protein database. It uses the LAST aligner and the +pydata stack to achieve much better performance while staying in the +Python ecosystem. One limitation is that LAST has no equivalent to `tblastn`. So, we find the RBHs using the TransDecoder ORFs, and then use the model on the -translated transcriptome versus database hits. +translated transcriptome versus database hits. + +`shmlast` is published in JOSS, doi:[10.21105/joss.00142](https://joss.theoj.org/papers/10.21105/joss.00142). + + +## The Dammit Software + +dammit is built on the [Snakemake](https://snakemake.readthedocs.io/en/stable/) +workflow management system. This means that the dammit pipeline enjoys all the features of any +Snakemake workflow: reproducibility, ability to resume, cluster support, and per-task environment +management. Each step in dammit's pipeline(s) is implemented as a Snakemake +[wrapper](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#wrappers); +when dammit is executed, it generates the targets for the pipeline being run as specified in its +pipelines file and passes them along to the Snakemake executable. The dammit frontend simplifies +the interface for the user and constrains the inputs and options to ensure the pipeline +will always run correctly. + +One of the essential, and most annoying, parts of annotation is the conversion and collation +of information from many different file formats. Dammit includes a suite of minimal command +line utilities implementing a number of these things, including converting several formats +to GFF3, merging GFF3 files, and filtering alignment results for best hits. More details on +these utilities can be found in the [components](dammit-components.md) section. + diff --git a/doc/annotate.md b/doc/annotate.md new file mode 100644 index 00000000..1a678c42 --- /dev/null +++ b/doc/annotate.md @@ -0,0 +1,147 @@ +# Annotate + +The dammit `annotate` component uses the installed databases for transcriptome annotation. 
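+
+In outline, a typical session first checks that the required databases are
+present and then runs the annotation. This is only a sketch;
+`TRANSCRIPTOME.fasta` is a placeholder for your own assembly:
+
+```
+# report which databases the default pipeline still needs
+dammit run databases
+
+# annotate with the default pipeline
+dammit run annotate TRANSCRIPTOME.fasta
+```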
+ +## Just annotate it, dammit! + +After you've properly installed the [databases](database-usage.md), you can start the annotation. +To run the annotation, you only need to provide a set of transcripts to annotate. + + + dammit run annotate TRANSCRIPTOME.fasta + + +Optionally, allow `dammit` to use additional threads using the `n_threads` parameter: + + dammit run --n-threads 4 annotate TRANSCRIPTOME.fasta + + +If you'd like to customize the output or other parameters such as the e-value for similarity searches, +you can provide customization on the command line or in a configuration file. + + +## Additional Usage info + +To see the general dammit usage information, run: + + dammit run --help + +You should see the following: + +``` +Usage: dammit run [OPTIONS] COMMAND [ARGS]... + + Run the annotation pipeline or install databases. + +Options: + --database-dir TEXT Directory to store databases. Existing + databases will not be overwritten. + + --conda-dir TEXT Directory to store snakemake-created conda + environments. + + --temp-dir TEXT Directory to store dammit temp files. + --busco-group [bacteria_odb10|acidobacteria_odb10|actinobacteria_phylum_odb10|actinobacteria_class_odb10|corynebacteriales_odb10|...] + BUSCO group(s) to use/install. + --n-threads INTEGER Number of threads for overall workflow + execution + + --max-threads-per-task INTEGER Max threads to use for a single step. + --busco-config-file TEXT Path to an alternative BUSCO config file; + otherwise, BUSCO will attempt to use its + default installation which will likely only + work on bioconda. Advanced use only! + + --pipeline [default|quick|full|nr] + Which pipeline to use. Pipeline options: + quick: excludes: the Infernal Rfam tasks, + the HMMER Pfam tasks, and the LAST OrthoDB + and uniref90 tasks. Best for users just + looking to get basic stats and conditional + reciprocal best LAST from a protein + database. full: Run a "complete" + annotation; includes uniref90, which is left + out of the default pipeline because it is + huge and homology searches take a long time. + nr: Also include annotation to NR database, + which is left out of the default and "full" + pipelines because it is huge and homology + searches take a long time. More info at + https://dib-lab.github.io/dammit. + + --help Show this message and exit. + +Commands: + annotate The main annotation pipeline. + databases The database preparation pipeline. +``` + +The `--pipeline` option can be used to switch the set of databases being used for annotation. +See the [annotation pipelines](pipelines.md) doc for info about each specific pipeline. +Note that these pipelines all run a core set of programs. If you re-run an annotation with a +larger pipeline, dammit will not re-run analyses that have already been completed. Instead, +dammit will run any new analyses, and integrate them into the final fasta and gff3. + +To see annotation-specific configuration info, run: + + dammit run annotate --help + +You should see the following: + +``` +Usage: dammit run annotate [OPTIONS] TRANSCRIPTOME [EXTRA_SNAKEMAKE_ARGS]... + + The main annotation pipeline. Calculates assembly stats; runs BUSCO; runs + LAST against OrthoDB (and optionally uniref90), HMMER against Pfam, + Infernal against Rfam, and Conditional Reciprocal Best-hit Blast against + user databases; and aggregates all results in a properly formatted GFF3 + file. + +Options: + -n, --base-name TEXT Base name to use for renaming the input + transcripts. The new names will be of the form + _. 
It should not have spaces, pipes, + ampersands, or other characters with special + meaning to BASH. Superseded by --regex-rename. + + --regex-rename TEXT Rename transcripts using a regex pattern. The + regex should follow Python `re` format and + contain a named field keyed as `name` that + extracts the desired string. For example, + providing "(?P^[a-zA-Z0-9\.]+)" will match + from the beginning of the sequence header up to + the first symbol that is not alphanumeric or a + period. Supersedes --base-name. + + --rename / --no-rename If --no-rename, original transcript names are + preserved in the final annotated FASTA. --base- + name is still used in intermediate files. If + --rename (the default behavior), the renamed + transcript names are used in the final annotated + FASTA. + + -e, --global-evalue FLOAT global e-value cutoff for similarity searches. + -o, --output-dir TEXT Output directory. By default this will be the + name of the transcriptome file with `.dammit` + appended + + -u, --user-database TEXT Optional additional protein databases. These + will be searched with CRB-blast. + + --dry-run + --help Show this message and exit. +``` + +Add these options as needed. For example, add annotation to a user database and specify an +output directory name like so: + + dammit run annotate --user-database DB-FILE --output-dir dammit-results + +General run arguments need to be added in front of `annotate`, e.g. - + + dammit run --n-threads 4 --pipeline quick annotate --user-database DB-FILE --output-dir dammit-results + + + + + diff --git a/doc/cluster.md b/doc/cluster.md new file mode 100644 index 00000000..791ac2d0 --- /dev/null +++ b/doc/cluster.md @@ -0,0 +1,21 @@ +# Distributing dammit jobs across a cluster + +`dammit` can run on a single compute instance, or can submit each individual job to a job scheduler, +if you provide the right submission information for your cluster. Job submission is handled +via snakemake, so please see the [snakemake cluster documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html) +for the most up-to-date version of these instructions. + +## Using A Snakemake Profile for Job Submission + +### Set up a snakemake profile for your cluster + +We recommend using a [snakemake profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) +to enable job submission for the job scheduler used by your cluster. You can start from the cookiecutter +profiles [here](https://github.com/snakemake-profiles/doc) or write your own. + +### Direct dammit to use the snakemake profile + +When you'd like dammit to submit jobs to a job scheduler, direct it to use your cluster profile by +adding `--profile ` at or near the end of your dammit command (after all dammit-specific arguments). +Again, see the [snakemake profile documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) for additional information. + diff --git a/doc/configuration.md b/doc/configuration.md new file mode 100644 index 00000000..56f70709 --- /dev/null +++ b/doc/configuration.md @@ -0,0 +1,38 @@ +# Advanced Configuration + +Dammit's overall memory and CPU usage can be specified at the command line. +The [annotation pipelines](pipelines.md) section contains info on the +recommended minimum resources for each pipeline. + + +Dammit can be configured in two ways: + + - providing options on the command line + - providing options within a YAML configuration file. + + +## **`dammit config`** + +``` +Usage: dammit config [OPTIONS] COMMAND [ARGS]... 
+
+  Show dammit configuration information.
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  busco-groups      Lists the available BUSCO group databases.
+  clean-temp        Clear out shared dammit temp files.
+  show-default      Show the selected default configuration file.
+  show-directories  List dammit directory locations.
+```
+
+## Tool-Specific Specification
+
+Tool-specific parameters can be modified via a custom configuration file.
+
+
+
+
diff --git a/doc/dammit-components.md b/doc/dammit-components.md
new file mode 100644
index 00000000..8ff33090
--- /dev/null
+++ b/doc/dammit-components.md
@@ -0,0 +1,62 @@
+# Dammit Components
+
+## **`dammit run`**
+
+`dammit` can run two main workflows, `databases` and `annotate`.
+
+ - **`databases`** handles downloading and preparing the annotation databases. Usage info [here](database-usage.md).
+ > `databases` must be run to properly prepare the databases prior to running annotation
+ - [**`annotate`**](annotate.md) uses these databases for transcriptome annotation. Usage info [here](annotate.md).
+
+### Command line arguments
+The `dammit run` command includes a number of command-line options that can be used with
+either `databases` or `annotate` ("shared arguments"). To view these, run `dammit run --help`. When adding
+these to the command line, they must come _before_ the workflow name.
+
+Thus the command line structure should be:
+
+`dammit run [shared arguments] <workflow> [workflow-specific arguments]`
+
+To see the workflow-specific command-line options, run:
+
+`dammit run databases --help`
+or
+`dammit run annotate --help`
+
+
+## Advanced usage: additional dammit components
+
+Each annotation program run as part of a dammit pipeline produces an
+annotation file with a tool-specific formatting and indexing (0-based or 1-based).
+Dammit includes a set of file conversion utilities that translate each output
+file to a standardized gff3 format, and a utility that combines the standardized
+output into a single set of annotations for output in gff3 and fasta
+(the final outputs of each dammit pipeline).
+
+Each of these conversion utilities can now be run independently, so that they
+can be used outside of a full dammit pipeline run. You may find these components
+useful if you want to run these tools on additional databases not included in the
+dammit pipelines.
+
+**FASTA munging commands:**
+
+ - **`rename-fasta`** Copy a FASTA file and rename the headers.
+ - **`transcriptome-stats`** Compute basic metrics on a transcriptome.
+ - **`annotate-fasta`** Annotate a FASTA file from a GFF3 file.
+
+
+**Filtering commands:**
+
+ - **`best-hits`** Filter query best-hits from a MAF file.
+
+**Conversion commands:**
+
+ - **`maf-to-gff3`** Convert MAF to GFF3.
+ - **`shmlast-to-gff3`** Convert shmlast CSV output to GFF3.
+ - **`hmmscan-to-gff3`** Convert HMMER to GFF3.
+ - **`cmscan-to-gff3`** Convert Infernal's cmscan output to GFF3.
+
+**Transformation commands:**
+
+ - **`merge-gff3`** Merge a collection of GFF3 files.
+ - **`remap-hmmer-coords`** Remap hmmscan coordinates using TransDecoder ORF predictions.
diff --git a/doc/dammit-results.md b/doc/dammit-results.md
new file mode 100644
index 00000000..cc2fca3e
--- /dev/null
+++ b/doc/dammit-results.md
@@ -0,0 +1,62 @@
+Dammit Results
+===
+
+## dammit output
+
+After a successful run, you'll have a new directory called `BASENAME.fasta.dammit`. If you look inside, you'll see a lot of files.
For example, for a transcriptome with basename `trinity.nema`, the folder `trinity.nema.fasta.dammit` should contain the following files: + +``` +ls trinity.nema.fasta.dammit/ +``` + +``` + annotate.doit.db trinity.nema.fasta.dammit.namemap.csv trinity.nema.fasta.transdecoder.pep + dammit.log trinity.nema.fasta.dammit.stats.json trinity.nema.fasta.x.nema.reference.prot.faa.crbl.csv + run_trinity.nema.fasta.metazoa.busco.results trinity.nema.fasta.transdecoder.bed trinity.nema.fasta.x.nema.reference.prot.faa.crbl.gff3 + tmp trinity.nema.fasta.transdecoder.cds trinity.nema.fasta.x.nema.reference.prot.faa.crbl.model.csv + trinity.nema.fasta trinity.nema.fasta.transdecoder_dir trinity.nema.fasta.x.nema.reference.prot.faa.crbl.model.plot.pdf + trinity.nema.fasta.dammit.fasta trinity.nema.fasta.transdecoder.gff3 + trinity.nema.fasta.dammit.gff3 trinity.nema.fasta.transdecoder.mRNA +``` + +The two most important files are `trinity.nema.fasta.dammit.fasta` and `trinity.nema.fasta.dammit.gff3`, as they contain the aggregated annotation info per transcript. +`trinity.nema.fasta.dammit.stats.json` also gives summary stats that are quite useful. If you'd like to look into the remaining files, here are the programs that produced each. + +## Parsing the dammit GFF3 + +Dammit provides transcript annotations for mapping against all of the databases in the pipeline you selected +in the final dammit files, `BASENAME.fasta.dammit.fasta` and `BASENAME.fasta.dammit.gff3`. +If you'd like to select certain annotations (e.g. to create an alternative gene-to-transcript map), you can +use python or R to parse the `gff3` results file. If using python, dammit provides a `GFF3Parser` utility to facilitate parsing. + +``` +import pandas as pd +from dammit.fileio.gff3 import GFF3Parser +``` + +``` +gff_file = "nema-trinity.fa.dammit/nema-trinity.fa.dammit.gff3" +annotations = GFF3Parser(filename=gff_file).read() +names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']] +new_file = names.dropna(axis=0,how='all') +new_file.head() +``` + +Try commands like: +``` +annotations.columns +``` + +``` +annotations.head() +``` + +``` +annotations.head(50) +``` + + +## Other tutorials and Workshop materials + +See this workshop [tutorial](https://angus.readthedocs.io/en/2018/dammit_annotation.html)for further practice with using `dammit` for annotating a *de novo* transcriptome assembly. +Please note that the commands used were for a prior version of dammit, but all analysis remains relevant. diff --git a/doc/database-about.md b/doc/database-about.md index bb71e8c9..74d71115 100644 --- a/doc/database-about.md +++ b/doc/database-about.md @@ -2,7 +2,7 @@ title: 'About the Databases' --- -dammit uses the following databases: +dammit can make use of the following databases: 1. [Pfam-A](http://pfam.xfam.org/) @@ -32,29 +32,44 @@ dammit uses the following databases: > domains of life. They are used with an accompanying BUSCO program > which assesses the completeness of a genome, transcriptome, or > list of genes. There are multiple BUSCO databases, and which one - > you use depends on your particular organism. Currently available - > databases are: - > - > 1. Metazoa - > 2. Vertebrata - > 3. Arthropoda - > 4. Eukaryota + > you use depends on your particular organism. > > dammit uses the metazoa database by default, but different > databases can be used with the `--busco-group` parameter. 
You
    > should try to use the database which most closely bounds your
    > organism.

-5. [uniref90](http://www.uniprot.org/help/uniref)
+5. [Swiss-Prot](https://www.uniprot.org/help/about)
+
+   > Swiss-Prot is the manually reviewed and curated non-redundant
+   > protein sequence database. The aim is to provide high-quality
+   > annotations linked to all known information about each protein.
+   > dammit now maps to Swiss-Prot by default.
+
+6. [uniref90](http://www.uniprot.org/help/uniref)

    > uniref is a curated collection of most known proteins, clustered
    > at a 90% similarity threshold. This database is comprehensive, and
    > thus quite enormous. dammit does not include it by default due to
-   > its size, but it can be installed and used with the `--full` flag.
+   > its size, but it can be installed and used with the
+   > `--pipeline full` flag.
+
+7. [NR](http://www.uniprot.org/help/uniref)
+
+   > `nr` is a very large database consisting of both non-curated
+   > and curated database entries. While the name stands for "non-redundant",
+   > this database is no longer non-redundant. Given the time and memory requirements,
+   > NR is only a good choice for species and/or sequences you're unable to confidently
+   > annotate via other databases. It can be installed and used with
+   > the `--pipeline nr` flag.
+
+
+The specific databases used can be selected via the `--pipeline` flag.
+For all available pipelines, see the [pipelines](pipelines.md) section.

-A command using all of these potential options and databases might look
-like:
+To install, for example, all databases required for `full`
+in a custom location (`/path/to/dbs`), you could run the following:

 ```
-dammit databases --install --database-dir /path/to/dbs --full --busco-group arthropoda
+dammit run databases --install --database-dir /path/to/dbs --pipeline full
 ```
diff --git a/doc/database-advanced.md b/doc/database-advanced.md
index 18f505bb..cec58ab9 100644
--- a/doc/database-advanced.md
+++ b/doc/database-advanced.md
@@ -14,7 +14,7 @@ are a few scenarios you might run in to.
    > `--install`, it will find the existing files and prep them if
    > necessary.:
    >
-   > dammit databases --database-dir --install
+   > dammit run databases --database-dir --install

 2. Same as above, but they have different names.

    > For a complete listing of the expected names, just run the
    > `databases` command:
    >
-   > dammit databases
+   > dammit run databases

 3. You have the databases, but they're scattered to the virtual winds.

@@ -60,6 +60,6 @@ lots of hard drive space, you can just say "to hell with it!" and
 reinstall everything with:

 ```
-dammit databases --install
+dammit run databases --install
 ```

diff --git a/doc/database-usage.md b/doc/database-usage.md
index 88abf2e1..de791eb7 100644
--- a/doc/database-usage.md
+++ b/doc/database-usage.md
@@ -1,42 +1,58 @@
----
-title: 'Database Usage'
----
+# Database Usage

-dammit handles databases under the `dammit databases` subcommand. By
-default, dammit looks for databases in
-`$HOME/.dammit/databases` and will install them there if
-missing. If you have some of the databases already, you can inform
-dammit with the `--database-dir` flag.
+The dammit `databases` component handles downloading and preparing the annotation databases.
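+
+Which databases are checked and installed depends on the
+[pipeline](pipelines.md) you plan to run. As a sketch (the `quick` pipeline
+here is just an example; the sections below cover each option in detail):
+
+```
+# report which databases the quick pipeline still needs, without installing anything
+dammit run --pipeline quick databases
+
+# install the missing databases for that pipeline
+dammit run --pipeline quick databases --install
+```
+
+Databases that are already present are detected and are not downloaded again.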
+## Check and install databases + +By default, dammit downloads databases to your home directory: `$HOME/.dammit/databases` To check for databases in the default location: ``` -dammit databases +dammit run databases +``` +This will tell you what databases still need to be installed to run the default annotation pipeline. + +To install databases: +``` +dammit run databases --install ``` +> Notes: +> +> 1. If you're on an HPC or other system with limited space in your home directory, follow +> the instructions below to specify a custom location. +> +> 2. If you've already downloaded some databases and you'd like to use them with dammit, see the [Advanced Database Handling](database-advanced.md) section. + + +## Custom database locations -To check for them in a custom location, you can either use the -`--database-dir` flag: +If you'd like to store dammit databases elsewhere, there are two ways to specify a custom location, the `--database-dir` flag or the `DAMMIT_DB_DIR` environment variable. +### Using the `--database-dir` flag: + +Check for databases in `/path/to/databases`: ``` dammit databases --database-dir /path/to/databases ``` -or, you can set the `DAMMIT_DB_DIR` environment variable. -The flag will supersede this variable, falling back to the default if -neither is set. For example: - +Install databases in `/path/to/databases`: ``` -export DAMMIT_DB_DIR=/path/to/databases +dammit databases --database-dir /path/to/databases --install ``` -This can also be added to your `$HOME/.bashrc` file to make -it persistent. +### Set up an environment variable -To download and install them into the default directory: +Alternatively, you can set up the `DAMMIT_DB_DIR` environment variable. + +Set up the variable in bash. Execute this the command line to use during a single session, or add this to your `$HOME/.bashrc` file to make it persistent. ``` -dammit databases --install +export DAMMIT_DB_DIR=/path/to/databases ``` +> Note that the `--database-dir` flag (above) will supersede this variable, +> falling back to the default if neither is set. + +When this variable is set up, the standard commands will check for databases in `/path/to/databases` rather than `$HOME/.dammit/databases` For info on the specific databases used in dammit, see [About Databases](database-about.md). diff --git a/doc/dev_notes.md b/doc/dev_notes.md index 3bcce895..ac686a89 100644 --- a/doc/dev_notes.md +++ b/doc/dev_notes.md @@ -1,186 +1,107 @@ -# For `dammit` developers +# Contributing to `dammit` -[dammit!](https://github.com/dib-lab/dammit) +We welcome external contributions to `dammit`! +All interactions around `dammit` must follow the [dib-lab Code of Conduct](http://ivory.idyll.org/lab/coc.html). +Dammit 2.0 was written by [Camille Scott](http://www.camillescott.org/) and [N Tessa Pierce](http://bluegenes.github.io/). +We are not funded to maintain `dammit`, but will endeavor to do so to the best of our abilities. +In particular, we welcome contributions that address bugs reported in the [issue tracker](https://github.com/dib-lab/dammit/issues). +All additions and bugfixes must be properly covered by tests. -## Setting up your local computer for `dammit` devevelopment +Dammit relies on the snakemake workflow software. +To learn snakemake, check out the comprehensive [documentation](https://snakemake.readthedocs.io/en/stable/), and maybe start with snakemake tutorials such as [this one by Titus Brown](https://github.com/ctb/2019-snakemake-ucdavis). 
-We can basically follow the [instructions for travis](https://github.com/dib-lab/dammit/blob/master/.travis.yml), because we're telling [travis](https://travis-ci.org/dib-lab/dammit/) to do the same things we want to do on our local computers. +## Setting up your local computer for `dammit` development -Make sure conda is installed. If not, here are instructions: -``` -wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -bash Miniconda3-latest-Linux-x86_64.sh -b -export PATH="$HOME/miniconda3/bin:$PATH" -``` +Make sure conda is installed and channels are properly set up, as in the [installation instructions](install.md). + +### Set up the dammit code on your local machine -Fork `dammit` repository to your account. Clone this fork to your local computer, then create a dev branch called `testing`: +Fork the `dammit` [repository](https://github.com/dib-lab/dammit) to your GitHub account. +`git clone` this fork to your local computer, then create and name a development branch. +For example, the code below creates a dev branch named `testing`: ``` -git clone https://github.com/username/dammit.git -git remote add upstream https://github.com/dib-lab/dammit.git +git clone https://github.com/YOUR-USERNAME/dammit.git +cd dammit git checkout -b testing git branch ``` Now you are on the `testing` branch. -Keep original repository in the `master` branch. Make sure it is up-to-date periodically by running: +Keep the original repository in the `main` branch, so that you can stay up to date with any changes in the main repo. +To do this, first add a remote called `upstream`, which links to the main dammit repository. ``` -git pull upstream master +git remote add upstream https://github.com/dib-lab/dammit.git ``` -Set up a Python 3 environment to work in: +Then, make sure the `main` branch is up-to-date by periodically running: ``` -conda create -n dammit_dev python=3 -source activate dammit_dev +git pull upstream main ``` -Install dependencies: -``` -conda config --set always_yes yes --set changeps1 no -conda config --add channels defaults -conda config --add channels bioconda -conda config --add channels conda-forge -conda install python numpy pandas numexpr>=2.3.1 khmer>=2.1 sphinx>1.3.1 sphinx_rtd_theme>=0.1.9 pytest pytest-runner doit>=0.29.0 matplotlib shmlast infernal hmmer transdecoder=3.0.1 last busco=3.0.2 parallel bioconductor-seqlogo -python setup.py install -``` -Last line of the output should be: +### Create a development environment + +Create a conda environment with dependencies installed. + +After setting up the code (above), run: ``` -Finished processing dependencies for dammit==1.0rc2 +conda env create -f environment.yml -n dammit_dev +conda activate dammit_dev ``` +Now, install an editable version of `dammit` into the `dammit_dev` environment: -Lastly, install databases (will install in `~/.dammit/databases/`) -``` -dammit databases --install ``` -Output should be: +pip install -e '.' ``` -(dammit_dev) campus-019-072:dammit johnsolk$ dammit databases --install -Unable to revert mtime: /Library/Fonts -# dammit -## a tool for easy de novo transcriptome annotation -by Camille Scott +**What (of the below) do we want to keep?** -**v1.0rc2**, 2018 +### Run Tests -## submodule: databases -### Database Install -#### Info -* Database Directory: /Users/johnsolk/.dammit/databases -* Doit Database: /Users/johnsolk/.dammit/databases/databases.doit.db - - -*All database tasks up-to-date.* - -Nothing to install! 
+To run tests that do not require databases, run ``` -Now you are ready to edit and make changes! - -## To-do for `dammit` - -- [ ] update transdecoder version -- [ ] orthodb version (other database versions?) -- [ ] add swissprot -- [x] change order of conda channels to include conda-forge last -- [ ] update documentation -- [ ] add pipeline for accepting .pep file as input (skips transdecoder, transcriptome stats and BUSCO tasks) - -#### Versioning - -A new version is required when a new version of a database is added or a major change happens that changes the commandline interface. Change the VERSION file when this hapens. - -(Note 11/30/2018: We should make all changes above in the T-do, then move to v1.1) - -## Notes on dammit - -Written by [Camille Scott](http://www.camillescott.org/). See [tutorial](https://angus.readthedocs.io/en/2018/dammit_annotation.html). - -1. Look at [pydoit](http://pydoit.org/index.html) documentation, and [Camille's workshop](https://dib-training.readthedocs.io/en/pub/2016-01-20-pydoit-lr.html) -2. [pypi](https://pypi.org/project/dammit/#history) and [bioconda](https://anaconda.org/bioconda/dammit) (supported method of installation) - +make ci-test ``` -conda config --add channels defaults -conda config --add channels bioconda -conda config --add channels conda-forge +To run longer tests, run: +``` +make long-test ``` -### Architecture: - -#### Take a look at code and tests in the `dammit` directory: - -* The core driver of `dammit` is the `dammit/app.py` file, which sets up the commandline arguments. Everything happens here. If you want to add an argument, this is where it hapens. -* There are two subcommand task-handler files: `annotate.py` and `databases.py` -* Tasks are steps being run, separated into different files. For example, the `hmmer.py` file contains all hmmer tasks. -* The task handler has on logger, pulls from config to figure out where databases are located (all happens in background), some doit stuff happening -* [Decorators](https://realpython.com/primer-on-python-decorators/) transfer the function's `return` into a doit function (e.g. line 59 of shell.py) `import doit_task` then `@doit_task` - -`databases`, 2 pipelines: - - * `quick` - * `full` - -`annotate`, more pipelines: - - * `uniref1` - * `full` - * `nr` - -#### `config.json` - -Can use custom `config.json` file to include different parameters for the programs run by the tasks, e.g. `transdecoder LongOrgs -m 50`, etc. - -#### `parallel.py` - -hmmer, infernal, lastl, - -requires gnu parallel - -(There are instructions for how to runon multi-node hpc, somewhere.) - -#### `ui.py` - -output for user to markdown formatting for copying/pasting into GitHub issue reporting +## Code structure -`generate-test-data-.sh` re-genreates test data adn puts it in proper dirs +- Main command line: `dammit/cli.py` +- command line code for each dammit component: `dammit/components` +- Snakemake-related code: + - main snakefile: `dammit/workflows/dammit.snakefile` + - databases snakemake rules: `databases.snakefile` + - annotate snakemake rules: `annotate.snakefile` + - snakemake wrappers for each included tool: + - `dammit/wrappers` -### TESTS! 
+## Internal Configuration Files -`dammit/tests` + - primary config file: `dammit/config.yml` + - default configuration for all steps run within `dammit run` + - databases config file: `dammit/databases.yml` + - download and file naming info for all databases + - pipeline config file: `dammit/pipelines.yml` + - sets tools and databases used in each pipeline -Run `test_databases.py` yourself, locally (because databases cannot be cached on travis-ci) +## Regenerating test data -* makes sure tasks and pipeline run and produce output, they don't all check expected output. some integration output. -* uses pytest -* set of tests files -* testing pydoit tasks is a pain -* under utils, file to run tasks. give it a list of tasks, it will execute in own directory. -* functions start with 'test', check assertions -* fixtures are a means of setting upa consistent environment before running an individual test, e.g. be in a clean directory. tmpdir will create a randomly name temporary directory. -* make tests for new tasks (Sometimes they will take a long time to run...) -* `test_annotate.py` must be run locally by yourself. -* before pushing release, do both of these -* `make long tests` (assumes environment is already setup) -* [travis-ci](https://travis-ci.org/dib-lab/dammit/) is building the recipe that lives in the repo -* `make-ci-test`, not long and not huge and not requires_datbases +`generate-test-data-.sh` re-generates test data and puts it in proper dirs ## Reviewing a PR **Tests must pass before merging!** * Have there been radical changes? (Are you adding things to handler, maybe time to take a step back and make sure code uses reasonable variable names, tests, etc) -* Does travis build? +* Does github-actions build? * Try to make commit messages somewhat informative If these all seem reasonable to you, approve! -## Fix travis: - -`.travis.yml` - -* make sure conda env uses right Python -* fix conda channel order - ## Bioconda * https://anaconda.org/bioconda/dammit diff --git a/doc/index.md b/doc/index.md index 018eea3a..3f65dd0b 100644 --- a/doc/index.md +++ b/doc/index.md @@ -21,14 +21,15 @@ mean programs with nonfree licenses *or* programs which are overly difficult to install and configure \-- we believe that access is a part of openness. + Details ======= -Authors: Camille Scott +Authors: Camille Scott and N. Tessa Pierce Contact: -GitHub: +GitHub: License: BSD diff --git a/doc/install.md b/doc/install.md index 4f519a1d..2513f01d 100644 --- a/doc/install.md +++ b/doc/install.md @@ -1,35 +1,80 @@ --- -title: Bioconda +title: Install dammit via conda summary: Installing dammit via conda and bioconda. --- +dammit, for now, is officially supported on GNU/Linux systems via +[bioconda](https://bioconda.github.io/index.html). macOS support will be +available via bioconda soon. + +Assuming you already have `conda`, install with: + + conda create -n dammit-env dammit=2 + +Then `conda activate dammit-env` and you're good to go. + +## System Requirements + +For the standard pipeline, dammit needs ~18GB of storage space to store its +prepared databases, plus a few hundred MB per BUSCO database. For the +standard annotation pipeline, we recommend at least 16GB of RAM. This can be +reduced by editing LAST parameters via a custom configuration file (see +the [configuration](configuration.md)) section. + +The `full` pipeline, which uses uniref90, needs several hundred GB of +space and considerable RAM to prepare the databases. 
You'll also want +either a fat internet connection or a big cup of patience to download +uniref. + +For some species, we have found that the amount of RAM required can be proportional to the size of the transcriptome being annotated. + + As of version 1.\*, the recommended and supported installation platform for dammit is via [bioconda](https://anaconda.org/bioconda/dammit), as it greatly simplifies managing dammit's many dependencies. -## Installing (bio)conda +## Install and configure miniconda -If you already -have anaconda installed, proceed to the next step. Otherwise, you can -either follow the instructions from bioconda, or if you're on Ubuntu -(or most GNU/Linux platforms), install it directly into your home folder -with: +If you already have conda (e.g. via miniconda or anaconda) installed, +proceed to the next step. If you're on Ubuntu (or most GNU/Linux platforms), +you can follow these commands to install it directly into your home folder. +If on Mac, please follow the bioconda instructions [here](https://bioconda.github.io/user/install.html). wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh && bash miniconda.sh -b -p $HOME/miniconda - echo 'export PATH="$HOME/miniconda/bin:$PATH"' >> $HOME/.bashrc + source $HOME/miniconda/bin/activate + conda init + +Then, finish conda setup by configuring channels: + + conda config --add channels defaults + conda config --add channels bioconda + conda config --add channels conda-forge -## Installing Dammit +!!! note + + These commands stack, so the highest priority channel here will be `conda-forge`, followed by `bioconda` and then the `defaults` channel. + This is the recommended channel order. You can reorder your channels at any time by reexecuting these `config` commands. + +## Install dammit It's recommended that you use conda environments to separate your packages, though it isn't strictly necessary: - conda create -n dammit python=3 - source activate dammit +First, create a new conda environment and install dammit: + + conda create -n dammit-env python=3 dammit -Now, add the channels and install dammit: +> You should only need to do this once for a given computer system. - conda config --add channels defaults - conda config --add channels conda-forge - conda config --add channels bioconda +## Activate the Environment + +To use the dammit software, you'll need to `activate` the environment: + + conda activate dammit-env + +!!! note + + When you'd like to leave your environment, you can type `conda deactivate` and you will return to the base environment. + Alternatively, the environment will automatically be deactivated if you close your terminal connection. + To reactivate, run `conda activate dammit`. - conda install dammit diff --git a/doc/pipelines.md b/doc/pipelines.md new file mode 100644 index 00000000..89b77e48 --- /dev/null +++ b/doc/pipelines.md @@ -0,0 +1,119 @@ +# Annotation Pipelines + +The **`default`** pipeline is suitable for most purposes. +However, dammit has several alternative workflows that either +reduce the number of databases and tools run (the `quick` pipeline) +or annotate with larger databases, such as UniRef90 ( `full` pipeline). + +> **Note:** To use these pipelines, first run `dammit run databases` to +> make sure the relevant databases are installed. 
+> Then you can proceed with `dammit run annotate`.
+
+
+## Default
+
+By default, `dammit` runs the following:
+
+- **default:**
+  - `busco` quality assessment
+  - `transdecoder` ORF prediction
+  - `shmlast` to any user databases
+  - `hmmscan` to Pfam-A
+  - `cmscan` to Rfam
+  - `LAST` mapping to OrthoDB and Swiss-Prot
+
+The databases used for this pipeline require approximately 18GB of storage space,
+plus a few hundred MB per BUSCO database. We recommend running this pipeline with
+at least 16GB of RAM.
+
+code to run this pipeline:
+
+    dammit run databases --install
+    dammit run annotate
+
+> If specifying a custom location for your databases, add `--database-dir /path/to/dbs`
+
+
+## Alternative annotation pipelines:
+
+### quick pipeline
+
+- **quick (`--pipeline quick`):**
+  - `busco` quality assessment
+  - `transdecoder` ORF prediction
+  - `shmlast` to any user databases
+
+The `quick` pipeline can be used for a minimal annotation run:
+BUSCO quality assessment, ORF prediction with transdecoder, and `shmlast`
+to map to any user databases. While this pipeline may require less database
+space, we still recommend running with 16GB of RAM, especially if mapping
+to a user-provided protein database.
+
+code to run this pipeline:
+
+    dammit run databases --install --pipeline quick
+    dammit run annotate --pipeline quick
+
+> If specifying a custom location for your databases, add `--database-dir /path/to/dbs`
+
+### full pipeline
+
+_warning: time and resource intensive!_
+
+The `full` pipeline starts from the `default` pipeline and adds a mapping
+database, UniRef90.
+
+**UniRef90** is a set of UniProt sequences clustered
+by >=90% sequence identity. UniRef allows searching a larger set of
+sequence records while hiding redundant sequences. See the [UniRef
+documentation](https://www.uniprot.org/help/uniref) for more.
+
+- **full (`--pipeline full`):**
+  - `busco` quality assessment
+  - `transdecoder` ORF prediction
+  - `shmlast` to any user databases
+  - `hmmscan` to Pfam-A
+  - `cmscan` to Rfam
+  - `LAST` mapping to OrthoDB, Swiss-Prot, and **UniRef90**
+
+As of fall 2020, the UniRef90 fasta is 26G (gzipped).
+
+code to run this pipeline:
+
+    dammit run databases --install --pipeline full
+    dammit run annotate --pipeline full
+
+> If specifying a custom location for your databases, add `--database-dir /path/to/dbs`
+
+### nr pipeline
+
+_warning: REALLY time and resource intensive!_
+
+**nr** is a very large database consisting of both non-curated
+and curated database entries. While the name stands for "non-redundant",
+this database is no longer non-redundant. Given the time and memory requirements,
+nr is only a good choice for species and/or sequences you're unable to confidently
+annotate via other databases.
+
+- **nr (`--pipeline nr`):**
+  - `busco` quality assessment
+  - `transdecoder` ORF prediction
+  - `shmlast` to any user databases
+  - `hmmscan` to Pfam-A
+  - `cmscan` to Rfam
+  - `LAST` mapping to OrthoDB, Swiss-Prot, and **nr**
+
+As of fall 2020, the nr fasta is 81G (gzipped).
+
+code to run this pipeline:
+
+    dammit run databases --install --pipeline nr
+    dammit run annotate --pipeline nr
+
+> If specifying a custom location for your databases, add `--database-dir /path/to/dbs`
+
+**Note:** Since all of these pipelines use a core set of tools, and since dammit uses `snakemake`
+to keep track of the files that have been run, dammit will not rerun the core tools if you decide
+to alter the pipeline you're running.
So for example, you could start by running a `quick`
+run, and later run `default` if desired. In that case, `dammit` would run only the new annotation
+steps, and reintegrate the relevant outputs into new `dammit` gff3 and annotated fasta files.
diff --git a/doc/quickstart.md b/doc/quickstart.md
new file mode 100644
index 00000000..c9516c92
--- /dev/null
+++ b/doc/quickstart.md
@@ -0,0 +1,110 @@
+# Quickstart Tutorial
+
+Once you have dammit [installed](install.md), you'll need to download
+and prepare databases before you can annotate a transcriptome. This
+quickstart takes you through database preparation (with `dammit run databases`).
+These will take a while to prepare, but are the same databases you'll need
+for most annotation runs. Once those are in place, we'll run an annotation
+using a small sample dataset.
+
+## Check and prepare databases
+
+Here, we'll install the main databases, as well as the
+`eukaryota` BUSCO database for our test yeast dataset (below). This could
+take a while, so consider walking away and getting yourself a cup of
+coffee.
+
+By default, dammit downloads databases into your home directory,
+following the XDG specification: `$HOME/.local/share/dammit/databases`.
+
+
+!!! note
+
+    If you're on an HPC or other system with limited space in your home directory,
+    or if you've already downloaded some databases and you'd like to use them with dammit,
+    see the [Database Usage](database-usage.md) section to specify a custom location.
+
+If you installed dammit into a virtual environment, be sure to
+activate it first:
+```
+conda activate dammit
+```
+
+Now install databases:
+```
+dammit run databases --install
+```
+
+While the initial download takes a while, once it's done, you won't need
+to do it again. `dammit` keeps track of the database state and won't
+repeat work it's already completed, even if you accidentally rerun with
+the `--install` flag.
+
+## Download Annotation Test Data
+
+First let's download some test data. We'll start small and use a
+*Schizosaccharomyces pombe* transcriptome. Make a working directory and
+move there, and then download the files:
+
+```
+mkdir dammit_test
+cd dammit_test
+wget ftp://ftp.ebi.ac.uk/pub/databases/pombase/OLD/20170322/FASTA/cdna_nointrons_utrs.fa.gz
+wget ftp://ftp.ebi.ac.uk/pub/databases/pombase/OLD/20170322/FASTA/pep.fa.gz
+```
+
+Decompress the files with gunzip:
+
+```
+gunzip cdna_nointrons_utrs.fa.gz pep.fa.gz
+```
+
+## Just annotate it, dammit!
+
+With the default databases installed and sample data in hand, we can do a simple run of
+the annotator. We'll use `pep.fa` as a user database; this is a toy example,
+seeing as these proteins came from the same set of transcripts as we're
+annotating, but they illustrate the usage nicely enough. We'll also
+specify a non-default BUSCO group (eukaryota). You can replace the argument to
+`--n-threads` with however many cores are available on your system in
+order to speed it up:
+
+```
+dammit run --n-threads 1 annotate cdna_nointrons_utrs.fa --user-database pep.fa --busco-group eukaryota
+```
+
+This will take a bit, so go get another cup of coffee...
+
+!!! note
+
+    By default, `--n-threads` will correspond to the number of physical cores on the given CPU. In
+    many HPC environments, you'll need to explicitly set `--n-threads` to the number of cores
+    you've asked the scheduler for.
+
+    Also! `--n-threads` comes *after* `run` but *before* `annotate`, because it's shared with the
+    `databases` command.
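+
+For example, giving this run four threads looks like the sketch below (adjust
+the count to the cores you actually have, or have requested from your scheduler):
+
+```
+# shared options such as --n-threads go after `run` and before `annotate`
+dammit run --n-threads 4 annotate cdna_nointrons_utrs.fa --user-database pep.fa
+```
+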
+ +For more information and options on the `annotate` command, see [annotate usage](annotate.md). + +## Annotation Output + +After a successful run, you'll have a new directory called `[BASENAME].dammit` in this case, `cdna_nointrons_utrs.dammit`. +If you look inside, you'll see a lot of files. + +``` +ls trinity.nema.fasta.dammit/ +``` +``` + annotate.doit.db trinity.nema.fasta.dammit.namemap.csv trinity.nema.fasta.transdecoder.pep + dammit.log trinity.nema.fasta.dammit.stats.json trinity.nema.fasta.x.nema.reference.prot.faa.crbl.csv + run_trinity.nema.fasta.metazoa.busco.results trinity.nema.fasta.transdecoder.bed trinity.nema.fasta.x.nema.reference.prot.faa.crbl.gff3 + tmp trinity.nema.fasta.transdecoder.cds trinity.nema.fasta.x.nema.reference.prot.faa.crbl.model.csv + trinity.nema.fasta trinity.nema.fasta.transdecoder_dir trinity.nema.fasta.x.nema.reference.prot.faa.crbl.model.plot.pdf + trinity.nema.fasta.dammit.fasta trinity.nema.fasta.transdecoder.gff3 + trinity.nema.fasta.dammit.gff3 trinity.nema.fasta.transdecoder.mRNA +``` + +The two most important files are `trinity.nema.fasta.dammit.fasta` and `trinity.nema.fasta.dammit.gff3`, as they contain the aggregated annotation info per transcript. +`trinity.nema.fasta.dammit.stats.json` also gives summary stats that are quite useful. + +For more information on the results, see [dammit results](dammit-results.md). diff --git a/doc/tutorial.md b/doc/tutorial.md index 52bea96e..c1d7d318 100644 --- a/doc/tutorial.md +++ b/doc/tutorial.md @@ -1,10 +1,11 @@ -# Tutorial +--- +title: Just Annotate it, dammit! +--- Once you have the dependencies installed, it's time to actually annotate something! This guide will take you through a short example on some test data. -See this workshop [tutorial](https://angus.readthedocs.io/en/2018/dammit_annotation.html) for further practice with using `dammit` for annotating a *de novo* transcriptome assembly. ## Data @@ -62,3 +63,8 @@ dammit annotate cdna_nointrons_utrs.fa --user-databases pep.fa --busco-group euk ``` This will take a bit, so go get another cup of coffee... + +## Other tutorials and Workshop materials + +See this workshop [tutorial](https://angus.readthedocs.io/en/2018/dammit_annotation.html) for further practice with using `dammit` for annotating a *de novo* transcriptome assembly. 
+ diff --git a/mkdocs.yml b/mkdocs.yml index b6cb5182..8261ce8f 100755 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -38,17 +38,31 @@ extra_css: # give a title for each page nav: - 'Home': 'index.md' - - 'About': 'about.md' - - 'Installation': - - 'Requirements': 'system_requirements.md' - - 'Bioconda': 'install.md' - - - 'Databases': - - 'Basic Usage': 'database-usage.md' - - 'About the Databases': 'database-about.md' - - 'Advanced Database Handling': 'database-advanced.md' - - - 'Tutorial': 'tutorial.md' - + - 'About dammit': 'about.md' + - 'Installation': 'install.md' + - 'Quickstart Tutorial': quickstart.md + - 'Understanding output': + - 'Use and parse results': 'dammit-results.md' + - 'Details and Configuration': + - 'Components': 'dammit-components.md' + - 'Databases': + - 'Database Usage': 'database-usage.md' + - 'Database Info': 'database-about.md' + - 'Advanced Database Setup': 'database-advanced.md' + - 'Annotate': + - 'Annotate Usage': 'annotate.md' + - 'Annotation Pipelines': 'pipelines.md' + + #- 'Full Tutorial': 'tutorial.md' + - 'Advanced Usage and Configuration': + - 'Advanced Configuration': 'configuration.md' + - 'Distributing dammit jobs on a Cluster': cluster.md + - 'For developers': - 'Notes for developers': 'dev_notes.md' + +markdown_extensions: + - toc: + permalink:  + - admonition + - def_list