[WIP] docs2.0 #197

Open: wants to merge 28 commits into base `v2_staging`
6 changes: 3 additions & 3 deletions README.md
@@ -1,4 +1,4 @@
[![image](https://travis-ci.org/dib-lab/dammit.svg)](https://travis-ci.org/dib-lab/dammit)
![tests](https://github.com/github/docs/actions/workflows/tests.yml/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/dammit/badge/)](http://dammit.readthedocs.io/en/latest)

*"I love writing BLAST parsers!" -- no one, ever*
@@ -19,11 +19,11 @@ Install dammit with (bio)conda:

Download and install a subset of the databases:

dammit databases --install --quick
dammit run --pipeline quick databases --install

And then annotate with:

dammit annotate <transcriptome_fasta>
dammit run --pipeline quick annotate <transcriptome_fasta>

Head over to the [docs](http://dib-lab.github.io/dammit/) for much more detailed
information!
102 changes: 54 additions & 48 deletions doc/about.md
@@ -2,7 +2,7 @@

This page goes a little more in depth on the software and its goals.

## Motivations
## Background and Motivations

Several different factors motivated dammit's development. The first of
these was the sea lamprey transcriptome project, which had annotation as
@@ -24,80 +24,86 @@ Implicit to these motivations is some idea of what a good annotator
5. It should be relatively fast
6. It should try to be correct, insofar as any computational approach
can be "correct"
7. It should give the user some measure of confidence for its results.

## The Obligatory Flowchart

![The Workflow](static/workflow.svg)

## Software Used

- TransDecoder
- BUSCO
- HMMER
- Infernal
- LAST
- crb-blast (for now)
- pydoit (under the hood)

All of these are Free Software, as in freedom and beer.

## Databases
## Databases Used

- Pfam-A
- Rfam
- OrthoDB
- Swiss-Prot
- BUSCO databases
- Uniref90
- User-supplied protein databases

The last one is important, and sometimes ignored.
Dammit uses an approach similar to Conditional Reciprocal Best Blast
to map to user-supplied protein databases (details below).

## Conditional Reciprocal Best LAST
For more about the included databases,
see the [About Databases](database-about.md) section.

Building off Richard and colleagues' work on Conditional Reciprocal Best
BLAST, I've implemented a new version with Python and LAST -- CRBL.
The original lives [here](https://github.com/cboursnell/crb-blast).

Why??
## Software Used

- BLAST is too slooooooow
- Ruby is yet another dependency to have users install
- With Python and scikit learn, I have freedom to toy with models (and
learn stuff)
The specific set of software and databases used can be modified by specifying different [pipelines](pipelines.md).
The full set of software that can be run is:

And, of course, some of these databases are BIG. Doing `blastx` and
`tblastn` between a reasonably sized transcriptome and Uniref90 is not
an experience you want to have.
- TransDecoder
- BUSCO
- HMMER
- Infernal
- LAST
- shmlast (for crb-blast to user-supplied protein databases)

i.e., practical concerns.
All of these are Free Software, as in freedom and beer.

## A brief intro to CRBB
### shmlast: Conditional Reciprocal Best LAST for mapping to user databases

- Reciprocal Best Hits (RBH) is a standard method for ortholog
detection
- Transcriptomes have multiple transcript isoforms, which
confound RBH
- CRBB uses machine learning to get at this problem
Reciprocal Best Hit mapping (RBH) is a standard method for ortholog detection.
However, transcriptomes have multiple transcript isoforms, which confound RBH.

![](static/RBH.svg)

CRBB attempts to associate those isoforms with appropriate annotations
by learning an appropriate e-value cutoff for different transcript
lengths.
**Conditional Reciprocal Best Blast (CRBB)** attempts to associate those isoforms
with appropriate annotations by learning an appropriate e-value cutoff for
different transcript lengths. The original implementation of CRBB
can be found [here](https://github.com/cboursnell/crb-blast).

![CRBB](static/CRBB_decision.png)

*from
http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365#s5*

## CRBL

For CRBL, instead of fitting a linear model, we train a model.
*from [Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365)*

- SVM
- Naive bayes
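To make the cutoff-learning idea concrete, here is a toy sketch in Python. It is not dammit's actual model: real CRBB/CRBL fit a model (linear fit, SVM, naive Bayes) over all training hits, while this sketch simply bins transcripts by length and takes the weakest trusted reciprocal-best-hit score in each bin as that bin's cutoff.

```python
import math
from collections import defaultdict

def learn_cutoffs(rbh_hits, bin_width=100):
    """Learn a per-length-bin e-value cutoff from reciprocal best hits.

    rbh_hits: iterable of (transcript_length, evalue) pairs from trusted RBHs.
    Returns {length_bin: minimum acceptable -log10(evalue)}.
    """
    bins = defaultdict(list)
    for length, evalue in rbh_hits:
        # work in -log10(evalue) space; cap zero e-values at a large score
        score = 300.0 if evalue == 0 else -math.log10(evalue)
        bins[length // bin_width].append(score)
    # cutoff per bin: the weakest score seen among the trusted RBHs
    return {b: min(scores) for b, scores in bins.items()}

def passes(cutoffs, length, evalue, bin_width=100):
    """Accept a non-RBH hit if it beats its length bin's learned cutoff."""
    score = 300.0 if evalue == 0 else -math.log10(evalue)
    cutoff = cutoffs.get(length // bin_width)
    return cutoff is not None and score >= cutoff

# toy training data: longer transcripts tend to support stronger e-values
training = [(150, 1e-5), (180, 1e-6), (450, 1e-20), (480, 1e-25)]
cutoffs = learn_cutoffs(training)
print(passes(cutoffs, 160, 1e-7))   # True: beats the 100-199 bin cutoff
print(passes(cutoffs, 460, 1e-10))  # False: too weak for the 400-499 bin
```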
**shmlast** is a reimplementation of the Conditional Reciprocal Best Hits
algorithm for finding potential orthologs between a transcriptome and
a species-specific protein database. It uses the LAST aligner and the
pydata stack to achieve much better performance while staying in the
Python ecosystem.

One limitation is that LAST has no equivalent to `tblastn`. So, we find
the RBHs using the TransDecoder ORFs, and then use the model on the
translated transcriptome versus database hits.

`shmlast` is published in JOSS, doi:[10.21105/joss.00142](https://joss.theoj.org/papers/10.21105/joss.00142).
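The reciprocal-best-hit logic at the heart of CRBB can be sketched in a few lines of Python. This is a self-contained illustration of the concept, not shmlast's actual implementation:

```python
def best_hits(hits):
    """hits: iterable of (query, target, evalue); keep the best target per query."""
    best = {}
    for query, target, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue)
    return {q: t for q, (t, _) in best.items()}

def reciprocal_best_hits(fwd_hits, rev_hits):
    """Pairs (a, b) where b is a's best hit AND a is b's best hit."""
    fwd = best_hits(fwd_hits)
    rev = best_hits(rev_hits)
    return {(a, b) for a, b in fwd.items() if rev.get(b) == a}

# toy alignments: transcriptome -> proteins, and proteins -> transcriptome
fwd = [("tx1", "protA", 1e-30), ("tx1", "protB", 1e-5), ("tx2", "protB", 1e-12)]
rev = [("protA", "tx1", 1e-28), ("protB", "tx3", 1e-15), ("protB", "tx2", 1e-14)]
print(reciprocal_best_hits(fwd, rev))  # {('tx1', 'protA')}
```

Note that tx2 is dropped even though protB is its best hit, because protB's best hit points elsewhere; this is exactly the strictness that isoforms run afoul of, and that the conditional (learned-cutoff) step relaxes.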


## The Dammit Software

dammit is built on the [Snakemake](https://snakemake.readthedocs.io/en/stable/)
workflow management system. This means that the dammit pipeline enjoys all the features of any
Snakemake workflow: reproducibility, ability to resume, cluster support, and per-task environment
management. Each step in dammit's pipeline(s) is implemented as a Snakemake
[wrapper](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#wrappers);
when dammit is executed, it generates the targets for the pipeline being run as specified in its
pipelines file and passes them along to the Snakemake executable. The dammit frontend simplifies
the interface for the user and constrains the inputs and options to ensure the pipeline
will always run correctly.
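As a purely hypothetical illustration of that target-generation step, the mapping from a pipeline name to Snakemake targets might look like the following. The analysis names and the `.done`-file layout here are invented for illustration and are not dammit's actual pipelines file:

```python
# Hypothetical sketch of pipeline -> target resolution. The step names and
# output layout are illustrative assumptions, not dammit's real config.
PIPELINES = {
    "quick": ["transdecoder", "busco", "shmlast"],
    "default": ["transdecoder", "busco", "shmlast",
                "hmmer_pfam", "infernal_rfam", "last_orthodb"],
}

def targets_for(pipeline, transcriptome):
    """Expand a pipeline name into the per-step sentinel files Snakemake builds."""
    stem = transcriptome.rsplit(".", 1)[0]
    return [f"{stem}.dammit/{step}.done" for step in PIPELINES[pipeline]]

print(targets_for("quick", "TRANSCRIPTOME.fasta"))
```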

One of the essential, and most annoying, parts of annotation is the conversion and collation
of information from many different file formats. Dammit includes a suite of minimal command
line utilities implementing a number of these things, including converting several formats
to GFF3, merging GFF3 files, and filtering alignment results for best hits. More details on
these utilities can be found in the [components](dammit-components.md) section.
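To see what such a conversion involves, here is a minimal sketch of turning one alignment hit into a GFF3 feature line, i.e. assembling the nine tab-separated columns. The field choices (feature type, putting the e-value in the score column) are illustrative assumptions, not dammit's exact conventions:

```python
def hit_to_gff3(seqid, source, start, end, evalue, target):
    """Format one alignment hit as a GFF3 feature line (1-based, inclusive).

    Columns: seqid, source, type, start, end, score, strand, phase, attributes.
    """
    attrs = f"Target={target};evalue={evalue:g}"
    cols = [seqid, source, "protein_match", str(start), str(end),
            f"{evalue:g}", "+", ".", attrs]
    return "\t".join(cols)

print(hit_to_gff3("tx1", "LAST", 12, 311, 1e-30, "protA"))
```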

147 changes: 147 additions & 0 deletions doc/annotate.md
@@ -0,0 +1,147 @@
# Annotate

The dammit `annotate` component uses the installed databases for transcriptome annotation.

## Just annotate it, dammit!

After you've properly installed the [databases](database-usage.md), you can start the annotation.
To run the annotation, you only need to provide a set of transcripts to annotate.


    dammit run annotate TRANSCRIPTOME.fasta


Optionally, allow `dammit` to use additional threads with the `--n-threads` option:

    dammit run --n-threads 4 annotate TRANSCRIPTOME.fasta


If you'd like to customize the output or other parameters such as the e-value for similarity searches,
you can provide customization on the command line or in a configuration file.


## Additional Usage info

To see the general dammit usage information, run:

    dammit run --help

You should see the following:

```
Usage: dammit run [OPTIONS] COMMAND [ARGS]...

Run the annotation pipeline or install databases.

Options:
--database-dir TEXT Directory to store databases. Existing
databases will not be overwritten.

--conda-dir TEXT Directory to store snakemake-created conda
environments.

--temp-dir TEXT Directory to store dammit temp files.
--busco-group [bacteria_odb10|acidobacteria_odb10|actinobacteria_phylum_odb10|actinobacteria_class_odb10|corynebacteriales_odb10|...]
BUSCO group(s) to use/install.
--n-threads INTEGER Number of threads for overall workflow
execution

--max-threads-per-task INTEGER Max threads to use for a single step.
--busco-config-file TEXT Path to an alternative BUSCO config file;
otherwise, BUSCO will attempt to use its
default installation which will likely only
work on bioconda. Advanced use only!

--pipeline [default|quick|full|nr]
Which pipeline to use. Pipeline options:
quick: excludes: the Infernal Rfam tasks,
the HMMER Pfam tasks, and the LAST OrthoDB
and uniref90 tasks. Best for users just
looking to get basic stats and conditional
reciprocal best LAST from a protein
database. full: Run a "complete"
annotation; includes uniref90, which is left
out of the default pipeline because it is
huge and homology searches take a long time.
nr: Also include annotation to NR database,
which is left out of the default and "full"
pipelines because it is huge and homology
searches take a long time. More info at
https://dib-lab.github.io/dammit.

--help Show this message and exit.

Commands:
annotate The main annotation pipeline.
databases The database preparation pipeline.
```

The `--pipeline` option can be used to switch the set of databases being used for annotation.
See the [annotation pipelines](pipelines.md) doc for info about each specific pipeline.
Note that these pipelines all run a core set of programs. If you re-run an annotation with a
larger pipeline, dammit will not re-run analyses that have already been completed. Instead,
dammit will run any new analyses, and integrate them into the final fasta and gff3.

To see annotation-specific configuration info, run:

    dammit run annotate --help

You should see the following:

```
Usage: dammit run annotate [OPTIONS] TRANSCRIPTOME [EXTRA_SNAKEMAKE_ARGS]...

The main annotation pipeline. Calculates assembly stats; runs BUSCO; runs
LAST against OrthoDB (and optionally uniref90), HMMER against Pfam,
Infernal against Rfam, and Conditional Reciprocal Best-hit Blast against
user databases; and aggregates all results in a properly formatted GFF3
file.

Options:
-n, --base-name TEXT Base name to use for renaming the input
transcripts. The new names will be of the form
<name>_<X>. It should not have spaces, pipes,
ampersands, or other characters with special
meaning to BASH. Superseded by --regex-rename.

--regex-rename TEXT Rename transcripts using a regex pattern. The
regex should follow Python `re` format and
contain a named field keyed as `name` that
extracts the desired string. For example,
providing "(?P<name>^[a-zA-Z0-9\.]+)" will match
from the beginning of the sequence header up to
the first symbol that is not alphanumeric or a
period. Supersedes --base-name.

--rename / --no-rename If --no-rename, original transcript names are
preserved in the final annotated FASTA. --base-
name is still used in intermediate files. If
--rename (the default behavior), the renamed
transcript names are used in the final annotated
FASTA.

-e, --global-evalue FLOAT global e-value cutoff for similarity searches.
-o, --output-dir TEXT Output directory. By default this will be the
name of the transcriptome file with `.dammit`
appended

-u, --user-database TEXT Optional additional protein databases. These
will be searched with CRB-blast.

--dry-run
--help Show this message and exit.
```
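The `--regex-rename` pattern shown in the help text can be tried directly with Python's `re` module (the example headers are made up):

```python
import re

# pattern from the help text: capture everything from the start of the header
# up to the first character that is not alphanumeric or a period, into a
# named group called `name`
pattern = re.compile(r"(?P<name>^[a-zA-Z0-9\.]+)")

for header in ["Transcript.12345 len=1385", "TRINITY_DN1000_c0_g1_i1"]:
    print(pattern.match(header).group("name"))
# Transcript.12345
# TRINITY
```

Note how the underscore in the second header stops the match, so a Trinity-style name would be truncated; pick a pattern that matches your assembler's naming scheme.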

Add these options as needed. For example, annotate against a user database and specify an
output directory name like so:

    dammit run annotate --user-database DB-FILE --output-dir dammit-results TRANSCRIPTOME.fasta

General run arguments need to be added in front of `annotate`, e.g.:

    dammit run --n-threads 4 --pipeline quick annotate --user-database DB-FILE --output-dir dammit-results TRANSCRIPTOME.fasta




21 changes: 21 additions & 0 deletions doc/cluster.md
@@ -0,0 +1,21 @@
# Distributing dammit jobs across a cluster

`dammit` can run on a single compute instance, or can submit each individual job to a job scheduler,
if you provide the right submission information for your cluster. Job submission is handled
via snakemake, so please see the [snakemake cluster documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html)
for the most up-to-date version of these instructions.

## Using A Snakemake Profile for Job Submission

### Set up a snakemake profile for your cluster

We recommend using a [snakemake profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles)
to enable job submission for the job scheduler used by your cluster. You can start from the cookiecutter
profiles [here](https://github.com/snakemake-profiles/doc) or write your own.

### Direct dammit to use the snakemake profile

When you'd like dammit to submit jobs to a job scheduler, direct it to use your cluster profile by
adding `--profile <profile-folder-name>` at or near the end of your dammit command (after all dammit-specific arguments).
Again, see the [snakemake profile documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) for additional information.
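For orientation, a minimal profile `config.yaml` for a SLURM cluster might look like the following. The values are illustrative assumptions, and the `cluster` submission string must be adapted to your scheduler and site:

```yaml
# <profile-folder-name>/config.yaml -- illustrative only
cluster: "sbatch --time={resources.time} --mem={resources.mem_mb} -c {threads}"
jobs: 16
latency-wait: 60
```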

38 changes: 38 additions & 0 deletions doc/configuration.md
@@ -0,0 +1,38 @@
# Advanced Configuration

Dammit's overall memory and CPU usage can be specified at the command line.
The [annotation pipelines](pipelines.md) section contains info on the
recommended minimum resources for each pipeline.


Dammit can be configured in two ways:

- providing options on the command line
- providing options within a YAML configuration file.


## **`dammit config`**

```
Usage: dammit config [OPTIONS] COMMAND [ARGS]...

Show dammit configuration information.

Options:
--help Show this message and exit.

Commands:
busco-groups Lists the available BUSCO group databases.
clean-temp Clear out shared dammit temp files.
show-default Show the selected default configuration file.
show-directories List dammit directory locations.
```

## Tool-Specific Configuration

Tool-specific parameters can be modified via a custom configuration file.
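As a hypothetical sketch only, such a file might override per-tool parameters like this. The tool names and keys here are assumptions; run `dammit config show-default` to see the real keys and defaults for your version:

```yaml
# custom-config.yml -- illustrative structure, not dammit's actual schema
busco:
  params: "--limit 3"
hmmsearch:
  params: "--domE 1e-5"
```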




