Just DNA-Seq is a DNA-Seq analysis pipeline for personal and public genomes, with plugins and scripts built on top of it.
Although genome sequencing used to be expensive, today it is accessible to almost anyone and costs roughly 400-800 dollars. There are multiple proprietary sequencing and analysis services; however, their results are often based on proprietary databases and algorithms, and their predictions are often non-transparent.
The Just DNA-Seq project was created primarily for transparency: we wanted to understand what is happening under the hood. We also wanted to use the latest versions of the tools, as we discovered that, for example, Dante Labs was using outdated versions of the reference genome and GATK.
The project consists of multiple pipelines and scripts that can be used either separately or all together. Recently we started working on longevity applications, as there are no good tools or plugins for annotating longevity-associated genetic variants.
In the project we use WDL (Workflow Description Language) pipelines as well as the OpenCravat variant annotation system. If you want to run the whole pipeline, make sure you have at least 500GB of free disk space and at least 16GB of RAM. All tools are dockerized, so make sure Docker is installed.
If genetic pipelines are new to you, it can be useful to watch the Broad Institute video introduction, which explains WDL, Cromwell, and DNA-Seq pipelines. Even though we do not use Broad's GATK pipeline and mix our tools a bit differently (for example, we use DeepVariant for variant calling), the video explains some useful concepts in genomic analysis. For users with only high-school knowledge of biology, I would also recommend taking any free Biology 101 or Genetics 101 course ( https://www.edx.org/course/introduction-to-biology-the-secret-of-life-3 is a good example).
For gene annotations we use OpenCravat as well as VEP (as an alternative solution). OpenCravat is included in the conda environment.
Annotation modules and DVC are included in the conda environment that ships with the project. The environment can be set up either with Anaconda or micromamba (a faster, lighter alternative to conda). Here are the installation instructions for micromamba:
wget -qO- https://micromamba.snakepit.net/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
We can use ./bin/micromamba shell init ... to initialize the shell (.bashrc) and a new root environment in ~/micromamba:
./bin/micromamba shell init -s bash -p ~/micromamba
source ~/.bashrc
To create a micromamba environment use:
micromamba create -f environment.yaml
micromamba activate gwas
The instructions above are for Linux and macOS (note: on macOS you have to install wget first). On Windows you can either install the Windows Subsystem for Linux or use the Windows version of Anaconda.
DVC is used for data management: it downloads annotations and can also be used to run some useful scripts. DVC is included in the gwas conda environment described in the environment.yaml file.
dvc.yaml contains the tasks required to set up the project. For instance, to download the reference genome, type:
dvc repro prepare_genome
Of course, you can try to download all the data with:
dvc repro
However, it may take quite a while, as ensembl_vep_cache (which is required for VEP annotations) is over 14GB, and OpenCravat alone may be enough for your needs. In the future we plan to focus on OpenCravat, keeping VEP as a legacy annotation system.
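For orientation, a stage in dvc.yaml typically looks like the sketch below. This is an illustrative fragment, not a copy of the stage in this repo's dvc.yaml; the script name is invented and the output path follows the folder layout described later in this README:

```yaml
# Hypothetical sketch of a stage similar to prepare_genome;
# the real dvc.yaml in the repo is the source of truth.
stages:
  prepare_genome:
    cmd: bash scripts/download_genome.sh   # illustrative script name
    outs:
      - data/ensembl/103/species/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa
```

Running `dvc repro <stage>` executes the stage's `cmd` and tracks the files listed under `outs`, so repeated runs skip work that is already done.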
To run the Cromwell server together with cromwell-client and MySQL, start the services with Docker Compose. If you do not have Docker installed, you can either install it yourself or use the ubuntu_Script in the bin folder. To run the pipelines, I recommend trying cromwell-client (deployed at port 8001 by default). The services can be started with:
docker compose up
The docker-compose configuration assumes the following folder layout:
./data/databases/mysql folder for cromwell mysql
./data/cromwell-executions for cromwell execution cache
./data/cromwell-workflow-logs for cromwell execution logs
If you use a different folder layout, you have to change docker-compose.yml and config/cromwell/application.conf accordingly.
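As a rough sketch, the volume mappings behind the layout above would look something like the fragment below. Service names, images, and container-side paths here are illustrative assumptions; check the actual docker-compose.yml in the repo:

```yaml
# Illustrative fragment only; the repo's docker-compose.yml is authoritative.
services:
  mysql:
    image: mysql:8
    volumes:
      - ./data/databases/mysql:/var/lib/mysql
  cromwell:
    image: broadinstitute/cromwell:latest
    volumes:
      - ./data/cromwell-executions:/cromwell-executions
      - ./data/cromwell-workflow-logs:/cromwell-workflow-logs
      - ./config/cromwell/application.conf:/application.conf
```

The point of the host-side `./data/...` mounts is that execution caches and logs survive container restarts, which is why changing the layout also requires editing the compose file.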
The pipeline is in the dna-seq-pipeline folder. It also actively uses WDL tasks and subpipelines from https://github.com/antonkulaga/bioworkflows. There, dna_seq_pipeline.wdl is the main workflow; all others should be provided as dependencies. The pipeline uses DeepVariant, Strelka2 and Smoove for variant calling and VEP for variant annotations. It is also possible to run the dependencies as separate workflows.
Example JSON inputs are provided with the parameters that I used to process my own genome. In the examples I used the following structure, which is the same as in the ./data subfolder of this repo; feel free to modify it according to the locations of your files:
- /data/gwas/anton - person's folder with:
- /data/gwas/anton/fastq - INPUT folder
- REFERENCE genome (downloaded from latest Ensembl release):
- /data/ensembl/103/species/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa
- /data/ensembl/103/species/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai
- OUTPUT folders created by the pipeline (no need to create them yourself; the pipeline creates them when it runs):
- /data/gwas/anton/aligned - output of aligned data
- /data/gwas/anton/variants - output for variants
- /data/gwas/anton/vep - vep annotations
- ANNOTATION reference files (used only by vep_annotations.wdl):
- /data/gwas/references/annotations - reference files for genetic variant annotations
- /data/ensembl/103/plugins - git-cloned and renamed https://github.com/Ensembl/VEP_plugins (note: 103 is the Ensembl release number)
- /data/ensembl/103/cache - folder to download Ensembl cache ( https://m.ensembl.org/info/docs/tools/vep/script/vep_cache.html )
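To give a feel for how this layout maps onto workflow parameters, an inputs file might look like the hypothetical sketch below. The input key names are invented for illustration; use the example JSON files shipped with the repo for the real parameter names:

```json
{
  "dna_seq_pipeline.fastq_1": "/data/gwas/anton/fastq/sample_R1.fastq.gz",
  "dna_seq_pipeline.fastq_2": "/data/gwas/anton/fastq/sample_R2.fastq.gz",
  "dna_seq_pipeline.reference": "/data/ensembl/103/species/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa",
  "dna_seq_pipeline.reference_fai": "/data/ensembl/103/species/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai",
  "dna_seq_pipeline.destination": "/data/gwas/anton"
}
```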
I tested the pipeline on my personal genome sequenced by Dante Labs. If you do not have your own genome at your disposal, you can try any of the public ones; for example, you can download WGS fastq files from https://www.personalgenomes.org.uk/data/. For a quick test of all the tools, consider using small test fastq files (an example of such a test is in test.json). If your input is a BAM file, we provide a bam_to_fastq pipeline to extract fastq files from it.
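If you are curious what BAM-to-FASTQ conversion involves conceptually, it can be sketched with samtools as below. All file names are placeholders, and the actual bam_to_fastq pipeline may use different tools or flags:

```shell
# Name-sort the BAM so paired reads are adjacent, then split into paired FASTQs.
# input.bam, namesorted.bam and the read file names are placeholders.
samtools sort -n -@ 4 -o namesorted.bam input.bam
samtools fastq -@ 4 \
  -1 reads_1.fastq.gz -2 reads_2.fastq.gz \
  -0 /dev/null -s /dev/null \
  namesorted.bam
```

Name-sorting first matters because `samtools fastq` pairs reads by adjacency; a coordinate-sorted BAM would produce mispaired output.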
There are three major ways of running any of the pipelines in the repository:
- with CromwellClient and Cromwell in server mode (recommended). Note: when using pipelines with multiple WDL files, do not forget to upload the subworkflow files as dependencies.
- directly from the Swagger API with Cromwell in server mode: similar to running with CromwellClient, but the Swagger server API is used instead of the client.
- with Cromwell or any other WDL-compatible tool in the console, as documented in the official Cromwell documentation.
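For the console option, a minimal invocation looks like the sketch below. It assumes cromwell.jar has already been downloaded and that the subworkflow WDLs are packaged into a zip; all file names are placeholders:

```shell
# Package the subworkflow WDLs so Cromwell can resolve imports
# (tasks.wdl and subpipeline.wdl are placeholder names),
# then run the main workflow with its inputs.
zip -q deps.zip tasks.wdl subpipeline.wdl
java -jar cromwell.jar run dna_seq_pipeline.wdl \
  --inputs inputs.json \
  --imports deps.zip
```

The `--imports` zip serves the same purpose as uploading dependencies in cromwell-client: without it, Cromwell cannot resolve `import` statements in the main workflow.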
There are two alternative annotation tools: VEP and OpenCravat. VEP is more established but old-fashioned; OpenCravat is newer and more user-friendly. I recommend starting with OpenCravat.
OpenCravat is included in the environment. Before starting, it is recommended to install the annotation modules you are interested in. There is a DVC stage for the default modules:
dvc repro install_opencravat
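Beyond the DVC stage, OpenCravat modules can also be managed and run directly through its `oc` command-line tool. The module names and the VCF path below are examples (the path just follows this README's folder layout), not requirements:

```shell
# Install example annotation modules (pick the ones you actually need);
# -y skips the confirmation prompt.
oc module install -y clinvar dbsnp
# Annotate a VCF against hg38 and produce an Excel report;
# the input path is a placeholder.
oc run /data/gwas/anton/variants/variants.vcf -l hg38 -t excel
```

`oc run` also supports other report types (for example tsv), and the web GUI can be used instead if you prefer a graphical workflow.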
The most important part of the pipeline is the VEP annotations. To make them work you must download the Ensembl cache and the Ensembl plugins. Right now VEP annotations require additional files, so they are provided as a separate pipeline. Currently DVC resolves most of the files and does additional preprocessing with:
dvc repro prepare
The pipeline still requires some technical skills to run; we plan to streamline it and improve ease of use. One of the most important parts is variant filtering and annotation. OpenCravat does a great job of installing a huge number of annotation sources. However, its output requires some biological background to read, so we are currently working on:
- reporting plugins to make reports for pre-selected genes
- longevity plugin for analysis of gene variants associated with longevity
- mitochondrial variant calling
- longevity annotations