diff --git a/docs/source/user_guide/pipelines/gp_rnaseq.rst b/docs/source/user_guide/pipelines/gp_rnaseq.rst index 2af93b9..e0e54d6 100644 --- a/docs/source/user_guide/pipelines/gp_rnaseq.rst +++ b/docs/source/user_guide/pipelines/gp_rnaseq.rst @@ -4,7 +4,7 @@ html goseq - Stringtie + StringTie RNA Sequencing Pipeline ======================== @@ -17,11 +17,16 @@ RNA Sequencing Pipeline .. card:: - The standard MUGQIC RNA-Seq pipeline is based on the use of the `STAR aligner `_ to align reads to the reference genome. These alignments are used during downstream analysis to determine genes and transcripts differential expression. The `Cufflinks suite `_ is used for the transcript analysis whereas `DESeq `_ and `edgeR `_ are used for the gene analysis. The RNAseq pipeline requires to provide a :ref:`design` file which will be used to define group comparison in the differential analyses. The differential gene analysis is followed by a Gene Ontology (GO) enrichment analysis. This analysis use the `goseq approach `_. The goseq is based on the use of non-native GO terms (see details in the section 5 of the `corresponding vignette `_. + The standard MUGQIC RNA-Seq pipeline has three protocols: - Finally, a summary html report is automatically generated by the pipeline at the end of the analysis. This report contains description of the sequencing experiment as well as a detailed presentation of the pipeline steps and results. Various Quality Control (QC) summary statistics are included in the report and additional QC analysis is accessible for download directly through the report. The report includes also the main references of the software tools and methods used during the analysis, together with the full list of parameters that have been passed to the pipeline main script. + * StringTie + * Variants + * Cancer + + StringTie is the default protocol and applicable in most cases. - See :ref:`More Information` section below for details. 
+ All three protocols are based on the use of the `STAR aligner `_ to align reads to the reference genome. These alignments are used during downstream analysis (for example, in the StringTie protocol, to determine differential expression of genes and transcripts). + .. tab-item:: Usage @@ -40,13 +45,15 @@ RNA Sequencing Pipeline .. dropdown:: Example Run - You can download `RNA Sequencing Pipeline test dataset `_ and use the following command to execute the RNA Sequencing genomics pipeline: - .. include:: /user_guide/pipelines/example_runs/rnaseq.inc .. include:: /user_guide/pipelines/notes/scriptfile_deprecation.inc - This set of commands is meant for running GenPipes on C3 data center. For more details, you can refer to GenPipes `RNA Sequencing Workshop 2018 presentation `_. + .. note:: + + This set of commands is meant for running GenPipes on the C3 data center. For more details, you can refer to the GenPipes `RNA Sequencing Workshop 2018 presentation `_. + + You can download `RNA Sequencing Pipeline test dataset `_ and use the following command to execute the RNA Sequencing genomics pipeline: .. dropdown:: Options @@ -56,25 +63,7 @@ RNA Sequencing Pipeline .. include:: /common/gp_common_opt.inc .. tab-item:: Schema - :name: rnaschema - - .. dropdown:: CuffLinks - - Figure below shows the schema of the RNA sequencing protocol (cufflinks). - - .. figure:: /img/pipelines/mmd/rnaseq.cufflinks.mmd.png - :align: center - :alt: rnaseq schema - :width: 100% - :figwidth: 95% - - Figure: Schema of RNA Sequencing pipeline (Cufflinks) - - .. figure:: /img/pipelines/mmd/legend.mmd.png - :align: center - :alt: dada2 ampseq - :width: 100% - :figwidth: 75% + :name: rnaschema .. dropdown:: StringTie @@ -96,59 +85,69 @@ RNA Sequencing Pipeline .. 
tab-item:: Steps - +----+-------------------------------------------+----------------------------------------+ - | | *RNA Sequencing (Cufflinks)* | *RNA Sequencing (Stringtie)* | - +====+===========================================+========================================+ - | 1. | |picard_sam_to_fastq| | |picard_sam_to_fastq| | - +----+-------------------------------------------+----------------------------------------+ - | 2. | |trimmomatic| | |trimmomatic| | - +----+-------------------------------------------+----------------------------------------+ - | 3. | |merge_trimmomatic_stats| | |merge_trimmomatic_stats| | - +----+-------------------------------------------+----------------------------------------+ - | 4. | |star_step| | |star_step| | - +----+-------------------------------------------+----------------------------------------+ - | 5. | |picard_merge_sam_files| | |picard_merge_sam_files| | - +----+-------------------------------------------+----------------------------------------+ - | 6. | |picard_sort_sam| | |picard_sort_sam| | - +----+-------------------------------------------+----------------------------------------+ - | 7. | |picard_mark_duplicates| | |picard_mark_duplicates| | - +----+-------------------------------------------+----------------------------------------+ - | 8. | |picard_rna_metrics| | |picard_rna_metrics| | - +----+-------------------------------------------+----------------------------------------+ - | 9. 
| |estimate_ribosomal_rna| | |estimate_ribosomal_rna| | - +----+-------------------------------------------+----------------------------------------+ - | 10.| |bam_hard_clip| | |bam_hard_clip| | - +----+-------------------------------------------+----------------------------------------+ - | 11.| |rnaseqc| | |rnaseqc| | - +----+-------------------------------------------+----------------------------------------+ - | 12.| |wiggle| | |wiggle| | - +----+-------------------------------------------+----------------------------------------+ - | 13.| |raw_counts| | |raw_counts| | - +----+-------------------------------------------+----------------------------------------+ - | 14.| |raw_counts_metrics| | |raw_counts_metrics| | - +----+-------------------------------------------+----------------------------------------+ - | 15.| |cufflinks| | |stringtie| | - +----+-------------------------------------------+----------------------------------------+ - | 16.| |cuffmerge| | |stringtie_merge| | - +----+-------------------------------------------+----------------------------------------+ - | 17.| |cuffquant| | |stringtie_abund| | - +----+-------------------------------------------+----------------------------------------+ - | 18.| |cuffdiff| | |ballgown| | - +----+-------------------------------------------+----------------------------------------+ - | 19.| |cuffnorm| | |differential_expression| | - +----+-------------------------------------------+----------------------------------------+ - | 20.| |fpkm_correlation_matrix| | |cram_output| | - +----+-------------------------------------------+----------------------------------------+ - | 21.| |gq_seq_utils_exploratory_analysis_rnaseq|| | - +----+-------------------------------------------+ | - | 22.| |differential_expression| | | - +----+-------------------------------------------+ | - | 23.| |differential_expression_goseq| | | - +----+-------------------------------------------+ | - | 24.| |ihec_metrics| | | - 
+----+-------------------------------------------+ | - | 25.| |cram_output| | | - +----+-------------------------------------------+----------------------------------------+ + +----+-----------------------------+------------------------------------+-----------------------------------+ + | | *Stringtie* | *Variants* | *Cancer* | + +====+=============================+====================================+===================================+ + | 1. | |picard_sam_to_fastq| | |picard_sam_to_fastq| | |picard_sam_to_fastq| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 2. | |trimmomatic| | |skewer_trimming| | |skewer_trimming| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 3. | |merge_trimmomatic_stats| | |sortmerna_s| | |sortmerna_s| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 4. | |sortmerna_s| | |star_step| | |star_step| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 5. | |star_step| | |picard_merge_sam_files| | |picard_merge_sam_files| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 6. | |picard_merge_sam_files| | |mark_duplicates| | |mark_duplicates| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 7. | |picard_sort_sam| | |split_N_trim| | |split_N_trim| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 8. | |mark_duplicates| | |sambamba_merge_splitNtrim_files| | |sambamba_merge_splitNtrim_files| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 9. 
| |picard_rna_metrics| | |gatk_indel_realigner| | |gatk_indel_realigner| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 10.| |estimate_ribosomal_rna| | |sambamba_merge_realigned| | |sambamba_merge_realigned| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 11.| |rnaseqc2| | |recalibration| | |recalibration| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 12.| |wiggle| | |gatk_haplotype_caller| | |gatk_haplotype_caller| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 13.| |raw_counts| | |merge_hc_vcf| | |merge_hc_vcf| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 14.| |raw_counts_metrics| | |run_vcfanno| | |run_vcfanno| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 15.| |stringtie_s| | |variant_filtration| | |variant_filtration| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 16.| |stringtie_merge| | |decompose_and_normalize| | |decompose_and_normalize| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 17.| |stringtie_abund| | |compute_snp_effects| | |filter_gatk| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 18.| |ballgown| | |gemini_annotations| | |report_cpsr| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 19.| |differential_expression| | |picard_rna_metrics| | |report_pcgr| | + 
+----+-----------------------------+------------------------------------+-----------------------------------+ + | 20.| |multiqc| | |estimate_ribosomal_rna| | |run_star_fusion| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 21.| |cram_output| | |rnaseqc2| | |run_arriba| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 22.| | |gatk_callable_loci| | |run_annofuse| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 23.| | |wiggle| | |picard_rna_metrics| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 24.| | |multiqc| | |estimate_ribosomal_rna| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 25.| | |cram_output| | |rnaseqc2| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 26.| | | |rseqc| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 27.| | | |gatk_callable_loci| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 28.| | | |wiggle| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 29.| | | |multiqc| | + +----+-----------------------------+------------------------------------+-----------------------------------+ + | 30.| | | |cram_output| | + +-----------------------------------------------------------------------+-----------------------------------+ .. card:: @@ -158,7 +157,24 @@ RNA Sequencing Pipeline .. 
card:: - This pipeline aligns reads with `STAR `_ 2-passes mode, assembles transcripts with `Cufflinks `_, and performs differential expression with `Cuffdiff `_. In parallel, gene-level expression is quantified using `htseq-count `_, which produces raw read counts that are subsequently used for differential gene expression with both `DESeq process `_ and `edgeR algorithms `_. Several common quality metrics (e.g., ribosomal RNA content, expression saturation estimation) are also calculated through the use of `RNA-SeQC `_ and in-house scripts. Gene Ontology terms are also tested for over-representation using `GOseq `_. Expressed short single-nucleotide variants (SNVs) and indels calling is also performed by this pipeline, which optimizes GATK best practices to reach a sensitivity of 92.8%, precision of 87.7%, and F1 score of 90.1%. + **StringTie Protocol** + + The `StringTie <https://ccb.jhu.edu/software/stringtie/>`_ suite is used for differential transcript expression (DTE) analysis, whereas `DESeq2 <https://bioconductor.org/packages/release/bioc/html/DESeq2.html>`_ and `edgeR <http://bioconductor.org/packages/release/bioc/html/edgeR.html>`_ are used for differential gene expression (DGE) analysis. + + The StringTie protocol requires a design file, which is used to define the comparison groups + in the differential analyses. The design file format is described `here <https://genpipes.readthedocs.io/en/latest/get-started/concepts/design_file.html>`__. In addition, `Ballgown <https://bioconductor.org/packages/release/bioc/html/ballgown.html>`_ is used to calculate differential transcript and gene expression levels and test them for significant differences. The protocol can also take an optional batch file, which is used to correct for batch effects in the differential analyses. The batch file format is described `here <https://bitbucket.org/mugqic/mugqic_pipelines/src#markdown-header-batch-file>`__. 
+ + **Variants Protocol** + + The variants protocol is used when variant detection is the main goal of the analysis. The GATK best practices workflow is used for variant calling. It also contains a step for annotating genes using the `gemini protocol `_. + + **Cancer Protocol** + + The cancer protocol contains all the steps in the variants protocol, but it is specifically designed for cancer data sets due to the complexity of cancer samples and the additional analyses those projects often entail. The goal of the cancer protocol is to compare expression to known benchmarks. In addition to the steps in the variants protocol, it contains four specific steps. Three of them (run_star_fusion, run_arriba, run_annofuse) are related to the detection and annotation of gene fusions. For that, `STAR-Fusion <https://github.com/STAR-Fusion/STAR-Fusion-Tutorial/wiki>`_, `Arriba <https://arriba.readthedocs.io/en/latest/>`_ and `annoFuse <https://rdrr.io/github/d3b-center/annoFuse/>`_ are used. Furthermore, `RSeQC <http://rseqc.sourceforge.net/>`_ provides RNA-seq quality control metrics to assess the quality of the data. + + Finally, a summary HTML report is automatically generated by the pipeline at the end of the analysis. This report contains a description of the sequencing experiment as well as a detailed presentation of the pipeline steps and results. Various Quality Control (QC) summary statistics are included in the report and additional QC analysis is accessible for download directly through the report. The report also includes the main references of the software tools and methods used during the analysis, together with the full list of parameters that have been passed to the pipeline main script. + + See :ref:`More Information` section below for details. .. _More Information on RNA Sequencing: @@ -174,32 +190,46 @@ For the latest implementation and usage details refer to RNA Sequencing implemen .. The following are replacement texts used in this file .. 
|picard_sam_to_fastq| replace:: `Picard SAM to FastQ`_ -.. |trimmomatic| replace:: `Trimmomatic`_ -.. |merge_trimmomatic_stats| replace:: `Merge Trimmomatic Stats`_ .. |star_step| replace:: `Star Processing`_ +.. |run_star_fusion| replace:: `Run Star Fusion`_ .. |picard_merge_sam_files| replace:: `Picard Merge SAM Files`_ .. |picard_sort_sam| replace:: `Picard Sort SAM`_ -.. |picard_mark_duplicates| replace:: `Picard Mark Duplicates`_ .. |picard_rna_metrics| replace:: `Picard RNA Metrics`_ +.. |mark_duplicates| replace:: `Mark Duplicates`_ +.. |trimmomatic| replace:: `Trimmomatic`_ +.. |merge_trimmomatic_stats| replace:: `Merge Trimmomatic Stats`_ .. |estimate_ribosomal_rna| replace:: `Estimate Ribosomal RNA`_ -.. |bam_hard_clip| replace:: `BAM Hard Clip`_ .. |rnaseqc| replace:: `RNA Seq Compress`_ .. |wiggle| replace:: `Wiggle`_ .. |raw_counts| replace:: `Raw Counts`_ .. |raw_counts_metrics| replace:: `Raw Counts Metrics`_ -.. |cufflinks| replace:: `Cufflinks Process`_ -.. |cuffmerge| replace:: `Cuffmerge Process`_ -.. |cuffquant| replace:: `Cuffquant Step`_ -.. |cuffdiff| replace:: `Cuffdiff Process`_ -.. |cuffnorm| replace:: `Cuffnorm Normalization`_ -.. |fpkm_correlation_matrix| replace:: `FPKM Correlation`_ -.. |gq_seq_utils_exploratory_analysis_rnaseq| replace:: `GQ RNA Sequencing Utility`_ .. |differential_expression| replace:: `Differential Expression`_ -.. |differential_expression_goseq| replace:: `Differential Expression GO sequencing`_ -.. |ihec_metrics| replace:: `IHEC Metrics`_ -.. |stringtie| replace:: `Stringtie`_ -.. |stringtie_abund| replace:: `Stringtie Assemble Transcriptome`_ -.. |stringtie_merge| replace:: `Stringtie Merge`_ +.. |stringtie_s| replace:: `StringTie Step`_ +.. |stringtie_abund| replace:: `StringTie Abund`_ +.. |stringtie_merge| replace:: `StringTie Merge`_ .. |ballgown| replace:: `Ballgown Gene Expression`_ +.. |sortmerna_s| replace:: `Sortmerna Step`_ +.. |rnaseqc2| replace:: `Rnaseqc2`_ +.. 
|skewer_trimming| replace:: `Skewer Trimming`_ +.. |split_N_trim| replace:: `Split N Trim`_ +.. |sambamba_merge_splitNtrim_files| replace:: `Sambamba Merge Split N Trim Files`_ +.. |sambamba_merge_realigned| replace:: `Sambamba Merge Realigned`_ +.. |gatk_indel_realigner| replace:: `GATK Indel Realigner`_ +.. |gatk_haplotype_caller| replace:: `GATK Haplotype Caller`_ +.. |gatk_callable_loci| replace:: `GATK Callable Loci`_ +.. |filter_gatk| replace:: `Filter GATK`_ +.. |recalibration| replace:: `Recalibration`_ +.. |merge_hc_vcf| replace:: `Merge HC VCF`_ +.. |run_vcfanno| replace:: `Run VCF Anno`_ +.. |variant_filtration| replace:: `Variant Filtration`_ +.. |decompose_and_normalize| replace:: `Decompose and Normalize`_ +.. |compute_snp_effects| replace:: `Compute SNP Effects`_ +.. |gemini_annotations| replace:: `Gemini Annotations`_ +.. |report_cpsr| replace:: `Report CPSR`_ +.. |report_pcgr| replace:: `Report PCGR`_ +.. |run_arriba| replace:: `Run Arriba`_ +.. |run_annofuse| replace:: `Run Annofuse`_ +.. |rseqc| replace:: `RSeqC`_ +.. |multiqc| replace:: `Multiqc Report`_ .. include:: repl_cram_op.inc diff --git a/docs/source/user_guide/pipelines/steps_rnaseq.inc b/docs/source/user_guide/pipelines/steps_rnaseq.inc index a3bd909..0b0ec4b 100644 --- a/docs/source/user_guide/pipelines/steps_rnaseq.inc +++ b/docs/source/user_guide/pipelines/steps_rnaseq.inc @@ -55,6 +55,12 @@ This step takes as input files: * Else, FASTQ files from the readset file if available * Else, FASTQ output files from previous picard_sam_to_fastq conversion of BAM files +.. _Run Star Fusion: + +**Run Star Fusion** + +`STAR-Fusion `_ is a component of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT). Based on the STAR aligner it identifies candidate fusion transcripts supported by Illumina reads. + .. _Picard Merge SAM Files: **Picard Merge SAM Files** @@ -67,11 +73,11 @@ BAM readset files are merged into one file per sample. 
Merge is done using `Pica The alignment file is reordered (QueryName) using `Picard Tool `_. The QueryName-sorted BAM files will be used to determine raw read counts. -.. _Picard Mark Duplicates: +.. _Mark Duplicates: -**Picard Mark Duplicates** +**Mark Duplicates** -Mark duplicates. Aligned reads per sample are duplicates if they have the same 5' alignment positions (for both mates in the case of paired-end reads). All but the best pair (based on alignment score) will be marked as a duplicate in the BAM file. Marking duplicates is done using `Picard package `_. +This step handles duplicates. Aligned reads per sample are duplicates if they have the same 5' alignment positions (for both mates in the case of paired-end reads). All but the best pair (based on alignment score) will be marked as a duplicate in the BAM file. Marking duplicates is done using `Picard package `_. .. _Picard RNA Metrics: @@ -85,12 +91,6 @@ Computes a series of quality control metrics using both CollectRnaSeqMetrics and This step uses readset BAM files and bwa mem to align reads on the rRNA reference fasta and count the number of read mapped The filtered reads are aligned to a reference fasta file of ribosomal sequence. The alignment is done per sequencing readset. The alignment software used is `BWA package `_ with algorithm: bwa mem. BWA output BAM files are then sorted by coordinate using `Picard package `_. -.. _BAM Hard Clip: - -**BAM Hard Clip** - -Generate a hardclipped version of the BAM for the toxedo suite which does not support this official SAM feature. - .. _RNA Seq Compress: **RNA Seq Compress** @@ -115,89 +115,182 @@ Count reads in feature using `HT Seq Count `_. Warning: It needs to use a hard clipped bam file while Tuxedo tools do not support official soft clip SAM format. +Performs differential gene expression analysis using `DESEQ package `_ and `EDGER package `_. Merge the results of the analysis in a single csv file. -.. _Cuffmerge Process: +.. 
_StringTie Step: -**Cuffmerge Process** +**StringTie Step** -Merge assemblies into a master transcriptome reference using `cuffmerge package `_. +Assemble transcriptome using `StringTie assembler `_. -.. _Cuffquant Step: +.. _StringTie Merge: -**Cuffquant Step** +**StringTie Merge** -Compute expression profiles (abundances.cxb) using `cuffquant `_. Warning: It needs to use a hard clipped bam file while Tuxedo tools do not support official soft clip SAM format. +Merge assemblies into a master transcriptome reference using `StringTie assembler `_. -.. _Cuffdiff Process: +.. _StringTie Abund: -**Cuffdiff Process** +**StringTie Abund** -`Cuffdiff package `_ is used to calculate differential transcript expression levels and test them for significant differences. +Assemble transcriptome and compute RNA-seq expression using `StringTie `_. -.. _Cuffnorm Normalization: +.. _Ballgown Gene Expression: -**Cuffnorm Normalization** +**Ballgown Gene Expression** -This step performs global normalization of RNA-Sequence expression levels using `Cuffnorm algorithm `_. +`Ballgown tool `_ is used to calculate differential transcript and gene expression levels and test them for significant differences. -.. _FPKM Correlation: +.. _Sortmerna Step: -**FPKM Correlation** +**Sortmerna Step** -Compute the pearson correlation matrix of gene and transcripts FPKM. FPKM data are those estimated by cuffnorm. +This step calculates the ribosomal RNA per read based on known ribosomal sequences from archaea, bacteria and eukaryotes. It uses the `sortmeRNA `_ protocol, which takes trimmed fastqs and reports on each read, either paired-end or single-end. -.. _GQ RNA Sequencing Utility: +.. _Rnaseqc2: -**GQ RNA Sequencing Utility** +**Rnaseqc2** -Exploratory analysis using the `gqSeqUtils R package `_. +Computes a series of quality control metrics using `RNA-SeQC `_. -.. _Differential Expression: +.. 
_Skewer Trimming: -**Differential Expression** +**Skewer Trimming** -Performs differential gene expression analysis using `DESEQ package `_ and `EDGER package `_. Merge the results of the analysis in a single csv file. +`Skewer `_ is used mainly for +detecting and trimming adapter sequences from raw fastq files. Other features of Skewer are listed +`here `_. -.. _Differential Expression GO sequencing: +.. _Split N Trim: -**Differential Expression GO sequencing** +**Split N Trim** -Gene Ontology analysis for RNA-Seq using the `Bioconductor's R package goseq `_. -Generates GO annotations for differential gene expression analysis. +During the 'Split N Trim' step, a `GATK Tool `_ called `SplitNCigarReads`, developed specially for RNAseq, splits reads into exon segments. During this splitting, it gets rid of Ns but maintains grouping information and hard-clips any sequences overhanging into the intronic regions. -.. _IHEC Metrics: +.. _Sambamba Merge Split N Trim Files: -**IHEC Metrics** +**Sambamba Merge Split N Trim Files** -Generate IHEC's standard metrics. +BAM readset files are merged into one file per sample. Merge is done using `Sambamba Merge `_. -.. _Stringtie: +.. _Sambamba Merge Realigned: -**Stringtie** +**Sambamba Merge Realigned** -Assemble transcriptome using `Stringtie assembler `_. +In this step, BAM files of regions of realigned reads are merged per sample using `Sambamba `_. -.. _Stringtie Assemble Transcriptome: +.. _GATK Indel Realigner: -**Stringtie Assemble Transcriptome** +**GATK Indel Realigner** -This step assembles transcriptome and compute RNA-seq expression using `Stringtie assembler `_. +Insertion and deletion realignment is performed on regions where the aligner prefers multiple base mismatches over indels because mismatches can appear less costly to the algorithm. Such regions will introduce false positive variant calls, which may be filtered out by realigning those regions properly. Realignment is done using `GATK `_. 
The reference genome is divided into a number of regions given by the `nb_jobs` parameter. -.. _Stringtie Merge: +.. _GATK Haplotype Caller: -**Stringtie Merge** +**GATK Haplotype Caller** -Merge assemblies into a master transcriptome reference using `Stringtie assembler `_. +`GATK haplotype caller step `_ is used to call SNPs and small indels. The Haplotype caller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. Regions that contain different types of variants close to each other are traditionally difficult to call. For such regions, HaplotypeCaller is more accurate. This is because whenever it encounters such regions with different types of variants, it discards the existing mapping information and completely reassembles the reads in that region. -.. _Ballgown Gene Expression: +.. _GATK Callable Loci: -**Ballgown Gene Expression** +**GATK Callable Loci** -`Ballgown tool `_ is used to calculate differential transcript and gene expression levels and test them for significant differences. +This step computes the callable regions of the genome as a BED track. + +.. _Filter GATK: + +**Filter GATK** + +As part of filter GATK processing, a custom script is applied to inject FORMAT information - tumor/normal DP and VAP into the INFO field +of the filter on those generated fields. + +.. _Recalibration: + +**Recalibration** + +In this step, we recalibrate the base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, +the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so, provides not only more accurate quality scores but also more widely dispersed ones. + +.. 
_Merge HC VCF: + +**Merge HC VCF** + +Merges VCFs from `Haplotype caller `_ to generate a sample-level VCF. + +.. _Run VCF Anno: + +**Run VCF Anno** + +`VCFAnno `_ is used to annotate VCF files with preferred INFO fields from any number of VCF or BED files. + +.. _Variant Filtration: + +**Variant Filtration** + +`VariantFiltration `_ is a GATK tool for hard-filtering variant calls based on certain criteria. Records are hard-filtered +by changing the value in the FILTER field to something other than PASS. + +.. _Decompose and Normalize: + +**Decompose and Normalize** + +The `vt Normalization `_ is used to normalize and decompose VCF files. For more +information about normalizing and decomposing, visit `Variant Normalization `_. An indexed file is also generated from the output file using `htslib `_. + +.. _Compute SNP Effects: + +**Compute SNP Effects** + +`SnpEff `_ is used for variant annotation and effect prediction on genes by using an interval forest approach. It annotates and predicts the effects of genetic variants, such as amino acid changes. + +.. _Gemini Annotations: + +**Gemini Annotations** + +`Gemini `_ (GEnome MINIng) is used for integrative exploration of genetic +variation and genome annotations. See `latest Gemini documentation `_ for more information. + +.. _Report CPSR: + +**Report CPSR** + +In this step, a `cpsr `_ germline report is created. The input is a filtered ensemble germline VCF +and the output is an HTML report with additional flat files. + +.. _Report PCGR: + +**Report PCGR** + +Creates a `PCGR `_ somatic + germline report. Input is a filtered ensemble germline VCF and the output is an html report with additional flat files. + +.. _Run Arriba: + +**Run Arriba** + +`Arriba `_ is a command-line tool for the detection of gene fusions from RNA-Seq data. It is based on the `STAR `_ aligner. 
Apart from gene fusions, Arriba can detect other structural rearrangements with potential clinical relevance, including viral integration sites, internal tandem duplications, whole exon duplications and intragenic inversions. + +.. _Run Annofuse: + +**Run Annofuse** + +`Annofuse `_ is an R package used to annotate, prioritize, and interactively explore putative oncogenic RNA fusions. + +.. _RSeqC: + +**RSeqC** + +RSeqC computes a series of quality control metrics using both CollectRnaSeqMetrics and CollectAlignmentSummaryMetrics functions. +Metrics are collected using `Picard `_. + +.. _Multiqc Report: + +**Multiqc Report** + +A quality control report for all samples is generated. +For more detailed information see `MultiQC documentation `_. .. include:: steps_cram_op.inc