NB - this code / pipeline structure remains mostly untouched since it was first written by me in 2013 / 14 while working in the National Health Service (NHS) England - it would be designed differently were it written today, and likely using a proper workflow manager. The main areas where it could be developed are:
- improved command line parameter parsing
- improved error handling, for example, checking output of each code section and making a decision to proceed or not
Although it runs AOK in its current state, anyone re-using this code should make changes as you see fit.
I update this pipeline as I re-use it myself in order to keep it maintained on a low level in line with new program versions that are released.
Last update: Sunday, 31st March, 2019 @ 02:20 BST (GMT+1)
Automated next generation DNA sequencing analysis pipeline 'suited' for clinical tests, with >99.9% sensitivity to Sanger sequencing for Single Nucleotide Variants (SNVs) at read-depth>18 over target regions over interest.
This clinical-grade analysis pipeline, ClinicalGradeDNAseq, is a watered-down and modified version of the work by Blighe, Beauchamp, and colleagues at Sheffield Diagnostic Genetics Service, Sheffield Children's NHS Foundation Trust, Sheffield, UK, and their efforts to introduce a clinical-grade next generation sequencing (NGS) analysis pipeline fully validated against Sanger di-deoxy sequencing.
The pipeline is built using open source programs mixed with customised scripts. A wrapper script manages command line parameters and then executes the master analysis script. Control is then returned to the wrapper, where results files are transferred to a remote server via SSH/sFTP. A master and concise log is kept, with date- and time-stamps. Results directory structure is formed based on the run number and patient ID.
[Key point] The unique feature of the analysis pipeline that increases sensitivity to Sanger sequencing is in the variant calling step, where a final aligned BAM is split into 3 'sub-BAMs' of 75%, 50%, and 25% random reads. Variants are then called on all 4 BAMs, after which a consensus VCF is produced.
This pipeline proceeds in an 8-step process:
- Adaptor and read quality trimming - TrimGalore! (Krueger F), FastQC (Andrews S), cutadapt (Martin M, 2011)
- Alignment - bwa mem (Li & Durbin, 2009)
- Marking and removing PCR duplicates - Picard (Broad Institute of MIT and Harvard), SAMtools (Li et al., 2009)
- Remove low mapping quality reads - SAMtools (Li et al., 2009)
- QC - SAMtools (Li et al., 2009), BEDTools (Quinlan & Hall, 2010), custom scripts
- Downsampling / random read sampling - Picard (Broad Institute of MIT and Harvard)
- Variant calling - SAMtools/BCFtools (Li et al., 2009)
- Annotation - Variant Effect Predictor (McLaren et al., 2016)
- Paired-end FASTQ files
- Chromosome-ordered and position-sorted BED file in hg19 / hg38 (BED files can be sorted with sort -k1,1V -k2,2n)
- Global installations of cutadapt and unix2dos (included in dos2unix)
- Valid credentials for returning results files to remote server (username / password) via SSH/sFTP
-
Run the ‘PipelineWrapper’ wrapper script, which will check command-line parameters, execute the master script, and then return results files via SSH/sFTP to remote server. Use the following parameters:
- FASTQ mate-pair 1 (absolute file path)
- FASTQ mate-pair 2 (absolute file path)
- Reference genome FASTA (absolute file path)
- Run number (e.g. Plate6, Plate7, etc.) (alphanumeric)
- Patient ID (alphanumeric)
- BED file (absolute file path)
- Minimum quality for bases at read ends, below which bases will be cut (integer)
- Minimum allowed read length (integer)
- Adaptor for trimming off read ends ('illumina' / 'nextera' / 'small_rna')
- Minimum read depth for calling a variant (integer)
- Minimum allowed mapping quality (integer)
- Stringency for calling variants ('relaxed' / 'normal') (relaxed uses
--pval-threshold 1.0
with BCFtools call) - Directory where results will be output (absolute file path)
- User initials (alphanumeric)
- *_AnalysisLog.txt - analysis log (short)
- Master.log - analysis log (comprehensive)
- *_R1_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 1
- *_R1_001_val_1_fastqc.html - FastQC report for mate-pair 1 (after trimming)
- *_R2_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 2
- *_R2_001_val_2_fastqc.html - FastQC report for mate-pair 2 (after trimming)
- *_Alignment.txt - alignment metrics
- *_ReadsOffTarget.txt - number of reads falling outside regions specified in BED file
- *_PCRDuplicates.txt - details on identified PCR duplicates
- *_CoverageTotal.bedgraph - coverage for all mapped locations (contiguous bases at same read depth are merged into regions)
- *_MeanCoverageBED.bedgraph - mean read depth for each region specified in supplied BED file
- *_PerBaseDepthBED.bedgraph - per base read depth for each base in each region specified in supplied BED file
- *_PercentGenomeCovered.txt - percentage of reference genome covered by reads.
- *_Aligned_Sorted_PCRDuped_FiltMAPQ.bam - aligned BAM file with sorted reads, PCR duplicates removed, and reads below mapping quality threshold removed
- *_Aligned_Sorted_PCRDuped_FiltMAPQ.bam.bai - index for above BAM file
- *_Final.vcf - final VCF file
- *_AnnotationVEP.html - HTML report of variant annotation, with consequences for all known transcript isoforms
- *_AnnotationVEP.tsv - as above but in tab-separated values (TSV) format
- PipelineWrapper.sh, line 120:
/home/ubuntu/pipeline/AnalysisMasterVersion1.sh "${Read1}" "${Read2}" ...
- absolute path filename for AnalysisMasterVersion1.sh - PipelineWrapper.sh, line 138:
remoteDir="/remote/SAMBA/share/"
- Remote server directory to which results files will be transferred via SSH/sFTP - PipelineWrapper.sh, line 150, 161:
sshpass -e sftp [email protected] << !
- Remote server IP address or host name to which results files will be transferred via SSH/sFTP - AnalysisMasterVersion1.sh, lines 25-35 - root directories (absolute paths) of required programs
- AnalysisMasterVersion1.sh, lines 286/304 - minimum base quality (
--min-BQ
) set to 30 for BCFtools mpileup - AnalysisMasterVersion1.sh, lines 374 - extra filters for filtering variants via vcfutils.pl (bundled with BCFtools in 'misc' directory)
- AnalysisMasterVersion1.sh, line 396/7 - species and assembly for VEP set to Homo sapiens and GRCh38
- Andrews S, FastQC, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, last accessed 28th August 2017.
- Broad Institute of MIT and Harvard, Picard, http://broadinstitute.github.io/picard/, last accessed 28th August 2017
- Krueger F, Trim Galore!, https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/, last accessed 28th August 2017.
- Li H and Durbin R (2009), Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 25(14): 1754–1760.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup (2009), The Sequence Alignment/Map format and SAMtools, Bioinformatics 25(16): 2078-9.
- Martin M (2011), Cutadapt removes adapter sequences from high-throughput sequencing reads, EMBnet.journal 17(1): 10-12.
- McLaren W, Gil L, Hunt S, Riat H, Ritchie G, Thormann A, Flicek P, Cunningham F (2016), The Ensembl Variant Effect Predictor, Genome Biology 17: 122.
- Quinlan AR & Hall IM (2010), BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics 26(6): 841-2.
- Kevin Blighe (Sheffield Children's NHS Foundation Trust)
- Nick Beauchamp (Sheffield Children's NHS Foundation Trust)
- Darren Grafham (Sheffield Children's NHS Foundation Trust)
- Lucy Crookes (Sheffield Children's NHS Foundation Trust)
- Sasirekha Palaniswamy ('Sashi') (Sheffield Children's NHS Foundation Trust)
- Sheffield Diagnostics Genetics Service