This document outlines all the steps and considerations for calling and merging a trio using DeepVariant and GLnexus. These best practices were developed and evaluated as described in the article published in Bioinformatics: Accurate, scalable cohort variant calls using DeepVariant and GLnexus (2021).
The process involves 3 major stages: running DeepVariant to create individual genome call sets, running GLnexus to merge call sets, and analyzing the merged call set.
NOTE: This case study demonstrates an example of how to run DeepVariant end-to-end on one machine. The steps below were done on a machine with this example command to start a machine.
The steps in this document can be extended to merge larger cohorts as well.
See this workflow:
A few things to note before we start:
- If you are looking for ways to run DeepVariant in larger batches, please refer to the third party solutions section.
- It is recommended to use BAM files with original quality scores. In the case
that BAM files went through recalibration, optional DV flags can be used in
order to use original scores:
--parse_sam_aux_fields
,--use_original_quality_scores
. - DeepVariant optionally allows gVCF output. This option is required for further GLnexus analysis in this document.
The Whole Exome Sequencing (WES) dataset we're using is from:
ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/
- HG002_NA24385_son
- HG003_NA24149_father
- HG004_NA24143_mother
Just for convenience, we use aria2 to download our data. You can change it to whatever other tools (wget, curl) that you prefer.
To install aria2, you can run: sudo apt-get -y install aria2
DIR="${PWD}/trio"
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam -o HG002.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bai -o HG002.bai
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008.posiSrt.markDup.bam -o HG003.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008.posiSrt.markDup.bai -o HG003.bai
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG004-EEogPU_v02-KIT-Av5_CCGAAGTA_L008.posiSrt.markDup.bam -o HG004.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG004-EEogPU_v02-KIT-Av5_CCGAAGTA_L008.posiSrt.markDup.bai -o HG004.bai
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/hs37d5.fa.gz
gunzip ${DIR}/hs37d5.fa.gz
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/hs37d5.fa.fai
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/agilent_sureselect_human_all_exon_v5_b37_targets.bed
HG002:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh37/HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz -o HG002_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh37/HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz.tbi -o HG002_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh37/HG002_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed -o HG002_truth.bed
HG003:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh37/HG003_GRCh37_1_22_v4.2.1_benchmark.vcf.gz -o HG003_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh37/HG003_GRCh37_1_22_v4.2.1_benchmark.vcf.gz.tbi -o HG003_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh37/HG003_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed -o HG003_truth.bed
HG004:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh37/HG004_GRCh37_1_22_v4.2.1_benchmark.vcf.gz -o HG004_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh37/HG004_GRCh37_1_22_v4.2.1_benchmark.vcf.gz.tbi -o HG004_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh37/HG004_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed -o HG004_truth.bed
(No need to install bcftools and other tools, because they are now installed in the DeepVariant images.)
First, install docker if you don't have it yet: sudo apt-get -y install docker.io
With the example command below, it runs DeepVariant on the trio one by one. This is for demonstration only. If you're running this on a large cohort, running serially is not the most effective approach.
N_SHARDS=$(nproc) # Or change to the number of cores you want to use
CAPTURE_BED=agilent_sureselect_human_all_exon_v5_b37_targets.bed
VERSION=1.2.0
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
BAM=${SAMPLE}.bam
OUTPUT_VCF=${SAMPLE}.vcf.gz
OUTPUT_GVCF=${SAMPLE}.g.vcf.gz
time sudo docker run \
-v "${DIR}":"/data" \
google/deepvariant:${VERSION} \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WES \
--ref="/data/hs37d5.fa" \
--reads="/data/${BAM}" \
--regions="/data/${CAPTURE_BED}" \
--output_vcf="/data/${OUTPUT_VCF}" \
--output_gvcf="/data/${OUTPUT_GVCF}" \
--num_shards=${N_SHARDS}
done
Note: The BAM files should provide unique names for each sample in their SM
header tag, which is usually derived from a command-line flag to the read
aligner. If your BAM files don't have unique SM
tags (and if it's not feasible
to adjust the alignment pipeline), add the --sample_name=XYZ
flag to
run_deepvariant
to override the sample name written into the gVCF file header.
And then run GLnexus with this config:
sudo docker pull quay.io/mlin/glnexus:v1.2.7
time sudo docker run \
-v "${DIR}":"/data" \
quay.io/mlin/glnexus:v1.2.7 \
/usr/local/bin/glnexus_cli \
--config DeepVariantWES \
--bed "/data/${CAPTURE_BED}" \
/data/HG004.g.vcf.gz /data/HG003.g.vcf.gz /data/HG002.g.vcf.gz \
| sudo docker run -i google/deepvariant:${VERSION} bcftools view - \
| sudo docker run -i google/deepvariant:${VERSION} bgzip -c \
> ${DIR}/deepvariant.cohort.vcf.gz
When we ran on this WES trio, it took only about 13 seconds. For more details on performance, see GLnexus performance guide.
For a WGS cohort, we recommend using --config DeepVariantWGS
instead of
DeepVariantWES
. Another preset DeepVariant_unfiltered
is available in
glnexus:v1.2.7
or later versions for merging DeepVariant gVCFs with no QC
filters or genotype revision (see
GitHub issue #326 for a
potential use case). The details of these presets can be found
here.
Create an SDF template from our reference file:
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools format \
-o /data/hs37d5.sdf /data/hs37d5.fa
Create a PED file $DIR/trio.ped
that looks like this (with the sample name
of the trio):
#PED format pedigree
#
#fam-id/ind-id/pat-id/mat-id: 0=unknown
#sex: 1=male; 2=female; 0=unknown
#phenotype: -9=missing, 0=missing; 1=unaffected; 2=affected
#
#fam-id ind-id pat-id mat-id sex phen
1 Sample_Diag-excap51-HG002-EEogPU Sample_Diag-excap51-HG003-EEogPU Sample_Diag-excap51-HG004-EEogPU 1 0
1 Sample_Diag-excap51-HG003-EEogPU 0 0 1 0
1 Sample_Diag-excap51-HG004-EEogPU 0 0 2 0
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools mendelian \
-i /data/deepvariant.cohort.vcf.gz \
-o /data/deepvariant.annotated.vcf.gz \
--pedigree=/data/trio.ped \
-t /data/hs37d5.sdf \
| tee ${DIR}/deepvariant.input_rtg_output.txt
The output is:
Checking: /data/deepvariant.cohort.vcf.gz
Family: [Sample_Diag-excap51-HG003-EEogPU + Sample_Diag-excap51-HG004-EEogPU] -> [Sample_Diag-excap51-HG002-EEogPU]
Concordance Sample_Diag-excap51-HG002-EEogPU: F:56568/57063 (99.13%) M:56963/57085 (99.79%) F+M:56331/56958 (98.90%)
Sample Sample_Diag-excap51-HG002-EEogPU has less than 99.0 concordance with both parents. Check for incorrect pedigree or sample mislabelling.
821/57304 (1.43%) records did not conform to expected call ploidy
57213/57304 (99.84%) records were variant in at least 1 family member and checked for Mendelian constraints
199/57213 (0.35%) records had indeterminate consistency status due to incomplete calls
636/57213 (1.11%) records contained a violation of Mendelian constraints
From this report, we know that there is a 1.13% Mendelian violation rate, and
0.34% of the records had incomplete calls (with .
) so RTG couldn't determine
whether there is violation or not.
In addition to the cohort quality statistics, for completeness we generate single-sample quality metrics.
We run bcftools stats
on the 3 VCF outputs. Since our DeepVariant run already
constrained to just the capture regions, no need to specify it again here. We
had to pass in the -f PASS
flag so that only the PASS calls are considered.
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run \
-v ${DIR}:${DIR} \
google/deepvariant:${VERSION} \
bcftools stats -f PASS \
${DIR}/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.stats
done
Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) |
---|---|---|---|---|---|---|
HG002 | 29817 | 11626 | 2.56 | 29805 | 11608 | 2.57 |
HG003 | 29694 | 11675 | 2.54 | 29684 | 11653 | 2.55 |
HG004 | 29912 | 11803 | 2.53 | 29902 | 11785 | 2.54 |
If you want to restrict to the truth BED files, use this command:
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run \
-v ${DIR}:${DIR} \
google/deepvariant:${VERSION} \
bcftools stats -f PASS \
-T ${DIR}/${SAMPLE}_truth.bed \
${DIR}/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.with_truth_bed.stats
done
Which resulted in this table:
Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) |
---|---|---|---|---|---|---|
HG002 | 27688 | 10538 | 2.63 | 27680 | 10524 | 2.63 |
HG003 | 27331 | 10504 | 2.60 | 27326 | 10492 | 2.60 |
HG004 | 27470 | 10596 | 2.59 | 27463 | 10584 | 2.59 |
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools vcfstats \
/data/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.vcfstats
done
which shows the following:
HG002:
Location : /data/HG002.vcf.gz
Failed Filters : 14785
Passed Filters : 45073
SNPs : 41413
MNPs : 0
Insertions : 1859
Deletions : 1781
Indels : 18
Same as reference : 2
SNP Transitions/Transversions: 2.56 (41695/16270)
Total Het/Hom ratio : 1.49 (26942/18129)
SNP Het/Hom ratio : 1.51 (24882/16531)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.09 (969/890)
Deletion Het/Hom ratio : 1.52 (1073/708)
Indel Het/Hom ratio : - (18/0)
Insertion/Deletion ratio : 1.04 (1859/1781)
Indel/SNP+MNP ratio : 0.09 (3658/41413)
HG003:
Location : /data/HG003.vcf.gz
Failed Filters : 15645
Passed Filters : 44937
SNPs : 41332
MNPs : 0
Insertions : 1852
Deletions : 1733
Indels : 18
Same as reference : 2
SNP Transitions/Transversions: 2.52 (41499/16451)
Total Het/Hom ratio : 1.48 (26788/18147)
SNP Het/Hom ratio : 1.49 (24738/16594)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.14 (987/865)
Deletion Het/Hom ratio : 1.52 (1045/688)
Indel Het/Hom ratio : - (18/0)
Insertion/Deletion ratio : 1.07 (1852/1733)
Indel/SNP+MNP ratio : 0.09 (3603/41332)
HG004:
Location : /data/HG004.vcf.gz
Failed Filters : 15379
Passed Filters : 45317
SNPs : 41684
MNPs : 0
Insertions : 1863
Deletions : 1751
Indels : 19
Same as reference : 0
SNP Transitions/Transversions: 2.55 (41510/16309)
Total Het/Hom ratio : 1.56 (27649/17668)
SNP Het/Hom ratio : 1.59 (25567/16117)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.12 (986/877)
Deletion Het/Hom ratio : 1.60 (1077/674)
Indel Het/Hom ratio : - (19/0)
Insertion/Deletion ratio : 1.06 (1863/1751)
Indel/SNP+MNP ratio : 0.09 (3633/41684)
sudo docker pull jmcdani20/hap.py:v0.3.12
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run -i \
-v "${DIR}":"/data" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
"/data/${SAMPLE}_truth.vcf.gz" \
"/data/${SAMPLE}.vcf.gz" \
-f "/data/${SAMPLE}_truth.bed" \
-T "/data/${CAPTURE_BED}" \
-r "/data/hs37d5.fa" \
-o "/data/${SAMPLE}.happy.output" \
--engine=vcfeval \
--pass-only > ${DIR}/${SAMPLE}.stdout
done
Accuracy F1 scores:
Sample | Indel | SNP |
---|---|---|
HG002 | 0.970051 | 0.993609 |
HG003 | 0.967020 | 0.992973 |
HG004 | 0.973377 | 0.993597 |