-
Notifications
You must be signed in to change notification settings - Fork 1
GATK
We are following GATK Best practices described here: https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
This is adding something to the bam files, so we'll do this in your 3_bwa folder
Load java and samtools
spack load [email protected]_10%[email protected]
spack load [email protected]
Use picard tools to add read groups
java -jar /pickett_shared/software/picard-2.27.4/picard.jar AddOrReplaceReadGroups \
I=<your_read_set>_sorted.bam \
O=<your_read_set>_sorted.RG.bam \
RGSM=<your_read_set> \
RGLB=<your_read_set> \
RGPL=illumina \
RGPU=<your_read_set>
- I = input file (bam or sam)
- O = output file (sam or bam or cram)
- RGSM = Read-Group sample name Required.
- RGLB = Read-Group library Required.
- RGPL = Read-Group platform (e.g. ILLUMINA, SOLID) Required.
- RGPU = Read-Group platform unit (eg. run barcode) Required.
[Lots more information on this step] (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671?id=6472)
We are cheating by using sample name for barcode, because it does not matter in our downstream analysis
Check size. If all looks good, you can now remove the original bam file.
Now index
samtools index <your_read_set>_sorted.RG.bam
An example loop to process all samples:
for f in *_sorted.bam
do
BASE=$( basename $f | sed 's/_sorted.bam*//g' )
echo "BASE $BASE"
java -jar /pickett_shared/software/picard-2.27.4/picard.jar \
AddOrReplaceReadGroups \
I=${BASE}_sorted.bam \
O=${BASE}_sorted.RG.bam \
RGSM=$BASE \
RGLB=$BASE \
RGPL=illumina \
RGPU=$BASE
samtools index ${BASE}_sorted.RG.bam
done
GATK4 has a combined tool to mark and sort duplicates called MarkDuplicateSpark tool. The sorts and marks duplicates based on query name and is different from Picard MarkDuplicate tool as it is coordinate sorted. * We will not mark dups because of the data type*.
mkdir 4_gatk
cd 4_gatk
# Create a links to BAM file with read groups, its index, and the reference genome
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam .
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam.bai .
ln -s /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta/genome/UNIL_Sinv_3.0.fasta
Prepare Reference for GATK
/pickett_shared/software/gatk-4.2.6.1/gatk CreateSequenceDictionary -R UNIL_Sinv_3.0.fasta
samtools faidx UNIL_Sinv_3.0.fasta
Outputs a .dict file
Call SNPs and indels on a single sample.
/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
-R UNIL_Sinv_3.0.fasta \
-I <your_read_set>_sorted.RG.bam \
-O <your_read_set>.vcf
Uses 4 threads by default.
But what we actually want to do is call the SNPs/indels using GVCF mode. This will allow us to later recall all the SNPs/indels using information combined across samples. We will also request a bam file with the realigned reads to view.
/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
-R UNIL_Sinv_3.0.fasta \
-I <your_read_set>_sorted.RG.bam \
-O <your_read_set>.g.vcf \
-ERC GVCF \
-bamout <your_read_set>_sorted.RG.realigned.bam
Compare the .vcf and the .g.vcf files. How are they different?
Copy your gvcf output file to a common directory
cp <your_read_set>.g.vcf
/pickett_shared/teaching/EPP622_Fall2022/analysis/GCVFs
Example for loop:
for f in *_sorted.RG.bam
do
BASE=$( basename $f | sed 's/_sorted.RG.bam*//g' )
echo "BASE $BASE"
/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
-R UNIL_Sinv_3.0.fasta \
-I ${BASE}_sorted.RG.bam \
-O ${BASE}.g.vcf \
-ERC GVCF \
-bamout ${BASE}_sorted.RG.realigned.bam
done
Copy these files to your computer
- reference genome
- reference genome index (*fai)
- <your_read_set>_sorted.RG.bam
- <your_read_set>_sorted.RG.bam.bai
- <your_read_set>_sorted.RG.realigned.bam
- <your_read_set>_sorted.RG.realigned.bam.bai
- <your_read_set>.vcf
View on the online IGV.
- Are the SNPs in the vcf file supported by read evidence?
- Can you find an example of read realignment by GATK?
mkdir 5_gatk_combine
cd 5_gatk_combine/
ln -s /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta/genome/UNIL_Sinv_3.0.fasta
ln -s ../4_gatk/UNIL_Sinv_3.0.fasta.fai
ln -s ../4_gatk/UNIL_Sinv_3.0.dict .
Lets look at how many samples we have.
ls /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs
We are going to use CombineGVCFs. But if you have a very large study, you should consider GenomicsDBImport.
CombineGVCFs wants every individual gvcf listed in a specific way with its own parameter flag. Lets write a for loop to generate that text
ln -s /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs/*g.vcf .
echo "/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs \\
-R UNIL_Sinv_3.0.fasta \\"
for f in *g.vcf
do
echo "-variant $f \\"
done
echo "-O solenopsis_combined.g.vcf.gz"
That should give you something similar to
/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs \
-R UNIL_Sinv_3.0.fasta \
-variant SRR6922148.g.vcf \
-variant SRR6922194.g.vcf \
-variant SRR6922233.g.vcf \
-variant SRR6922241.g.vcf \
-variant SRR6922276.g.vcf \
-variant SRR6922278.g.vcf \
-variant SRR6922291.g.vcf \
-variant SRR6922294.g.vcf \
-variant SRR6922306.g.vcf \
-variant SRR6922308.g.vcf \
-variant SRR6922309.g.vcf \
-variant SRR6922311.g.vcf \
-variant SRR6922314.g.vcf \
-variant SRR6922315.g.vcf \
-variant SRR6922316.g.vcf \
-variant SRR6922318.g.vcf \
-variant SRR6922319.g.vcf \
-variant SRR6922320.g.vcf \
-variant SRR6922321.g.vcf \
-variant SRR6922354.g.vcf \
-variant SRR6922399.g.vcf \
-variant SRR6922446.g.vcf \
-variant SRR6922447.g.vcf \
-variant SRR6922448.g.vcf \
-variant SRR6922449.g.vcf \
-variant SRR6922451.g.vcf \
-variant SRR6922454.g.vcf \
-variant SRR6922470.g.vcf \
-O solenopsis_combined.g.vcf.gz
Run that. You should now have a solenopsis_combined.g.vcf.gz file. Lets recall all the variants
/pickett_shared/software/gatk-4.2.6.1/gatk --java-options "-Xmx10g" GenotypeGVCFs \
-R UNIL_Sinv_3.0.fasta \
-V solenopsis_combined.g.vcf.gz \
-O solenopsis_combined.vcf.gz
Lets look at how many snps we have
spack load bcftools
bcftools stats solenopsis_combined.vcf.gz > solenopsis_combined.vcf.stats.txt
These SNPs are unfiltered and probably have many false positives. The next recommended step is variant quality score recalibration (VQSR), however, that depends on high-quality sets of known variants to use as training. We don't have that, so we will do some hard threshold filtering. Lets start with a basic depth and quality value filtering
/pickett_shared/software/gatk-4.2.6.1/gatk VariantFiltration \
-R UNIL_Sinv_3.0.fasta \
-V solenopsis_combined.vcf.gz \
-O solenopsis_combined.Basicfilters.vcf.gz \
-filter-name "basic_filter" -filter "QUAL > 20.0 && DP > 5"
Here are recommended filtering parameters from GATK with more explanation here. We don't have time but you would ideally split out indels and snps, then apply independent filters as described in the article.
/pickett_shared/software/gatk-4.2.6.1/gatk VariantFiltration \
-R UNIL_Sinv_3.0.fasta \
-V solenopsis_combined.vcf.gz \
-O solenopsis_combined.GATKfilters.vcf.gz \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 60.0" \
-filter-name "MQ_filter" -filter "MQ < 40.0" \
-filter-name "SOR_filter" -filter "SOR > 4.0" \
-filter-name "MQRankSum" -filter "MQRankSum < -12.5" \
-filter-name "ReadPosRankSum" -filter "ReadPosRankSum < -8.0"
The latter two give a warning, but its still working fine according to the GATK forum.
- QD = QualByDepth
- FS = FisherStrand - filter for strand bias
- SOR = StrandOddsRatio
- MQ = RMSMappingQuality
- MQRankSum = MappingQualityRankSumTest
- ReadPosRankSum
Use bcftools stats to look at your filtered files. how are they different from the unfiltered and from each ohter? The tool only marks sites that do not pass, it does not remove them from the file. So lets get stats on only passing variants
bcftools stats -f "PASS,." solenopsis_combined.vcf.gz >solenopsis_combined.vcf.stats.txt
bcftools stats -f "PASS,." solenopsis_combined.Basicfilters.vcf.gz >solenopsis_combined.Basicfilters.vcf.stats.txt
bcftools stats -f "PASS,." solenopsis_combined.GATKfilters.vcf.gz >solenopsis_combined.GATKfilters.vcf.stats.txt
grep 'number of SNPs:' *stats.txt