-
Notifications
You must be signed in to change notification settings - Fork 1
GATK
This is adding something to the bam files, so we'll do this in your 3_bwa folder
Load java and samtools
spack load [email protected]_10%[email protected]
spack load [email protected]
Use picard tools to add read groups
java -jar /pickett_shared/software/picard-2.27.4/picard.jar AddOrReplaceReadGroups \
I=<your_read_set>_sorted.bam \
O=<your_read_set>_sorted.RG.bam \
RGSM=<your_read_set> \
RGLB=<your_read_set> \
RGPL=illumina \
RGPU=<your_read_set>
- I = input file (bam or sam)
- O = output file (sam or bam or cram)
- RGSM = Read-Group sample name Required.
- RGLB = Read-Group library Required.
- RGPL = Read-Group platform (e.g. ILLUMINA, SOLID) Required.
- RGPU = Read-Group platform unit (eg. run barcode) Required.
[Lots more information on this step] (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671?id=6472)
We are cheating by using sample name for barcode, because it does not matter in our downstream analysis
Check size. If all looks good, you can now remove the original bam file.
Now index
samtools index <your_read_set>_sorted.RG.bam
An example loop to process all samples:
for f in *_sorted.bam
do
BASE=$( basename $f | sed 's/_sorted.bam*//g' )
echo "BASE $BASE"
java -jar /pickett_shared/software/picard-2.27.4/picard.jar \
AddOrReplaceReadGroups \
I=${BASE}_sorted.bam \
O=${BASE}_sorted.RG.bam \
RGSM=$BASE \
RGLB=$BASE \
RGPL=illumina \
RGPU=$BASE
samtools index ${BASE}_sorted.RG.bam
done
GATK4 has a combined tool to mark and sort duplicates called MarkDuplicateSpark tool. The sorts and marks duplicates based on query name and is different from Picard MarkDuplicate tool as it is coordinate sorted. * We will not mark dups because of the data type*.
mkdir 4_gatk
cd 4_gatk
# Create a links to BAM file with read groups, its index, and the reference genome
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam .
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam.bai .
ln -s /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta/genome/UNIL_Sinv_3.0.fasta
Prepare Reference for GATK
/pickett_shared/software/gatk-4.2.6.1/gatk CreateSequenceDictionary -R UNIL_Sinv_3.0.fasta
samtools faidx UNIL_Sinv_3.0.fasta
Outputs a .dict file
Call SNPs and indels on a single sample.
/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
-R UNIL_Sinv_3.0.fasta \
-I <your_read_set>_sorted.RG.bam \
-O <your_read_set>.vcf
Uses 4 threads by default.
But what we actually want to do is call the SNPs/indels using GVCF mode. This will allow us to later recall all the SNPs/indels using information combined across samples. We will also request a bam file with the realigned reads to view.
/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
-R UNIL_Sinv_3.0.fasta \
-I <your_read_set>_sorted.RG.bam \
-O <your_read_set>.g.vcf \
-ERC GVCF \
-bamout <your_read_set>_sorted.RG.realigned.bam
Compare the .vcf and the .g.vcf files. How are they different?
Copy your gvcf output file to a common directory
cp <your_read_set>.g.vcf
/pickett_shared/teaching/EPP622_Fall2022/analysis/GCVFs
Example for loop:
for f in *_sorted.RG.bam
do
BASE=$( basename $f | sed 's/_sorted.RG.bam*//g' )
echo "BASE $BASE"
/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
-R UNIL_Sinv_3.0.fasta \
-I ${BASE}_sorted.RG.bam \
-O ${BASE}.g.vcf \
-ERC GVCF \
-bamout ${BASE}_sorted.RG.realigned.bam
done
Copy these files to your computer
- reference genome
- reference genome index (*fai)
- <your_read_set>_sorted.RG.bam
- <your_read_set>_sorted.RG.bam.bai
- <your_read_set>_sorted.RG.realigned.bam
- <your_read_set>_sorted.RG.realigned.bam.bai
- <your_read_set>.vcf
View on the online IGV.
- Are the SNPs in the vcf file supported by read evidence?
- Can you find an example of read realignment by GATK?
mkdir 5_gatk_combine
cd 5_gatk_combine/
Lets look at how many samples we have.
ls /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs
We are going to use CombineGVCFs. But if you have a very large study, you should consider GenomicsDBImport.
CombineGVCFs wants every individual gvcf listed in a specific way with its own parameter flag. Lets write a for loop to generate that text
ln -s /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs/*g.vcf .
echo "/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs \\
-R UNIL_Sinv_3.0.fasta \\"
for f in *g.vcf
do
echo "-variant $f \\"
done
echo "-O solenopsis_combined.g.vcf.gz"
That should give you something similar to