Skip to content
Meg Staton edited this page Sep 19, 2022 · 9 revisions

Add Read Groups to BAM file.

This is adding something to the bam files, so we'll do this in your 3_bwa folder

Load java and samtools

Use picard tools to add read groups

java -jar /pickett_shared/software/picard-2.27.4/picard.jar AddOrReplaceReadGroups \
	I=<your_read_set>_sorted.bam \
	O=<your_read_set>_sorted.RG.bam \
	RGSM=<your_read_set> \
	RGLB=<your_read_set> \
	RGPL=illumina \
	RGPU=<your_read_set>
  • I = input file (bam or sam)
  • O = output file (sam or bam or cram)
  • RGSM = Read-Group sample name Required.
  • RGLB = Read-Group library Required.
  • RGPL = Read-Group platform (e.g. ILLUMINA, SOLID) Required.
  • RGPU = Read-Group platform unit (eg. run barcode) Required.

[Lots more information on this step] (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671?id=6472)

We are cheating by using sample name for barcode, because it does not matter in our downstream analysis

Check size. If all looks good, you can now remove the original bam file.

Now index

samtools index <your_read_set>_sorted.RG.bam

An example loop to process all samples:

for f in *_sorted.bam
do
        BASE=$( basename $f | sed 's/_sorted.bam*//g' )
        echo "BASE $BASE"
        
	java -jar /pickett_shared/software/picard-2.27.4/picard.jar \
		AddOrReplaceReadGroups \
		I=${BASE}_sorted.bam \
		O=${BASE}_sorted.RG.bam \
		RGSM=$BASE \
		RGLB=$BASE \
		RGPL=illumina \
		RGPU=$BASE
	samtools index ${BASE}_sorted.RG.bam
done

GATK4 has a combined tool to mark and sort duplicates called MarkDuplicateSpark tool. The sorts and marks duplicates based on query name and is different from Picard MarkDuplicate tool as it is coordinate sorted. * We will not mark dups because of the data type*.

GATK

mkdir 4_gatk
cd 4_gatk

# Create a links to BAM file with read groups, its index, and the reference genome
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam .
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam.bai .
ln -s /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta/genome/UNIL_Sinv_3.0.fasta

Prepare Reference for GATK

/pickett_shared/software/gatk-4.2.6.1/gatk CreateSequenceDictionary -R UNIL_Sinv_3.0.fasta
samtools faidx UNIL_Sinv_3.0.fasta

Outputs a .dict file

Call SNPs and indels on a single sample.

/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
        -R UNIL_Sinv_3.0.fasta \
        -I <your_read_set>_sorted.RG.bam \
        -O <your_read_set>.vcf

Uses 4 threads by default.

But what we actually want to do is call the SNPs/indels using GVCF mode. This will allow us to later recall all the SNPs/indels using information combined across samples. We will also request a bam file with the realigned reads to view.

/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
        -R UNIL_Sinv_3.0.fasta \
        -I <your_read_set>_sorted.RG.bam \
        -O <your_read_set>.g.vcf \
        -ERC GVCF \
        -bamout <your_read_set>_sorted.RG.realigned.bam

Compare the .vcf and the .g.vcf files. How are they different?

Copy your gvcf output file to a common directory

cp <your_read_set>.g.vcf 
/pickett_shared/teaching/EPP622_Fall2022/analysis/GCVFs

Example for loop:

for f in *_sorted.RG.bam
do
        BASE=$( basename $f | sed 's/_sorted.RG.bam*//g' )
        echo "BASE $BASE"

        /pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
        -R UNIL_Sinv_3.0.fasta \
        -I ${BASE}_sorted.RG.bam \
        -O ${BASE}.g.vcf \
        -ERC GVCF \
        -bamout ${BASE}_sorted.RG.realigned.bam

done

IGV

Copy these files to your computer

  • reference genome
  • reference genome index (*fai)
  • <your_read_set>_sorted.RG.bam
  • <your_read_set>_sorted.RG.bam.bai
  • <your_read_set>_sorted.RG.realigned.bam
  • <your_read_set>_sorted.RG.realigned.bam.bai
  • <your_read_set>.vcf

View on the online IGV.

  • Are the SNPs in the vcf file supported by read evidence?
  • Can you find an example of read realignment by GATK?

Combine and recall SNPs across the whole group of samples

mkdir 5_gatk_combine
cd 5_gatk_combine/

Lets look at how many samples we have.

ls /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs

We are going to use CombineGVCFs. But if you have a very large study, you should consider GenomicsDBImport.

CombineGVCFs wants every individual gvcf listed in a specific way with its own parameter flag. Lets write a for loop to generate that text

ln -s /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs/*g.vcf .

echo "/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs \\
        -R UNIL_Sinv_3.0.fasta \\"

for f in *g.vcf 
do
        echo "-variant $f \\"
done

echo "-O solenopsis_combined.g.vcf.gz"

That should give you something similar to


Clone this wiki locally