Skip to content

Test 3 Question 1

Meg Staton edited this page Dec 11, 2020 · 2 revisions

test3 - q1_salmon

STAR - index and map

Indexing failed with 2 cores, get 3:

qsub -I -A ACF-UTK0138 -q debug -l walltime=1:00:00,nodes=1:ppn=3,advres=epp622.13776065

Index command:

/lustre/haven/proj/UTK0138/software/STAR-2.7.6a/bin/Linux_x86_64/STAR \
--runMode genomeGenerate \
--genomeDir STAR_idx \
--genomeFastaFiles Ppersica_298_v2.0.fa \
--runThreadN 1 \
--sjdbGTFfile Ppersica_298_v2.1.gene_exons.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 50

Make filenames file:

ls ../*fq > filenames.txt

Run qsub task array job:

#PBS -S /bin/bash
#PBS -A ACF-UTK0138
#PBS -l nodes=1:ppn=2,walltime=01:00:00
#PBS -l advres=epp622.13776065
#PBS -N rna-map
#PBS -m abe
#PBS -M [email protected]
#PBS -t 1-6

cd $PBS_O_WORKDIR

infile=$(sed -n -e "${PBS_ARRAYID} p" filenames.txt)

samplename=$(basename $infile | sed 's/.fq//')

/lustre/haven/proj/UTK0138/software/STAR-2.7.6a/bin/Linux_x86_64/STAR \
--genomeDir ../STAR_idx \
--readFilesIn $infile \
--runThreadN 2 \
--outFileNamePrefix $samplename \
--outSAMtype BAM SortedByCoordinate

Count:

module load htseq
module load pysam

for f in *bam
do
    samplename=$(basename $f | sed 's/Aligned.sortedByCoord.out.bam//g')
    echo $base
    htseq-count \
    --format=bam \
    --order=pos \
    --stranded=no \
    --type=gene \
    --idattr=ID \
    $f \
    ../Ppersica_298_v2.1.gene_exons.gff3 \
    >${samplename}.counts.txt
done

Salmon

Index the transcripts:

/lustre/haven/proj/UTK0138/software/salmon-latest_linux_x86_64/bin/salmon \
index \
-t Ppersica_298_v2.1.transcript.fa \
-i salmon_idx

Map:

#PBS -S /bin/bash
#PBS -A ACF-UTK0138
#PBS -l nodes=1:ppn=2,walltime=00:20:00
#PBS -l advres=epp622.13776065
#PBS -N salmon-map
#PBS -m abe
#PBS -M [email protected]
#PBS -t 1-6

cd $PBS_O_WORKDIR

infile=$(sed -n -e "${PBS_ARRAYID} p" filenames.txt)

samplename=$(basename $infile | sed 's/.fq//')

/lustre/haven/proj/UTK0138/software/salmon-latest_linux_x86_64/bin/salmon \
quant \
-i ../salmon_idx \
-l A \
-r $infile \
--validateMappings \
-p 2 \
-o ${samplename}_transcripts_quant

Salmon doesn't give a direct, easy to find count of uniquely mapped and multi mapped reads. So first, get total mapped reads from the file lib_format_counts.json. It'll be a line like this:

    "num_assigned_fragments": 539298,

Next find the number of uniquely mapped fragments. Get it by adding up the first number on each line in the file ./aux_info/ambig_info.tsv. Python code for this:

import sys
import re

file = sys.argv[1]
fileH = open(file, "r")

total_uniq = 0

for line in fileH:
	m = re.search("^\d+", line)
	if m:
		uniq_cnt = m.group()
		total_uniq += int(uniq_cnt)

print("Total uniq " + str(total_uniq))

Subtract the unique count from the total count to get the multimapped count.