-
Notifications
You must be signed in to change notification settings - Fork 24
RNASeq Set Up and Read Trimming
Remember how to do that?
Go to your class directory
cd /lustre/haven/proj/UTK0138/username
mkdir apricot_rnaseq
cd apricot_rnaseq
create our project directory structure
mkdir raw_data
mkdir analysis
cd raw_data
Create symbolic links to the raw data. We'll also need the peach genome (we're actually only going to use chromosome 1)
ln -s /lustre/haven/proj/UTK0138/apricot_data/*fq .
ln -s /lustre/haven/proj/UTK0138/apricot_data/adapters.fasta .
ln -s /lustre/haven/proj/UTK0138/apricot_data/Ppersica_298* .
Examine the files
- How many sequences in each fasta and fastq file?
- What is a gff3?
We're going to use sickle (I know I said cutadapt in the lecture but I didn't want to mess with everyone having to install with conda).
Make a new directory, 1_trimming, in your analysis folder.
Lets try one file.
/lustre/haven/proj/UTK0138/software/sickle-1.33/sickle se \
-f ../../raw_data/LBtree10-400CH.fq \
-t sanger \
-o LBtree10-400CH.trimmed.fq
(Why are we using "sanger" instead of "Illumina" for -t? Because of the weird fastq encoding issues from 8-10 years ago. Any new Illumina data actually has Sanger encoding now).
It conveniently gives us a report on the number of reads discarded. But what if we want to know the number of bases discarded? Python to the rescue!
Lets set up a quick biopython script to count the sequences
from Bio import SeqIO
import re
import sys
file = sys.argv[1]
seq_count = 0
for seq_record in SeqIO.parse(file, "fastq"):
seq_count += 1
print("The number of sequences is: " + str(seq_count))
Run it on our new file
module load biopython/1.72
python fastq_stats.py LBtree10-400CH.trimmed.fq
We can add a base count and average length
from Bio import SeqIO
import re
import sys
file = sys.argv[1]
seq_count = 0
base_count = 0
for seq_record in SeqIO.parse(file, "fastq"):
seq_count += 1
base_count += len(str(seq_record.seq))
print("The number of sequences is: " + str(seq_count))
print("The number of bases is: " + str(base_count))
average_len = base_count/seq_count
print("The average bases per sequence is: " + str(average_len))
And run on the original and the new file to see how they have changed.
python fastq_stats.py LBtree10-400CH.trimmed.fq
python fastq_stats.py ../../raw_data/LBtree10-400CH.fq
We have 6 files to run, so we need to figure out some automation. (Could we run 6 samples one by one? Sure. But the original project published in Yu et al 2020 had 130 samples! So in real projects you likely want to automate each step to run for all the files.)
There are a lot of options and I want to spend some time exploring all of them.
- If you have an interactive session (or your own Linux machine), we can write a shell script with a for loop
- We can write a basic qsub script that uses the for loop
- What if there are still too many samples to run in one allocation? We can redo the qsub script with a task array. We'll learn that on Wednesday!
Starting with the for loop, in trim.sh:
for f in ../../raw_data/*fq
do
echo $f
done
Build up to get to the sample name
for f in ../../raw_data/*fq
do
echo $f
sample=$(basename $f | sed 's/.fq//')
echo $sample
done
Now we can build the sickle command
for f in ../../raw_data/*fq
do
sample=$(basename $f | sed 's/.fq//')
echo $sample
echo "/lustre/haven/proj/UTK0138/software/sickle-1.33/sickle se \
-f $f \
-t sanger \
-o $sample.trimmed.fq"
done
We could remove the echo and run it. But what if we want to qsub? Lets build that instead.
cp trim.sh trim.qsh
Add stuff at the top:
#PBS -S /bin/bash
#PBS -A ACF-UTK0138
#PBS -l nodes=1:ppn=2,walltime=01:00:00
#PBS -l advres=epp622.13776065
#PBS -N rna-trim
#PBS -m abe
#PBS -M [email protected]
cd $PBS_O_WORKDIR
for f in ../../raw_data/*fq
do
sample=$(basename $f | sed 's/.fq//')
echo $sample
/lustre/haven/proj/UTK0138/software/sickle-1.33/sickle se \
-f $f \
-t sanger \
-o $sample.trimmed.fq
done
Now we can qsub it:
qsub trim.qsh
Check on the status with qsub
.
Do we have output files?
What if we wanted to produce a nice table for your paper with all the stats in one table. How can we modify our python script to do that?
First, let's learn how to take a list of files from the command line and iterate over them. Put in test.py:
import sys
for f in sys.argv[1:]:
print(f)
To run:
python test.py *fq
Neat! Now we can expand our original fastq_stats.py:
from Bio import SeqIO
import re
import sys
def get_stats(file_name):
seq_count = 0
base_count = 0
for seq_record in SeqIO.parse(file_name, "fastq"):
seq_count += 1
base_count += len(str(seq_record.seq))
return seq_count, base_count
print("File Name\t# Seqs\t# Bases\tAvg Seq Len")
for trimmed_file in sys.argv[1:]:
seqs, bases = get_stats(f)
print( f + "\t", end = "")
print(str(seqs) + "\t", end = "")
print(str(bases) + "\t", end = "")
average_len = bases/seqs
print(str(average_len))
And if we want to get really fancy, we can use our regex's to find the original file, then create a table with both the original and after trimming stats. (This uses re.sub which is a new use of regular expressions - find a pattern and replace it with something new.)
from Bio import SeqIO
import re
import sys
def get_stats(file_name):
seq_count = 0
base_count = 0
for seq_record in SeqIO.parse(file_name, "fastq"):
seq_count += 1
base_count += len(str(seq_record.seq))
return seq_count, base_count
print("File Name\tOrig # Seqs\tOrig # Bases\tOrig Avg Seq Len\tTrimmed # Seqs\tTrimmed # Bases\tTrimmed Avg Seq Len")
for trimmed_file in sys.argv[1:]:
orig_file = re.sub(r".trimmed", '', trimmed_file)
#print(trimmed_file)
#print(orig_file)
orig_seqs, orig_bases = get_stats("/lustre/haven/proj/UTK0138/apricot_data/"+orig_file)
trimmed_seqs, trimmed_bases = get_stats(trimmed_file)
print( orig_file + "\t", end = "")
print(str(orig_seqs) + "\t", end = "")
print(str(orig_bases) + "\t", end = "")
orig_average_len = bases/seqs
print(str(orig_average_len) + "\t", end="")
print(str(trimmed_seqs) + "\t", end = "")
print(str(trimmed_bases) + "\t", end = "")
trimmed_average_len = bases/seqs
print(str(trimmed_average_len))