Skip to content

RNASeq Set Up and Read Trimming

Meg Staton edited this page Nov 1, 2020 · 13 revisions

Log into the ACF

Remember how to do that?

Project set up

Go to your class directory

cd /lustre/haven/proj/UTK0138/username
mkdir apricot_rnaseq
cd apricot_rnaseq

create our project directory structure

mkdir raw_data
mkdir analysis
cd raw_data

Create symbolic links to the raw data. We'll also need the peach genome (we're actually only going to use chromosome 1)

ln -s /lustre/haven/proj/UTK0138/apricot_data/fq* .
ln -s /lustre/haven/proj/UTK0138/apricot_data/adapters.fasta .
ln -s /lustre/haven/proj/UTK0138/apricot_data/Ppersica_298* .

Examine the files

  • How many sequences in each fasta and fastq file?
  • What is a gff3?

Trimming

We're going to use sickle (I know I said cutadapt in the lecture but I didn't want to mess with everyone having to install with conda).

Make a new directory, 1_trimming, in your analysis folder.

Lets try one file.

/lustre/haven/proj/UTK0138/software/sickle-1.33/sickle se \
-f ../../raw_data/LBtree10-400CH.fq \
-t sanger \
-o LBtree10-400CH.trimmed.fq

(Why are we using "sanger" instead of "Illumina" for -t? Because of the weird fastq encoding issues from 8-10 years ago. Any new Illumina data actually has Sanger encoding now).

It conveniently gives us a report on the number of reads discarded. But what if we want to know the number of bases discarded? Python to the rescue!

Python script for stats

Lets set up a quick biopython script to count the sequences

from Bio import SeqIO
import re
import sys

file = sys.argv[1]

seq_count = 0

for seq_record in SeqIO.parse(file, "fastq"):
        seq_count += 1

print("The number of sequences is: " + str(seq_count))

Run it on our new file

module load biopython/1.72
python fastq_stats.py LBtree10-400CH.trimmed.fq

We can add a base count and average length

from Bio import SeqIO
import re
import sys

file = sys.argv[1]

seq_count = 0
base_count = 0

for seq_record in SeqIO.parse(file, "fastq"):
	seq_count += 1
	base_count += len(str(seq_record.seq))

print("The number of sequences is: " + str(seq_count))
print("The number of bases is: " + str(base_count))
average_len = base_count/seq_count
print("The average bases per sequence is: " + str(average_len))

And run on the original and the new file to see how they have changed.

python fastq_stats.py LBtree10-400CH.trimmed.fq
python fastq_stats.py ../../raw_data/LBtree10-400CH.fq

Loop over all files

We have 6 files to run, so we need to figure out some automation. (Could we run 6 samples one by one? Sure. But the original project published in Yu et al 2020 had 130 samples! So in real projects you likely want to automate each step to run for all the files.)

There are a lot of options and I want to spend some time exploring all of them.

  1. If you have an interactive session (or your own Linux machine), we can write a shell script with a for loop
  2. We can write a basic qsub script that uses the for loop
  3. What if there are still too many samples to run in one allocation? We can redo the qsub script with a task array. We'll learn that on Wednesday!

Starting with the for loop, in trim.sh:

for f in ../../raw_data/*bam
do
    echo $f
done

Build up to get to the sample name

for f in ../../raw_data/*fq
do
    echo $f
    sample=$(basename $f | sed 's/.fq//')
    echo $sample
done

Now we can build the sickle command

for f in ../../raw_data/*fq
do
    sample=$(basename $f | sed 's/.fq//')
    echo $sample

    echo "/lustre/haven/proj/UTK0138/software/sickle-1.33/sickle se \
    -f $f \
    -t sanger \
    -o $sample.trimmed.fq"
done

We could remove the echo and run it. But what if we want to qsub? Lets build that instead.

Basic Qsub

mv trim.sh trim.qsh

Add stuff at the top:

#PBS -S /bin/bash
#PBS -A ACF-UTK0138
#PBS -l nodes=1:ppn=2,walltime=01:00:00
#PBS -l advres=epp622.13776065
#PBS -N rna-trim
#PBS -m abe
#PBS -M [email protected]

cd $PBS_O_WORKDIR

for f in ../../raw_data/*fq
do
    sample=$(basename $f | sed 's/.fq//')
    echo $sample

    /lustre/haven/proj/UTK0138/software/sickle-1.33/sickle se \
    -f $f \
    -t sanger \
    -o $sample.trimmed.fq
done

Now we can qsub it:

qsub trim.qsh

Check on the status with qsub.

Do we have output files?

More python

What if we wanted to produce a nice table for your paper with all the stats in one table. How can we modify our python script to do that?

First, let's learn how to take a list of files from the command line and iterate over them. Put in test.py:

import sys

for f in sys.argv[1:]:
        print(f)

To run:

python test.py *fq

Neat! Now we can expand our original fastq_stats.py:

from Bio import SeqIO
import re
import sys

def get_stats(file_name):
        seq_count = 0
        base_count = 0
        for seq_record in SeqIO.parse(file_name, "fastq"):
                seq_count += 1
                base_count += len(str(seq_record.seq))
        return seq_count, base_count

print("File Name\t# Seqs\t# Bases\tAvg Seq Len")

for trimmed_file in sys.argv[1:]:

        seqs, bases = get_stats(f)

        print( f + "\t", end = "")
        print(str(seqs) + "\t", end = "")
        print(str(bases) + "\t", end = "")
        average_len = bases/seqs
        print(str(average_len))

And if we want to get really fancy, we can use our regex's to find the original file, then create a table with both the original and after trimming stats. (This uses re.sub which is a new use of regular expressions - find a pattern and replace it with something new.)

from Bio import SeqIO
import re
import sys

def get_stats(file_name):
	seq_count = 0
	base_count = 0
	for seq_record in SeqIO.parse(file_name, "fastq"):
		seq_count += 1
		base_count += len(str(seq_record.seq))
	return seq_count, base_count

print("File Name\tOrig # Seqs\tOrig # Bases\tOrig Avg Seq Len\tTrimmed # Seqs\tTrimmed # Bases\tTrimmed Avg Seq Len")

for trimmed_file in sys.argv[1:]:

	orig_file = re.sub(r".trimmed", '', trimmed_file)
	#print(trimmed_file)
	#print(orig_file)

	orig_seqs, orig_bases = get_stats("/lustre/haven/proj/UTK0138/apricot_data/"+orig_file)
	trimmed_seqs, trimmed_bases = get_stats(f)

	print( orig_file + "\t", end = "")
	print(str(orig_seqs) + "\t", end = "")
	print(str(orig_bases) + "\t", end = "")
	orig_average_len = bases/seqs
	print(str(orig_average_len) + "\t", end="")

	print(str(trimmed_seqs) + "\t", end = "")
	print(str(trimmed_bases) + "\t", end = "")
	trimmed_average_len = bases/seqs
	print(str(trimmed_average_len))