Skip to content

Trimmomatic

Meg Staton edited this page Sep 14, 2020 · 14 revisions

Organize your directory

this lab needs 6Gb of RAM to run. (In the future consider not using trimmomatic! It uses more memory than it should. I couldn't get skewer installed but perhaps sickle would be a good option to try).

Qsub to the acf, then get an interactive session with a computational node.

qsub -I -A ACF-UTK0138 -q debug -l walltime=1:00:00,nodes=1:ppn=3

Check that you are on a compute node.

uname -a

Go to your personal project folder in the class directory. You should have an e_coli folder with the following structure.

New dir. Keep it clean and organized. Start from the analysis folder.

mkdir 2_trimmomatic
cd 2_trimmomatic

We'll need to symbolically link to our reads again

ln -fs ../../../../e_coli_data/*gz .
ls -l

Trimming

Software links:

While the ACF system has Trimmomatic available as a module, its a bit out of date. So I put an up-to-date copy for us to use in our class project directory. We'll need to use the full path to call this program.

Now lets run trimmomatic for both adapter removal and quality on our first pair of reads:

java -jar /lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE \
SRR2584863_1.fastq.gz \
SRR2584863_2.fastq.gz \
SRR2584863_1.trimmed.paired.fastq \
SRR2584863_1.trimmed.unpaired.fastq \
SRR2584863_2.trimmed.paired.fastq \
SRR2584863_2.trimmed.unpaired.fastq \
ILLUMINACLIP:/lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/adapters/NexteraPE-PE.fa:2:30:10:2:keepBothReads \
SLIDINGWINDOW:4:15 \
MINLEN:36 

Let's break that command down:

PE                                            # Paired end mode
SRR2584863_1.fastq.gz                         # forward read file
SRR2584863_2.fastq.gz                         # reverse read file
SRR2584863_1.trimmed.paired.fastq       # output - forward reads, trimmed and still part of a pair
SRR2584863_1.trimmed.unpaired.fastq     # output - forward reads, trimmed but not part of a pair
SRR2584863_2.trimmed.paired.fastq             # output - reverse reads, trimmed and still part of a pair
SRR2584863_2.trimmed.unpaired.fastq           # output - reverse reads, trimmed but not part of a pair
ILLUMINACLIP:/lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/adapters/NexteraPE-PE.fa:2:30:10:2:keepBothReads  
                                              # adapter to trim and keep reads even if there is adapter read through
SLIDINGWINDOW:4:15                            # trim bases if the average quality over a 4 base window is less than 15
MINLEN:36                                     # discard reads if they are less than 36 bases

Now is the time to save that command into a file called commands.sh.

So what did Trimmomatic do? What are some ways you can tell?

  • look at what it printed to the command line
  • count the sequences before and after
  • rerun fastQC and look at the detailed report

Lets do the next set of files and also save the output of the command to a file

java -jar /lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE \
SRR2584866_1.fastq.gz \
SRR2584866_2.fastq.gz \
SRR2584866_1.trimmed.paired.fastq \
SRR2584866_1.trimmed.unpaired.fastq \
SRR2584866_2.trimmed.paired.fastq \
SRR2584866_2.trimmed.unpaired.fastq \
ILLUMINACLIP:/lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/adapters/NexteraPE-PE.fa:2:30:10:2:keepBothReads SLIDINGWINDOW:4:15 \
MINLEN:36 \
> SRR2584866.trim.out

Don't forget to save that command into a file called commands.sh.

Why did it still output some lines to the terminal???

There are two types of output streams - standard out (stdout) and standard error (stderr). If you want to save stdout only, you use >. If you want to save both, you can use >&. Tons more info on stdout, stderr and stdin if you are interested

Lets try the >& on the third and final set of files.

java -jar /lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE \
SRR2589044_1.fastq.gz \
SRR2589044_2.fastq.gz \
SRR2589044_1.trimmed.paired.fastq \
SRR2589044_1.trimmed.unpaired.fastq \
SRR2589044_2.trimmed.paired.fastq \
SRR2589044_2.trimmed.unpaired.fastq \
ILLUMINACLIP:/lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/adapters/NexteraPE-PE.fa:2:30:10:2:keepBothReads SLIDINGWINDOW:4:15 \
MINLEN:36 \
>& SRR2589044.trim.out

Don't forget to save that command into a file called commands.sh.

Now we have managed to save the terminal output in a file.

cat SRR2589044.trim.out

Compare the before and after quality

You can now run fastqc on the paired, trimmed files and compare to the raw files.

module load fastqc
fastqc -t 2 -o . *.trimmed.paired.fastq

I've done this for you, and also run multiqc. Lets compare them side by side. to see the HTML, right click on the link, download the file, then open it by double clicking:

How to make a for loop for this

First, lets get our trimmed files out of the way, so we can work on a script to redo them. You can delete them or move them to a new directory called something like saved.

This task is a bit complicated. We need to take the input file name (for example, "SRR2589044_1.fastq.gz") , trim off the _1.fastq.gz part, then use the rest to make the output file names (for example, "SRR2589044_1.trimmed.paired.fastq"). How to do this?

We can use a command line utility called sed, which stands for "a stream editor". Sed does lots of cool stuff! We can use it for the above mentioned trimming.

Lets try a quick script to see an example. Open a script called sed.sh

nano sed.sh

Put this in the script

file="SRR2589044_1.fastq.gz"
echo $file

trimmedfile=$(echo "$file" | sed 's/_1.fastq.gz//')
echo $trimmedfile

Run like this:

bash sed.sh

We can see it works to alter the variable by trimming off .fastq.gz. Now let's build the for loop. Start small:

for f in *_1.fastq.gz
do
        echo $f
        basename=$(echo "$f" | sed 's/_1.fastq.gz//')
        echo $basename
done

Now add the trimming loop, but keep echoing instead of running:

for f in *_1.fastq.gz
do
        echo $f
        basename=$(echo "$f" | sed 's/_1.fastq.gz//')
        echo $basename

        echo "java -jar /lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE \
        ${basename}_1.fastq.gz \
        ${basename}_2.fastq.gz \
        ${basename}_1.trimmed.paired.fastq \
        ${basename}_1.trimmed.unpaired.fastq \
        ${basename}_2.trimmed.paired.fastq \
        ${basename}_2.trimmed.unpaired.fastq \
        ILLUMINACLIP:/lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/adapters/NexteraPE-PE.fa:2:30:10:2:keepBothReads SLIDINGWINDOW:4:15 \
        MINLEN:36 \
        >& ${basename}.trim.out" 
done

Once you like the output, actually run them (remove the echo):

for f in *_1.fastq.gz
do
        basename=$(echo "$f" | sed 's/_1.fastq.gz//')
        echo $basename

        java -jar /lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE \
        ${basename}_1.fastq.gz \
        ${basename}_2.fastq.gz \
        ${basename}_1.trimmed.paired.fastq \
        ${basename}_1.trimmed.unpaired.fastq \
        ${basename}_2.trimmed.paired.fastq \
        ${basename}_2.trimmed.unpaired.fastq \
        ILLUMINACLIP:/lustre/haven/proj/UTK0138/software/Trimmomatic-0.39/adapters/NexteraPE-PE.fa:2:30:10:2:keepBothReads SLIDINGWINDOW:4:15 \
        MINLEN:36 \
        >& ${basename}.trim.out
done