
Lab: DNA I Read QC and trimming

Meg Staton edited this page Sep 19, 2022 · 7 revisions

1. Finding and assigning fastq data files

The S. invicta RAD data we are using for the labs comes from this project. We have already downloaded all the reads, which can be found here: /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta. See the table below: each student has been assigned one read accession for the duration of the DNA I unit.

Student                Read Accession  Total # Reads  Location
Jamie Alumbaugh        SRR6922148      1281077        Oglethorpe Co, GA
Cassidy Catrett        SRR6922294      1405770        Oglethorpe Co, GA
Jenna Demeter          SRR6922306      1541579        Oglethorpe Co, GA
Presley Dowker         SRR6922308      2987905        Oglethorpe Co, GA
Nick Gill              SRR6922451      1731620        Oglethorpe Co, GA
Axel Gonzalez Murillo  SRR6922454      2897067        Oglethorpe Co, GA
Rebecca Butler         SRR6922449      3649330        Oglethorpe Co, GA
Eunice Omondi          SRR6922311      258861         Oglethorpe Co, GA
Madison Henniger       SRR6922354      1344896        Pascagoula, MS
Katelin Hubbard        SRR6922399      1091567        Pascagoula, MS
Harleen Kaur           SRR6922194      1100979        Alejandra, Argentina
Peitong Li             SRR6922233      1216598        Alejandra, Argentina
Leif Majeres           SRR6922241      1148981        Alejandra, Argentina
Dillon McCallum        SRR6922315      1017635        Alejandra, Argentina
Ruwaa Mohamed          SRR6922318      1592618        Alejandra, Argentina
Shade Niece            SRR6922319      2199106        Alejandra, Argentina
Beatrice Caiado        SRR6922321      1696637        Alejandra, Argentina
Rachel Robrecht        SRR6922446      991054         Alejandra, Argentina
Raj Roy                SRR6922447      1957684        Alejandra, Argentina
Rebecca Smith          SRR6922448      1053520        Alejandra, Argentina
Jennifer Chandler      SRR6922316      913465         Alejandra, Argentina
Garrett Franklin       SRR6922314      533098         Alejandra, Argentina
Triston Walsh          SRR6922320      683277         Alejandra, Argentina
Zane Smith             SRR6922276      1111463        El Recreo, Argentina
Peter Tandy            SRR6922278      1223105        El Recreo, Argentina
Bryce Trull            SRR6922291      1154914        El Recreo, Argentina
Jackson Turner         SRR6922309      1771519        El Recreo, Argentina
Instructor             SRR6922470      1548618        El Recreo, Argentina
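
If you want to check the read total listed for your accession, one rough approach is to count the lines in the fastq file and divide by four. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines); `count_reads` and the tiny demo file are made up for illustration and are not part of the course materials.

```shell
# Count reads in a FASTQ file, assuming exactly four lines per record
# (header, sequence, '+', quality) with no wrapped sequence lines.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Tiny made-up FASTQ file (two reads) so the sketch runs anywhere.
demo_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo_fastq"

count_reads "$demo_fastq"
```

On sphinx you would pass the path to your own fastq file instead of the demo file.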

2. Setting up a personal directory

Go to the analysis directory within the EPP 622 course directory:

/pickett_shared/teaching/EPP622_Fall2022/analysis

...and make a personal analysis folder. For example:

mkdir <your user id goes here>
cd <your user id goes here>

3. Running fastqc

Now, let's make a directory where we will run fastqc:

mkdir 1_fastqc
cd 1_fastqc

We can create a soft link (symbolic link) to the raw data:

ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .
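
A relative link with a typo in its path is still created without complaint, so it can be worth confirming that the link resolves. The following is a minimal sketch using stand-in directories and a made-up file name (`sample_1.fastq`); on sphinx the same two tests would be run against your own link.

```shell
# Stand-in directories so the sketch runs anywhere; on sphinx the target
# would live in the course raw_data directory instead.
demo=$(mktemp -d)
mkdir -p "$demo/raw_data" "$demo/analysis/1_fastqc"
printf '@r1\nACGT\n+\nIIII\n' > "$demo/raw_data/sample_1.fastq"

cd "$demo/analysis/1_fastqc"
ln -s ../../raw_data/sample_1.fastq .

# -L is true for a symbolic link; -e follows the link and is true only
# if the target exists, so a mistyped path shows up as "broken".
if [ -L sample_1.fastq ] && [ -e sample_1.fastq ]; then
    link_status="ok"
else
    link_status="broken"
fi
echo "$link_status"

# ls -l shows where the link points.
ls -l sample_1.fastq
```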

Let's load fastqc:

spack load fastqc

Let's run the program now. Since we are all sharing the same computing resource, we will each run fastqc on just one forward-read fastq file:

fastqc <your subset>.fastq

This program outputs results in .zip and .html formats. We can't inspect them on sphinx, so we'll need to copy them to our own devices. Run the following from a terminal on your own computer, not on sphinx:

scp '<your_username>@sphinx.ag.utk.edu:/pickett_shared/teaching/EPP622_Fall2022/analysis/<your_username>/1_fastqc/*html' ./
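
The .zip output also contains a plain-text summary.txt with one tab-separated line per module (status, module name, file name), which can be read in a terminal. The sketch below lists the modules that did not pass. `flagged_modules` is a hypothetical helper and the summary lines are made up for illustration.

```shell
# Print only the FastQC modules that did not pass.
# Expects summary.txt content on stdin: STATUS<TAB>MODULE<TAB>FILE.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Made-up summary lines for illustration.
demo_summary=$(printf 'PASS\tBasic Statistics\tsample_1.fastq\nWARN\tPer base sequence content\tsample_1.fastq\nFAIL\tAdapter Content\tsample_1.fastq\n')

printf '%s\n' "$demo_summary" | flagged_modules
```

Assuming unzip is available, something like `unzip -p <your subset>_fastqc.zip '*/summary.txt' | flagged_modules` should feed a real report into the same helper.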

4. Running Skewer

skewer github

Skewer is a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. Its features include detecting and removing adapter sequences and trimming sequences based on Phred quality scores. Now, go to your personal analysis directory and make a new directory:

cd /pickett_shared/teaching/EPP622_Fall2022/analysis/<your user id goes here>
mkdir 2_skewer
cd 2_skewer

Soft link the raw data files here, too (the space is free!)

ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .

Skewer is installed locally on sphinx, so we won't have to use Spack to load it this time.

/sphinx_local/software/skewer/skewer -t 2 -l 95 -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG -Q 30 <your subset>.fastq -o <outfile name>

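-t is the number of threads used by this command
-l is the minimum length a read must have after trimming to be kept in our analyses
-x is the adapter sequence to be trimmed from the reads
-Q is the minimum mean quality score (Phred score) of the sequence (across the entire read length)
-o is the base name (prefix) for the output files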
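
To make the mean quality threshold concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q 30 means roughly 1 error in 1000 bases. The sketch below computes the arithmetic mean Phred score of each read in a FASTQ file, assuming Phred+33 encoding and four-line records. It illustrates the arithmetic only and is not a description of how Skewer computes the value internally; `mean_quality` and the demo reads are made up.

```shell
# Mean Phred score per read, assuming Phred+33 quality encoding and
# four-line FASTQ records. Prints one value per read.
mean_quality() {
    awk '
        # Build a character -> ASCII code lookup for printable characters.
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line is a quality string.
        NR % 4 == 0 && length($0) > 0 {
            total = 0
            n = length($0)
            for (i = 1; i <= n; i++) total += ord[substr($0, i, 1)] - 33
            printf "%.1f\n", total / n
        }
    ' "$1"
}

# Made-up reads: "I" encodes Phred 40, "5" encodes Phred 20.
demo_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n5555\n' > "$demo_fastq"

mean_quality "$demo_fastq"
```

In this demo the first read has a mean of 40.0 and would clear a threshold of 30, while the second has a mean of 20.0 and would not.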

Note: Here we use Q 30 as an illustrative example because the data are already very high quality. In some instances, Q 30 may be considered on the stricter end of trimming thresholds (see Del Fabbro et al. 2013).

Say you wanted to trim all the files using a for loop. Here is an example of how to do that:

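# Match only the raw forward reads so that trimmed output is not re-trimmed
for f in *_1.fastq
do
	# Strip the _1.fastq suffix to get the sample name
	BASE=$(basename "$f" _1.fastq)
	echo "$BASE"

	/sphinx_local/software/skewer/skewer \
	-t 2 -l 95 -Q 30 \
	-x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
	"$f" -o "$BASE"
done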
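
Before launching a loop like this on a shared machine, it can help to do a dry run that only prints the commands. The sketch below uses made-up file names and a hypothetical `sample_name` helper, and abbreviates the adapter as `<adapter>`; it shows the name handling, not the trimming itself.

```shell
# Hypothetical helper: strip the _1.fastq suffix to get the sample name.
sample_name() {
    printf '%s\n' "${1%_1.fastq}"
}

# Made-up file names; on sphinx you would loop over the real files.
for f in SRR0000001_1.fastq SRR0000002_1.fastq
do
    base=$(sample_name "$f")
    # echo makes this a dry run: the command is printed, not executed.
    echo "skewer -t 2 -l 95 -Q 30 -x <adapter> $f -o $base"
done
```

Once the printed commands look right, removing the echo (and restoring the full skewer path and adapter) turns the dry run into the real loop.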

5. Run fastqc on trimmed files

Now that we have trimmed our sequence file, let's check its quality using fastqc. Go back to your personal directory, make a new directory, and run fastqc on the trimmed file:

fastqc <your subset>-trimmed.fastq
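
As a quick sanity check alongside the fastqc report, you can confirm that no read in the trimmed file is shorter than the minimum length given to -l. This is a minimal sketch assuming four-line FASTQ records; `min_read_length` and the demo file are made up for illustration.

```shell
# Shortest read length in a FASTQ file, assuming four-line records.
min_read_length() {
    awk '
        NR % 4 == 2 {
            n = length($0)
            if (min == "" || n < min) min = n
        }
        END { print min + 0 }
    ' "$1"
}

# Made-up trimmed file with two reads of length 96 and 97.
demo_trimmed=$(mktemp)
awk 'BEGIN {
    for (r = 1; r <= 2; r++) {
        len = 95 + r; seq = ""; qual = ""
        for (i = 0; i < len; i++) { seq = seq "A"; qual = qual "I" }
        print "@r" r; print seq; print "+"; print qual
    }
}' > "$demo_trimmed"

min_read_length "$demo_trimmed"
```

With -l 95, a value of 95 or more from your own trimmed file is what you would expect to see.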