Lab: DNA I Read QC and trimming
The S. invicta RAD data we are using for the labs comes from this project. We have already downloaded all the reads, which can be found here: /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta
Please see the table below: every student has been assigned one read accession for the duration of the DNA I unit.
Student | Read Accession | Total # Reads | Location |
---|---|---|---|
Jamie Alumbaugh | SRR6922148 | 1281077 | Oglethorpe Co, GA |
Cassidy Catrett | SRR6922294 | 1405770 | Oglethorpe Co, GA |
Jenna Demeter | SRR6922306 | 1541579 | Oglethorpe Co, GA |
Presley Dowker | SRR6922308 | 2987905 | Oglethorpe Co, GA |
Nick Gill | SRR6922451 | 1731620 | Oglethorpe Co, GA |
Axel Gonzalez Murillo | SRR6922454 | 2897067 | Oglethorpe Co, GA |
Rebecca Butler | SRR6922449 | 3649330 | Oglethorpe Co, GA |
Eunice Omondi | SRR6922311 | 258861 | Oglethorpe Co, GA |
Madison Henniger | SRR6922354 | 1344896 | Pascagoula, MS |
Katelin Hubbard | SRR6922399 | 1091567 | Pascagoula, MS |
Harleen Kaur | SRR6922194 | 1100979 | Alejandra, Argentina |
Peitong Li | SRR6922233 | 1216598 | Alejandra, Argentina |
Leif Majeres | SRR6922241 | 1148981 | Alejandra, Argentina |
Dillon McCallum | SRR6922315 | 1017635 | Alejandra, Argentina |
Ruwaa Mohamed | SRR6922318 | 1592618 | Alejandra, Argentina |
Shade Niece | SRR6922319 | 2199106 | Alejandra, Argentina |
Beatrice Caiado | SRR6922321 | 1696637 | Alejandra, Argentina |
Rachel Robrecht | SRR6922446 | 991054 | Alejandra, Argentina |
Raj Roy | SRR6922447 | 1957684 | Alejandra, Argentina |
Rebecca Smith | SRR6922448 | 1053520 | Alejandra, Argentina |
Jennifer Chandler | SRR6922316 | 913465 | Alejandra, Argentina |
Garrett Franklin | SRR6922314 | 533098 | Alejandra, Argentina |
Triston Walsh | SRR6922320 | 683277 | Alejandra, Argentina |
Zane Smith | SRR6922276 | 1111463 | El Recreo, Argentina |
Peter Tandy | SRR6922278 | 1223105 | El Recreo, Argentina |
Bryce Trull | SRR6922291 | 1154914 | El Recreo, Argentina |
Jackson Turner | SRR6922309 | 1771519 | El Recreo, Argentina |
Instructor | SRR6922470 | 1548618 | El Recreo, Argentina |
Go to the analysis directory within the EPP 622 course directory:
/pickett_shared/teaching/EPP622_Fall2022/analysis
...and make a personal analysis folder. For example:
mkdir <your user id goes here>
cd <your user id goes here>
Now, let's make a directory where we will run fastqc:
mkdir 1_fastqc
cd 1_fastqc
We can create a soft link (symbolic link) to the raw data:
ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .
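As a quick sanity check that the link resolves, the read count can be compared with the Total # Reads column in the table above. This is a minimal sketch, assuming standard four-line FASTQ records; `count_reads` and the two-read demo file are hypothetical, introduced here only for illustration.

```shell
# Count the reads in a FASTQ file: each record is exactly four lines.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Tiny two-read demo file (made-up sequences, illustration only).
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo"
count_reads "$demo"   # prints 2
rm -f "$demo"
```

On sphinx, `count_reads <your subset>.fastq` should print a number matching your table entry, provided the table was computed from the same file.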
Let's load fastqc:
spack load fastqc
Let's run the program now. Since we are all sharing the same computing resource, we will run fastqc on just one forward-read fastq file:
fastqc <your subset>
This program outputs results in .zip and .html formats. We can't view the HTML report on sphinx, so we'll need to copy it to our own devices. Run the following from a terminal on your own computer:
scp '<your_username>@sphinx.ag.utk.edu:/pickett_shared/teaching/EPP622_Fall2022/analysis/<your_username>/1_fastqc/*html' ./
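For a text-only check that does not need a browser, the .zip archive also contains a tab-separated summary.txt listing a PASS, WARN, or FAIL status for each FastQC module. The sketch below filters that summary for anything that is not PASS; `flagged_modules` is a hypothetical helper, and the three-line summary is made up for illustration.

```shell
# Read FastQC summary.txt content on stdin (tab-separated: status, module, file)
# and print every module whose status is not PASS.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Made-up three-line summary, for illustration only.
printf 'PASS\tBasic Statistics\tdemo.fastq\nWARN\tPer base sequence content\tdemo.fastq\nFAIL\tAdapter Content\tdemo.fastq\n' \
    | flagged_modules
```

Assuming `unzip` is available where the archive lives, a real report could be checked with `unzip -p <your subset>_fastqc.zip '*/summary.txt' | flagged_modules`.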
Skewer is a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. Its features include detecting and removing adapter sequences and trimming reads based on Phred quality scores. Now, go to your personal analysis directory and make a new directory:
cd /pickett_shared/teaching/EPP622_Fall2022/analysis/<your_username>
mkdir 2_skewer
cd 2_skewer
Soft link the raw data file here, too (links take up almost no space!):
ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .
Skewer is installed locally on sphinx, so we won't have to use Spack to load it this time.
/sphinx_local/software/skewer/skewer -t 2 -l 95 -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG -Q 30 <your subset> -o <outfile name>
-t is the number of threads used by this command
-l is the minimum length a read must have after trimming to be kept in our analyses
-x is the adapter sequence to trim
-Q is the minimum mean quality score (Phred score) of a read (across the entire read length)
-o is the base name for the output files
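To make the -Q threshold concrete, the sketch below computes the mean Phred score of each read directly from its quality string. It assumes Phred+33 encoding and four-line records; `mean_phred` and the demo reads are hypothetical, and Skewer's own calculation may differ in detail.

```shell
# Print the mean Phred score of every read in a FASTQ file.
# Assumes Phred+33 quality encoding and four-line records.
mean_phred() {
    awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 0 {
             n = length($0); sum = 0
             # Convert each quality character to its Phred score and add it up
             for (j = 1; j <= n; j++) sum += ord[substr($0, j, 1)] - 33
             if (n > 0) printf "%.1f\n", sum / n
         }' "$1"
}

# Two made-up reads: "I" encodes Q40 and "5" encodes Q20.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII55\n' > "$demo"
mean_phred "$demo"   # prints 40.0 then 30.0
rm -f "$demo"
```

Under this definition, the second demo read sits exactly at a mean of 30 even though half of its bases are only Q20, which is why a mean-quality filter is a per-read summary and not a per-base guarantee.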
Note: Here we use -Q 30 as an illustrative example because the data is already very high quality. In some instances, Q30 may be considered on the stricter end of trimming thresholds (see Del Fabbro et al. 2013).
Say you wanted to trim all the files using a for loop. Here is an example of how to do that:
for f in *.fastq
do
    # Strip the _1.fastq suffix to get the sample name
    BASE=$(basename "$f" | sed 's/_1\.fastq$//')
    echo "$BASE"
    /sphinx_local/software/skewer/skewer \
        -t 2 -l 95 -Q 30 \
        -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
        "$f" -o "$BASE"
done
Now that we have trimmed our sequence file, let's check its quality using fastqc. Go back to your personal directory, make a new directory, and run fastqc on the trimmed file:
fastqc <your subset>-trimmed.fastq
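Alongside the fastqc report, one simple way to confirm that the -l 95 length filter took effect is to look at the shortest read left in the trimmed file. This is a minimal sketch, assuming four-line FASTQ records; `min_read_length` and the demo reads are hypothetical.

```shell
# Print the length of the shortest read in a FASTQ file
# (the sequence is line 2 of every four-line record).
min_read_length() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (!seen || n < min) { min = n; seen = 1 }
         }
         END { if (seen) print min }' "$1"
}

# Two made-up reads of length 4 and 6, for illustration only.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$demo"
min_read_length "$demo"   # prints 4
rm -f "$demo"
```

Run as `min_read_length <your subset>-trimmed.fastq`, the result would be expected to be 95 or more if the trimming above behaved as described.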