FastQC
I used to use a yeast dataset for this class, but then the genomics lesson of Data Carpentry (DC) was written, and it uses this cool E. coli dataset, so we're going to try that this semester.
Blount et al. 2008 Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli
The experiment was designed to assess adaptation in E. coli. A population was propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for E. coli, providing faster growth than other sugars). This medium was supplemented with citrate, which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that a spontaneous citrate-using variant (Cit+) appeared between 31,000 and 31,500 generations, causing an increase in population size and diversity. In addition, this experiment showed hypermutability in certain regions. Hypermutability is important and can help accelerate adaptation to novel environments, but it can also be selected against in well-adapted populations.
We will be working with three sample events from the Ara-3 strain of this experiment, one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how (the evolution of a Cit+ mutant and hypermutability) with our variant calling workflow.
More information about the data.
Log in to the ACF, get an interactive terminal, and go to your personal project folder in the class directory. Start a project directory inside that folder:
mkdir e_coli
cd e_coli
Start a raw data directory
mkdir raw_data
cd raw_data
I have already downloaded the data, so you will not have to. But suppose you need to download public data yourself in the future? You can download using URLs from ENA - this is how it is demonstrated in the DC lesson. You can also use the SRA toolkit, which is a command-line utility for getting SRA data from NCBI.
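If you do need to fetch the files yourself one day, a download might look roughly like the sketch below. It only prints the commands (a dry run) instead of downloading anything. The URL layout is an assumption taken from the way the DC lesson links to ENA, so confirm the links on the ENA page for the accession before relying on them.

```shell
# Dry run: print the download commands rather than running them.
# ASSUMPTION: this ENA path layout follows the Data Carpentry lesson;
# check the real links on the ENA website for your accession.
accession="SRR2589044"
base_url="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/${accession}"

# Paired-end data comes as two files per sample: _1 and _2.
for mate in 1 2; do
    url="${base_url}/${accession}_${mate}.fastq.gz"
    echo "curl -O $url"
done
```

Once the printed URLs look right, running the printed lines would do the actual download. With the SRA toolkit, the analogous step is typically a `fasterq-dump` call on the accession.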
We need 6 files representing 3 samples (2 files per sample because they are paired-end reads). The three SRAs are: SRR2589044, SRR2584863, and SRR2584866. I have put them in /lustre/haven/proj/UTK0138/e_coli_data. Let's check them out.
ls -lh ../../../e_coli_data
head ../../../e_coli_data/SRR2584863_1.fastq.gz
wc -l ../../../e_coli_data/SRR2584863_1.fastq
wc -l ../../../e_coli_data/SRR2584863_2.fastq
head ../../../e_coli_data/SRR2584863_1.fastq
head ../../../e_coli_data/SRR2584863_2.fastq
- How much space does gzipping a file save?
- What does a gz file look like when you look at the head? Why?
- How many lines are in the unzipped files? How many sequences do they have?
- How are the uncompressed fastq files structured? How are the pairs arranged? How do you know they are pairs?
- Do the few reads we've seen have good quality values? How are the quality values changing over the length of the read?
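Once you have worked through the questions, here is one way to check the line and sequence counts. This is a minimal sketch on a made-up two-read file in a scratch directory, not the class data; the file name toy.fastq and the reads in it are invented for illustration.

```shell
# Build a tiny two-read FASTQ file in a scratch directory (toy data).
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII#I\n' > "$demo_dir/toy.fastq"

# Each FASTQ record is four lines: header, sequence, separator, qualities.
n_lines=$(wc -l < "$demo_dir/toy.fastq")
n_reads=$((n_lines / 4))
echo "lines: $n_lines  reads: $n_reads"

# The same count works on a gzipped copy by decompressing to a pipe,
# so nothing extra is written to disk.
gzip -c "$demo_dir/toy.fastq" > "$demo_dir/toy.fastq.gz"
n_reads_gz=$(( $(gzip -dc "$demo_dir/toy.fastq.gz" | wc -l) / 4 ))
echo "reads in gzipped copy: $n_reads_gz"
```

The same pattern should carry over to the real files by swapping in their paths.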
You are going to create a symbolic link (also called a soft link) to the .gz files.
What is a symbolic link? A special type of file that points to another file or directory. Kind of like a bookmark.
Why are we using it? Because we don't need to store 40 copies of these files, one for every student. This saves room! We'll also see how it allows you to keep your directories neat and organized without copying files all over the place.
You should be in your raw_data folder. Check.
pwd
ln -fs ../../../e_coli_data/*gz .
ls
ls -l
Now you can see the files and interact with them as though they are in your current directory (even though they aren't!).
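If you want to convince yourself of what a link really is, the sketch below repeats the idea on throwaway files. The directory and file names (shared, mine, reads.fastq.gz) are made up for the demonstration, and it uses absolute paths where the class command uses a relative one.

```shell
# Toy demonstration in a scratch directory; all names here are made up.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/shared" "$demo_dir/mine"
echo "pretend this is a big FASTQ file" > "$demo_dir/shared/reads.fastq.gz"

# -s makes a symbolic link; -f replaces a link of the same name if one exists.
ln -fs "$demo_dir/shared/reads.fastq.gz" "$demo_dir/mine/"

# ls -l shows the link and its target; readlink prints only the target.
ls -l "$demo_dir/mine"
link_target=$(readlink "$demo_dir/mine/reads.fastq.gz")
echo "link points to: $link_target"

# Reading through the link returns the original file's contents.
link_contents=$(cat "$demo_dir/mine/reads.fastq.gz")
echo "$link_contents"
```

Deleting the link removes only the pointer. The original file in the shared folder stays where it is.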
Remember, we want to put our analysis in a sensible directory structure. Let's get that going...
cd ../
mkdir analysis
cd analysis
mkdir 1_fastqc
cd 1_fastqc
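As an aside, the same layout can be built in one step with mkdir -p. The sketch below does it in a scratch directory so it doesn't touch your real project folder; the scratch location is just for illustration.

```shell
# Scratch location so the sketch leaves your real project folder alone.
demo_dir=$(mktemp -d)

# -p creates any missing parent directories and does not complain
# if a directory already exists.
mkdir -p "$demo_dir/e_coli/raw_data" "$demo_dir/e_coli/analysis/1_fastqc"

# List every directory that was created.
find "$demo_dir/e_coli" -type d | sort
```

Stepping through mkdir and cd one at a time, as we did above, is still worth doing while you are learning where everything lives.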
Learn all about FastQC.
We want to document all our steps. So let's start a file called commands.sh. In that file, let's put some notes:
## September 5 2020
## Quality check of raw data
First we need to load the module (it's already been installed on the ACF). Then we can check out the options for running it.
module load fastqc
fastqc -help | less
This program is pretty smart and will run each file individually if we give it a list of files. So we can run it on all files like this:
fastqc -t 2 -o . ../../raw_data/*gz
- What is each parameter doing?
Did it work? What kind of output did it produce?
ls
Don't forget to keep documenting! Let's add these commands to our commands.sh file. If you aren't sure what you've run, you can use history. We can also add the version of fastqc, which you can find with fastqc -version.
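You can edit commands.sh in a text editor, but appending from the command line works too. Here is a minimal sketch, run in a scratch directory so your real commands.sh is left alone. The version line is only written if fastqc is actually available where you run this.

```shell
# Work in a scratch directory so the real commands.sh is not modified.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# >> appends, so any notes already in commands.sh are kept.
# The quoted 'EOF' stops the shell from expanding the * in the text.
cat >> commands.sh <<'EOF'
## September 5 2020
## Quality check of raw data
module load fastqc
fastqc -t 2 -o . ../../raw_data/*gz
EOF

# Record the tool version as a comment, but only if fastqc is on the PATH.
if command -v fastqc >/dev/null 2>&1; then
    echo "## $(fastqc --version)" >> commands.sh
fi

cat commands.sh
```

Whichever way you write it, the goal is the same: someone reading commands.sh later (probably you) should be able to rerun the analysis.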
I mentioned in the recording of the lecture that fastqc outputs HTML files, which are difficult to view on the command line. We're going to copy them to our laptops. This is a very useful thing to know how to do.
Here's an example of an scp command, to be run from a terminal on your computer (not from the ACF terminal). Replace both instances of username with your own username.
scp [email protected]:/lustre/haven/proj/UTK0138/username/e_coli/analysis/1_fastqc/*html .
More info about random hexamer priming not being so random
Learn all about MultiQC
This is a super awesome software tool that aggregates results from other tools and makes them prettier and easier to interpret. I haven't been able to install it on the ACF, but you could probably get it running on your own laptop using conda, a package and environment manager we'll talk about later. I went ahead and merged our FastQC files to make a MultiQC file, then uploaded it to /lustre/haven/proj/UTK0138/e_coli_data/multiqc_report.html. You can secure copy it to your laptop to view.
From a terminal on your laptop (not the ACF):
scp [email protected]:/lustre/haven/proj/UTK0138/e_coli_data/multiqc_report.html .