BGA Practical 2 - NGS Data QC and Alignment.Rmd

---
title: "NGS Reads QC and Alignemnt"
author: "Beatriz Manso"
date: '2022-03-24'
output:
  distill::distill_article:
    toc: true
    toc_float: true
    toc_depth: 3
    number_sections: true
    code_folding: false
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

Modern sequencing techniques enable researchers to sequence the genome at greater depth than ever before. NGS stands fro Next Generation Sequencing and allows for extremely high throughput, scalability, and speed. Using this technology, it is possible to determine nucleotide sequences within entire genomes or specific regions of DNA or RNA.

All high-throughput sequencing analyses require quality checks and processing of reads. 
Before we can quantify the genomic signal of interest and apply statistical and/or machine learning methods, we must process, quality check, and align the millions of reads generated by these sequencing experiments. 

In this tutorial, we will assess the read quality and processing which are fundamental steps in all high-throughput sequencing analyses.

# Methods

## 1. Set working directory

```{r eval=FALSE}
setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical")
```

## 2. Install necessary packages and load libraries

```{r}
if (!requireNamespace("BiocManager", quietly = TRUE))
   install.packages("BiocManager")
BiocManager::install("Rsubread")
BiocManager::install("ShortRead")
BiocManager::install("fastqcr")
BiocManager::install("Rqc")
BiocManager::install("QuasR")
BiocManager::install("ngsReports")
```

```{r}
library(fastqcr)
library(ShortRead)
library(ngsReports)
library(Rqc)
library(QuasR)
library(Rsubread)
```

## 3. Download Fasq data

We will use system.file() to upload data into a folder we will call
'mydata' in.

```{r}
mydata = system.file(package="ShortRead", "extdata/E-MTAB-1147")
```

Now feed *fastq.qz* files in "mydata" to the quality check function
rqc() to create the object qcRes:

```{r}
qcRes=rqc(path = mydata, pattern = ".fastq.gz", openBrowser=FALSE)

```

[Report is in
C:/Users/manso/AppData/Local/Temp/RtmpMbKuwN/rqc_report.html Pasting the
file path in browser window will show the html report.]{.underline}

## 4. Plot quality metrics

Plot a sequence quality per base/cycle metrics for our fastq files in
qcRes object use

```{r}
rqcCycleQualityBoxPlot(qcRes)
```

**What does this graph depicts?**

The graph above shows the sequence quality per base/cycle metrics, we
can observe that the quality of the reads degrades at the end which is
expected for long sequences. Good samples will have median quality
scores per base above 28. If scores are below 20 towards the ends, you
can think about trimming the reads.

**Why is the x-axis labelled as 'cycle'?**

Cycles correspond to bases/nucleotides along the read, and the number of
cycles is equivalent to the read length.

## 5. Find Sequence content per base/cycle

```{r eval=FALSE}
rqcCycleBaseCallsLinePlot(qcRes)
```

**What does the per base sequence content show?**

Per-base sequence content shows nucleotide proportions for each
position.

**This plot shows sequence bias -- what is the probable cause of the
bias?**

Some types of sequencing libraries can produce a biased sequence
composition. For example, in RNA-Seq, it is common to have bias at the
beginning of the reads. This happens because of random primers annealing
to the start of reads during RNA-Seq library preparation. These primers
are not truly random, which leads to a variation at the beginning of the
reads. Although RNA-seq experiments will usually have these biases, this
will not affect the ability of measuring gene expression.

## 6. Filtering and trimming read

It might be necessary to trim or filter the reads based on the quality
check results. It may be that adapters are present on either side of the
read, or that technical errors are causing the base quality to diminish
toward the ends of the read. This portion of the read should be trimmed,
in both cases, to allow alignment or better alignment of the genome.

We can use QuasR or ShortRead packages to trim and filter reads the
reads. However, QuasR::preprocessReads() function provides a single
interface to multiple pre-processing.

### 6.1. First, Obtain a list of fastq file paths:

```{r}
fastqFiles <- system.file(package="ShortRead",
                          "extdata/E-MTAB-1147",
                          c("ERR127302_1_subset.fastq.gz", "ERR127302_2_subset.fastq.gz"))

```

For this tutorial we are using built in training data from the package -
and we access this data using 'system.file' command. When using our own
data we must create the correct filepath for our data.

### 6.2. Next, Define processed fastq file names:

```{r}
outfiles <- paste(tempfile(pattern=c("processed_1_", "processed_2_")),".fastq",sep="")
```

### 6.3. Then we Process fastq files with the following parameters as an example:

1.  Remove reads that have more than 1 N, (nBases)
2.  trim 3 bases from the end of the reads (truncateEndBases)
3.  remove ACCCGGGA patern if it occurs at the start (Lpattern)
4.  remove reads shorter than 40 base-pairs (minLength)

```{r}
preprocessReads(fastqFiles, outfiles, nBases=1,
                truncateEndBases=3,Lpattern="ACCCGGGA",
                minLength=40)

```

### 6.4. Filter out reads with quality score below 20:

-   Obtain a list of fastq file paths:

```{r}
fastqFile <- system.file(package="ShortRead",
                         "extdata/E-MTAB-1147",
                         "ERR127302_1_subset.fastq.gz")
```

-   Read fastq file:

```{r}
fq = readFastq(fastqFile)
```

-   Get quality scores per base as a matrix:

```{r}
qPerBase = as(quality(fq), "matrix")
```

-   Get the number of bases per read that have quality score below 20:

```{r}
qcount = rowSums( qPerBase <= 20) 
```

-   Number of reads where all Phred scores \>= 20

```{r}
fq[qcount == 0]
```

-   Write out fastq file with only reads where all quality scores per
    base are above 20:

```{r}
writeFastq(fq[qcount == 0], paste(fastqFile, "Qfiltered", sep="_"))
```

ShortRead::FastqStreamer() - streams the fastq file in blocks, each with
a pre-defined number of reads. Which can be accessed using the yield()
function. As we call the yield() function after opening a fastq file
with FastqStreamer(), a new section of the file is read into memory.

-   Set up a streaming with block size 1000. Every time you call the
    yield() function 1000 read portion of the file will be read
    successively.

```{r}
f <- FastqStreamer(fastqFile, readerBlockSize=1000)
```

-   Set up a while loop to call yield() function to go through the file:

```{r}
while(length(fq <- yield(f))) { 
  qPerBase = as(quality(fq), "matrix") 
  qcount = rowSums( qPerBase <= 20) 
  writeFastq(fq[qcount == 0],
             paste(fastqFile, "Qfiltered", sep="_"),
             mode="a")
}
```

This loop will remove reads where all quality scores are \< 20, get
quality scores per base as a matrix; get number of bases per read that
have Q score \< 20, write fastq file with mode="a", to every new block,
and written out to the same file.

## 7. Map or align the reads to the genome (library(QuasR) is needed)

### 7.1. Copy example data to current working directory:

```{r}
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)
```

### 7.2. Get genome file in fasta format:

```{r}
genomeFile <- "extdata/hg19sub.fa"
```

### 7.3. Get text file containing sample names and fastq file paths:

```{r}
sampleFile <- "extdata/samples_chip_single.txt"

```

### 7.4. Create alignments:

```{r}
proj <- qAlign(sampleFile, genomeFile)
```

With qAlign(), alignment parameters are automatically selected based on
different types of alignment problems

## 8. Map or align the reads to the genome with [Rsubread]{.underline}

Data files needed:

-   Mouse chromosome

-   1 Rsubread index files (\~400MB);

-   Targets2.txt

-   The 12 fastq.gz files for the mouse dataset: 12 fastq files with
    1000 reads each

-   4 index files for chr 1 for mm10, targets files with sample
    information.

1.  Put all the sequencing read data (.fastq.gz files) in your data
    directory.
2.  Find the files and command Rsubread aligner which files to look at.

### 8.1. Search for all .fastq.gz files in the directory using the list.files command.

```{r}
fastq.files <- list.files(path = "C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical/MouseMamRNA-Seq", pattern = ".fastq.gz$", full.names = TRUE)

fastq.files
```

The pattern argument takes a regular expression. In this case, we are
using the "\$" to mean the end of the string, so we will only get files
that end in ".fastq.gz"

### 8.2. Build the Index:

```{r}
#buildindex(basename="./reference_index",reference=ref)
buildindex(basename="chr1_mm10",reference="chr1.fa")
```

With our index generated, we can align the reads with the align command.

According to align's default settings, it only keeps reads that uniquely
map to the reference genome. We need this when testing differential gene
expression, because each read is unambiguously assigned to one place in
the genome, making interpretation of results easier.

### 8.3. Align the reads using the align command:

```{r}
#to view and change parameters run: args(align) 

align("chr1_mm10", readfile1 = fastq.files, 
      useAnnotation=T, nthreads=10, annot.inbuilt="mm10",
      phredOffset = 64 )
```

If we have paired end data, we would need to specify the second read
file/s using the readfile2 argument. The mouse data for this experiment
are 100bp single end reads.(this might take very long time to complete)

We can specify the output files, or we can let Rsubread choose the
output file names for us. The default output file name is the filename
with ".subread.BAM" added at the end.

### 8.4. Get a summary of the proportion of reads that mapped to the reference genome using the `propmapped` function:

```{r}
bam.files <- list.files(path = "C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical/MouseMamRNA-Seq", pattern = ".BAM$", full.names = TRUE)

bam.files
```

```{r}
props <- propmapped(files=bam.files)
props
```

As a result of the alignment, there are several BAM files, each
containing read alignments for each library. BAM files contain the
chromosomal location of each read.

## 9. Quality control after alignemnt

By using the qualityScores function in Rsubread, we can examine the
quality scores associated with each base that has been called by the
sequencing machine.

### 9.1. Extract quality scores(qs) for 100 reads

```{r}
qs <- qualityScores(filename="C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical/MouseMamRNA-Seq/SRR1552450.fastq.gz",nreads=100)
```

### 9.2. Check dimension of qs

```{r}
dim(qs)
```

### 9.3. Check first few elements of qs with head:

```{r}
head(qs)
```

### 9.4. Visualise Quality scores

With a quality score of 30, the likelihood of making an incorrect call
is 1 in 1000. (A quality score of 10 is a 1 in 10 chance of an incorrect
base call.) A boxplot can be used to examine the overall distribution of
quality scores across the 100 reads.

```{r}
boxplot(qs)
```

## 10. Counting

The featureCounts() function can be used to calculate the number of
mapped reads across mouse genes. There are built-in annotations for
mouse genome assemblies (mm9, mm10) and human genome assemblies (hg19)
(NCBI refseq annotations).

featureCounts() takes all the BAM files as input, and outputs an object
which includes the count matrix. Each sample is a separate column, each
row is a gene.

```{r}
fc <- featureCounts(bam.files, annot.inbuilt="mm10")

names(fc) #See what slots are stored in fc
```

"fc\$stat" shows the statistics of the read mapping, also reporting the
unassigned reads and the reasons why they weren't assigned:

```{r}
fc$stat
```

In this matrix, the rows denote the Entrez gene identifiers for each
gene, and the columns denote the output filenames from the align
function.

FeatureCounts() compiled reads over genes using annotation information
in the annotation slot.

```{r}
head(fc$annotation)
```