-
Notifications
You must be signed in to change notification settings - Fork 1
/
BGA Practical 2 - NGS Data QC and Alignment.Rmd
420 lines (299 loc) · 12.7 KB
/
BGA Practical 2 - NGS Data QC and Alignment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
---
title: "NGS Reads QC and Alignemnt"
author: "Beatriz Manso"
date: '2022-03-24'
output:
distill::distill_article:
toc: true
toc_float: true
toc_depth: 3
number_sections: true
code_folding: false
editor_options:
markdown:
wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
Modern sequencing techniques enable researchers to sequence the genome at greater depth than ever before. NGS stands fro Next Generation Sequencing and allows for extremely high throughput, scalability, and speed. Using this technology, it is possible to determine nucleotide sequences within entire genomes or specific regions of DNA or RNA.
All high-throughput sequencing analyses require quality checks and processing of reads.
Before we can quantify the genomic signal of interest and apply statistical and/or machine learning methods, we must process, quality check, and align the millions of reads generated by these sequencing experiments.
In this tutorial, we will assess the read quality and processing which are fundamental steps in all high-throughput sequencing analyses.
# Methods
## 1. Set working directory
```{r eval=FALSE}
setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical")
```
## 2. Install necessary packages and load libraries
```{r}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rsubread")
BiocManager::install("ShortRead")
BiocManager::install("fastqcr")
BiocManager::install("Rqc")
BiocManager::install("QuasR")
BiocManager::install("ngsReports")
```
```{r}
library(fastqcr)
library(ShortRead)
library(ngsReports)
library(Rqc)
library(QuasR)
library(Rsubread)
```
## 3. Download Fasq data
We will use system.file() to upload data into a folder we will call
'mydata' in.
```{r}
mydata = system.file(package="ShortRead", "extdata/E-MTAB-1147")
```
Now feed *fastq.qz* files in "mydata" to the quality check function
rqc() to create the object qcRes:
```{r}
qcRes=rqc(path = mydata, pattern = ".fastq.gz", openBrowser=FALSE)
```
[Report is in
C:/Users/manso/AppData/Local/Temp/RtmpMbKuwN/rqc_report.html Pasting the
file path in browser window will show the html report.]{.underline}
## 4. Plot quality metrics
Plot a sequence quality per base/cycle metrics for our fastq files in
qcRes object use
```{r}
rqcCycleQualityBoxPlot(qcRes)
```
**What does this graph depicts?**
The graph above shows the sequence quality per base/cycle metrics, we
can observe that the quality of the reads degrades at the end which is
expected for long sequences. Good samples will have median quality
scores per base above 28. If scores are below 20 towards the ends, you
can think about trimming the reads.
**Why is the x-axis labelled as 'cycle'?**
Cycles correspond to bases/nucleotides along the read, and the number of
cycles is equivalent to the read length.
## 5. Find Sequence content per base/cycle
```{r eval=FALSE}
rqcCycleBaseCallsLinePlot(qcRes)
```
**What does the per base sequence content show?**
Per-base sequence content shows nucleotide proportions for each
position.
**This plot shows sequence bias -- what is the probable cause of the
bias?**
Some types of sequencing libraries can produce a biased sequence
composition. For example, in RNA-Seq, it is common to have bias at the
beginning of the reads. This happens because of random primers annealing
to the start of reads during RNA-Seq library preparation. These primers
are not truly random, which leads to a variation at the beginning of the
reads. Although RNA-seq experiments will usually have these biases, this
will not affect the ability of measuring gene expression.
## 6. Filtering and trimming read
It might be necessary to trim or filter the reads based on the quality
check results. It may be that adapters are present on either side of the
read, or that technical errors are causing the base quality to diminish
toward the ends of the read. This portion of the read should be trimmed,
in both cases, to allow alignment or better alignment of the genome.
We can use QuasR or ShortRead packages to trim and filter reads the
reads. However, QuasR::preprocessReads() function provides a single
interface to multiple pre-processing.
### 6.1. First, Obtain a list of fastq file paths:
```{r}
fastqFiles <- system.file(package="ShortRead",
"extdata/E-MTAB-1147",
c("ERR127302_1_subset.fastq.gz", "ERR127302_2_subset.fastq.gz"))
```
For this tutorial we are using built in training data from the package -
and we access this data using 'system.file' command. When using our own
data we must create the correct filepath for our data.
### 6.2. Next, Define processed fastq file names:
```{r}
outfiles <- paste(tempfile(pattern=c("processed_1_", "processed_2_")),".fastq",sep="")
```
### 6.3. Then we Process fastq files with the following parameters as an example:
1. Remove reads that have more than 1 N, (nBases)
2. trim 3 bases from the end of the reads (truncateEndBases)
3. remove ACCCGGGA patern if it occurs at the start (Lpattern)
4. remove reads shorter than 40 base-pairs (minLength)
```{r}
preprocessReads(fastqFiles, outfiles, nBases=1,
truncateEndBases=3,Lpattern="ACCCGGGA",
minLength=40)
```
### 6.4. Filter out reads with quality score below 20:
- Obtain a list of fastq file paths:
```{r}
fastqFile <- system.file(package="ShortRead",
"extdata/E-MTAB-1147",
"ERR127302_1_subset.fastq.gz")
```
- Read fastq file:
```{r}
fq = readFastq(fastqFile)
```
- Get quality scores per base as a matrix:
```{r}
qPerBase = as(quality(fq), "matrix")
```
- Get the number of bases per read that have quality score below 20:
```{r}
qcount = rowSums( qPerBase <= 20)
```
- Number of reads where all Phred scores \>= 20
```{r}
fq[qcount == 0]
```
- Write out fastq file with only reads where all quality scores per
base are above 20:
```{r}
writeFastq(fq[qcount == 0], paste(fastqFile, "Qfiltered", sep="_"))
```
ShortRead::FastqStreamer() - streams the fastq file in blocks, each with
a pre-defined number of reads. Which can be accessed using the yield()
function. As we call the yield() function after opening a fastq file
with FastqStreamer(), a new section of the file is read into memory.
- Set up a streaming with block size 1000. Every time you call the
yield() function 1000 read portion of the file will be read
successively.
```{r}
f <- FastqStreamer(fastqFile, readerBlockSize=1000)
```
- Set up a while loop to call yield() function to go through the file:
```{r}
while(length(fq <- yield(f))) {
qPerBase = as(quality(fq), "matrix")
qcount = rowSums( qPerBase <= 20)
writeFastq(fq[qcount == 0],
paste(fastqFile, "Qfiltered", sep="_"),
mode="a")
}
```
This loop will remove reads where all quality scores are \< 20, get
quality scores per base as a matrix; get number of bases per read that
have Q score \< 20, write fastq file with mode="a", to every new block,
and written out to the same file.
## 7. Map or align the reads to the genome (library(QuasR) is needed)
### 7.1. Copy example data to current working directory:
```{r}
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)
```
### 7.2. Get genome file in fasta format:
```{r}
genomeFile <- "extdata/hg19sub.fa"
```
### 7.3. Get text file containing sample names and fastq file paths:
```{r}
sampleFile <- "extdata/samples_chip_single.txt"
```
### 7.4. Create alignments:
```{r}
proj <- qAlign(sampleFile, genomeFile)
```
With qAlign(), alignment parameters are automatically selected based on
different types of alignment problems
## 8. Map or align the reads to the genome with [Rsubread]{.underline}
Data files needed:
- Mouse chromosome
- 1 Rsubread index files (\~400MB);
- Targets2.txt
- The 12 fastq.gz files for the mouse dataset: 12 fastq files with
1000 reads each
- 4 index files for chr 1 for mm10, targets files with sample
information.
1. Put all the sequencing read data (.fastq.gz files) in your data
directory.
2. Find the files and command Rsubread aligner which files to look at.
### 8.1. Search for all .fastq.gz files in the directory using the list.files command.
```{r}
fastq.files <- list.files(path = "C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical/MouseMamRNA-Seq", pattern = ".fastq.gz$", full.names = TRUE)
fastq.files
```
The pattern argument takes a regular expression. In this case, we are
using the "\$" to mean the end of the string, so we will only get files
that end in ".fastq.gz"
### 8.2. Build the Index:
```{r}
#buildindex(basename="./reference_index",reference=ref)
buildindex(basename="chr1_mm10",reference="chr1.fa")
```
With our index generated, we can align the reads with the align command.
According to align's default settings, it only keeps reads that uniquely
map to the reference genome. We need this when testing differential gene
expression, because each read is unambiguously assigned to one place in
the genome, making interpretation of results easier.
### 8.3. Align the reads using the align command:
```{r}
#to view and change parameters run: args(align)
align("chr1_mm10", readfile1 = fastq.files,
useAnnotation=T, nthreads=10, annot.inbuilt="mm10",
phredOffset = 64 )
```
If we have paired end data, we would need to specify the second read
file/s using the readfile2 argument. The mouse data for this experiment
are 100bp single end reads.(this might take very long time to complete)
We can specify the output files, or we can let Rsubread choose the
output file names for us. The default output file name is the filename
with ".subread.BAM" added at the end.
### 8.4. Get a summary of the proportion of reads that mapped to the reference genome using the `propmapped` function:
```{r}
bam.files <- list.files(path = "C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical/MouseMamRNA-Seq", pattern = ".BAM$", full.names = TRUE)
bam.files
```
```{r}
props <- propmapped(files=bam.files)
props
```
As a result of the alignment, there are several BAM files, each
containing read alignments for each library. BAM files contain the
chromosomal location of each read.
## 9. Quality control after alignemnt
By using the qualityScores function in Rsubread, we can examine the
quality scores associated with each base that has been called by the
sequencing machine.
### 9.1. Extract quality scores(qs) for 100 reads
```{r}
qs <- qualityScores(filename="C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 2 - NGS data QC and alignment in R/practical/MouseMamRNA-Seq/SRR1552450.fastq.gz",nreads=100)
```
### 9.2. Check dimension of qs
```{r}
dim(qs)
```
### 9.3. Check first few elements of qs with head:
```{r}
head(qs)
```
### 9.4. Visualise Quality scores
With a quality score of 30, the likelihood of making an incorrect call
is 1 in 1000. (A quality score of 10 is a 1 in 10 chance of an incorrect
base call.) A boxplot can be used to examine the overall distribution of
quality scores across the 100 reads.
```{r}
boxplot(qs)
```
## 10. Counting
The featureCounts() function can be used to calculate the number of
mapped reads across mouse genes. There are built-in annotations for
mouse genome assemblies (mm9, mm10) and human genome assemblies (hg19)
(NCBI refseq annotations).
featureCounts() takes all the BAM files as input, and outputs an object
which includes the count matrix. Each sample is a separate column, each
row is a gene.
```{r}
fc <- featureCounts(bam.files, annot.inbuilt="mm10")
names(fc) #See what slots are stored in fc
```
"fc\$stat" shows the statistics of the read mapping, also reporting the
unassigned reads and the reasons why they weren't assigned:
```{r}
fc$stat
```
In this matrix, the rows denote the Entrez gene identifiers for each
gene, and the columns denote the output filenames from the align
function.
FeatureCounts() compiled reads over genes using annotation information
in the annotation slot.
```{r}
head(fc$annotation)
```