-
Notifications
You must be signed in to change notification settings - Fork 0
/
Cayo16S_dada2Code.Rmd
341 lines (230 loc) · 14.5 KB
/
Cayo16S_dada2Code.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: "Cayo16S_dada2Code"
---
Load packages:
```{r packages, message=FALSE}
#install.packages("ggplot2")
#library(ggplot2)
x<-c("ggplot2", "dada2", "vegan", "tidyr","car","DESeq2", "phyloseq","FSA", "lme4", "ggpubr", 'reshape2')
lapply(x, require, character.only = TRUE)
```
## First step: remove barcodes, indices, and primers from reads:
These are Nextera-based adapter sequences.
Can be removed with cutadapt or similar, or removed with trimLeft instead during filter and trim step below.
Load metadata:
```{r}
dataset <- read.delim("./Cayo16S_metadata_Aging_MS.txt",header=T,row.names = 1)
dataset$sample_ID <- rownames(dataset)
dataset$age[is.na(dataset$age)] <- -1
dataset$control <- "No"
for (i in 1:length(dataset$sample_type)){
if (dataset$sample_type[i]== "negcontrol"){
dataset$control[i] <- "Yes"
}
if (dataset$sample_type[i]== "poscontrol"){
dataset$control[i] <- "Yes"
}
}
dataset$infant <- "No"
for (i in 1:length(dataset$sample_type)){
if (dataset$age[i] == 0){
dataset$infant[i] <- "Yes"
}
}
dataset$age_group <- "<1"
for (i in 1:length(dataset$sample_type)){
if (dataset$age[i] == 1| dataset$age[i] == 2| dataset$age[i] == 3 | dataset$age[i] == 4){
dataset$age_group[i] <- "1-4"
}
else if (dataset$age[i] == 5| dataset$age[i] == 6| dataset$age[i] == 7 | dataset$age[i] == 8 | dataset$age[i] == 9){
dataset$age_group[i] <- "5-9"
}
else if (dataset$age[i] == 10 | dataset$age[i] == 11| dataset$age[i] == 12 | dataset$age[i] == 13 | dataset$age[i] == 14){
dataset$age_group[i] <- "10-14"
}
else if (dataset$age[i] == 15| dataset$age[i] >= 15){
dataset$age_group[i] <- "≥15"
}
else if (dataset$age[i] == -1){
dataset$age_group[i] <- NA
}
}
dataset$age_group[is.na(dataset$age_group)] <- "other"
dataset[dataset==-1] <- NA
dataset$sex <- as.character(dataset$sex)
buccal_dataset <- subset(dataset, sample_type=="buccal" | sample_ID == "NegCtrl5")
rectal_dataset <- subset(dataset, sample_type == "rectal" | sample_ID == "NegCtrl13")
genital_dataset <- subset(dataset, sample_type == "genital" | sample_ID == "NegCtrl3")
```
## Handling 16S DNA sequences with DADA2
Dada2 Tutorial for 16S: https://benjjneb.github.io/dada2/tutorial.html
###1. Set path to your folder containing the sequences
```{r path}
# CHANGE ME to the directory containing the fastq files after unzipping.
path <- "~/Documents/Cayo_Microbiome/sequences/unzipped_trimmed"
list.files(path)
```
###2. File names
First, we read in the names of the fastq files, and perform some string manipulation.
Sort ensures forward/reverse files are in same order.
Then we extract sample names from our forward list.
**Here the string manipulation has to be tailored to each different run.**
```{r names}
fnFs <- sort(list.files(path, pattern="_R1_001.fastq")) # forward files
fnRs <- sort(list.files(path, pattern="_R2_001.fastq")) # reverse files
sample.names <- sapply(strsplit(fnFs, "_"), `[`, 1)
sample.names
fnFs <- file.path(path, fnFs)
fnRs <- file.path(path, fnRs)
```
###3. Inspect quality profiles
The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines.
The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).
#### Let's look first at the forward reads
```{r quality forward reads}
plotQualityProfile(fnFs[15:16])
```
##### Now let's look at the reverse reads
```{r quality reverse reads}
plotQualityProfile(fnRs[15:16])
```
###4. Filter and trim
I will truncate the forward reads at position 240 (trimming the last 20 nucleotides).
The reverse reads are of significantly worse quality, especially at the end, which is common in Illumina sequencing. This isn’t too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 200.
**Considerations for your own data** Your reads must still overlap after truncation in order to merge them later! We are using 2x250 V4 sequence data, our trimming has to ensure overlap is happening (your **truncLen** must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them).
#### Place filtered files in filtered/ subdirectory
```{r filtered}
filt_path <- file.path(path, "filtered")
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))
```
#### Filter the reads
This step usually takes the longest and the most part of your CPU power. On Cayo 16S data, it took about 10 minutes.
```{r filter}
help("filterAndTrim")
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,200), trimLeft = c(19,20),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE)
head(out)
```
The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores.
**Considerations for your own data** The filtering parameters are starting points, not set in stone.
If you want to speed up downstream computation, consider tightening maxEE.
If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (eg. maxEE=c(2,5)), and reducing the truncLen to remove low quality tails.
Remember though, when choosing truncLen for paired-end reads you must maintain overlap after truncation in order to merge them later.
###5. Learn the error rates
The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates.
The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution.
As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
```{r error rates}
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
plotErrors(errF, nominalQ=TRUE)
```
* The error rates for each possible transition (A→C, A→G, …) are shown.
* Points are the observed error rates for each consensus quality score.
* The black line shows the estimated error rates after convergence of the machine-learning algorithm.
* The red line shows the error rates expected under the nominal definition of the Q-score.
* Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected.
* If everything looks reasonable, we can proceed with confidence.
**Considerations for your own data** Parameter learning is computationally intensive, so by default the learnErrors function uses only a subset of the data (the first 100M bases).
**If you are working with a large dataset** and the plotted error model does not look like a good fit, you can try increasing the nbases parameter to see if the fit improves.
###6. Dereplication
* Dereplication combines all identical sequencing reads into into “unique sequences” with a corresponding “abundance” equal to the number of reads with that unique sequence.
* Dereplication substantially reduces computation time by eliminating redundant comparisons.
* Dereplication in the DADA2 pipeline has one crucial addition from other pipelines: DADA2 retains a summary of the quality information associated with each unique sequence.
* The consensus quality profile of a unique sequence is the average of the positional qualities from the dereplicated reads.
* These quality profiles inform the error model of the subsequent sample inference step, significantly increasing DADA2’s accuracy.
```{r dereplication}
derepFs <- derepFastq(filtFs, verbose=F)
derepRs <- derepFastq(filtRs, verbose=F)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
```
###7. Sample inference
```{r inference}
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
#Inspection
dadaFs[[1]]
dadaRs[[1]]
```
###8. Merge paired reads
* We now merge the forward and reverse reads together to obtain the full denoised sequences.
* Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences.
* By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region.
```{r merging}
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[250]])
```
* The mergers object is a list of data.frames from each sample.
* Each data.frame contains the merged $sequence, its $abundance, and the indices of the $forward and $reverse sequence variants that were merged.
* Paired reads that did not exactly overlap were removed by mergePairs, further reducing spurious output.
###9. Construct sequence table
* We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.
```{r sequence table}
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))
# Remove non-target-length sequences with base R manipulations of the sequence table
seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(252,254)]
table(nchar(getSequences(seqtab2)))
seqtab<-seqtab2
```
* The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.
**Considerations for your own data** Sequences that are much longer or shorter than expected may be the result of non-specific priming.
###10. Remove chimeras
* The core dada2 method corrects substitution and indel errors, but chimeras remain.
* Fortunately, the accuracy of the sequence variants after denoising makes identifying chimeras simpler than it is when dealing with fuzzy OTUs.
* Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.
```{r chimeras}
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)
sum(seqtab.nochim)/sum(seqtab)
#save seqtab.nochim as R object for later use in other analyses:
saveRDS(seqtab.nochim, "./seqtab.nochim_Cayo16S.rds")
```
* The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.
**Considerations for your own data** Most of your reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though).
* If most of your reads were removed as chimeric, upstream processing may need to be revisited.
* In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.
###11. Track reads through the pipeline
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
```{r track}
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
track
```
Looks good! We kept the majority of our raw reads, and there is no over-large drop associated with any single step.
**Considerations for your own data** This is a great place to do a last sanity check.
Outside of filtering (depending on how stringent you want to be) there should no step in which a majority of reads are lost.
If a majority of reads failed to merge, you may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span your amplicon.
If a majority of reads were removed as chimeric, you may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.
###12. Assign taxonomy
The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose.
The assignTaxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least minBoot bootstrap confidence.
We maintain formatted training fastas for the RDP training set, GreenGenes clustered at 97% identity, and the Silva reference database, and additional trainings fastas suitable for protists and certain specific environments have been contributed.
**Extensions** The dada2 package also implements a method to make species level assignments based on exact matching between ASVs and sequenced reference strains.
Recent analysis suggests that exact matching (or 100% identity) is the only appropriate way to assign species to 16S gene fragments.
Currently, species-assignment training fastas are available for the Silva and RDP 16S databases. The fastas used here can be downloaded from https://doi.org/10.5281/zenodo.1172783
**Download Silva taxonomic training data before proceeding**
(I saved them to a new folder called "tax" in this case)
```{r taxonomy}
taxa <- assignTaxonomy(seqtab.nochim, "./tax/silva_nr_v132_train_set.fa.gz", multithread=TRUE)
taxa <- addSpecies(taxa, "./tax/silva_species_assignment_v132.fa.gz")
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
```
###13. Handoff to Phyloseq package and saving otu table/taxa table for future use
The rownames from your metadata have to match the rownames from your sequencing file.
```{r phyloseq}
write.csv(taxa,"./taxaGF.csv")
```
Continue with markdown files "global analyses" or body site specific markdowns for downstream analyses.