---
title: "DADA2 bioinformatic pipeline for the Mimulus ITS sequence dataset"
author: "Bolívar Aponte Rolón and Mareli Sánchez Juliá"
date: "Last edited: `r format(Sys.time(), '%B %d, %Y')`"
format:
  html:
    toc: true
    toc-location: left
    toc-depth: 2
    number-sections: true
    number-depth: 1
    theme: lumen
    highlight-style: github
    code-overflow: wrap
    code-fold: false
    code-copy: true
    code-link: false
    code-tools: false
    code-block-border-left: "#0C3823"
    code-block-bg: "#eeeeee"
    fig-cap-location: margin
    linestretch: 1.25
    fontsize: "large"
    embed-resources: true
execute:
  echo: true
  keep-md: true
editor:
  markdown:
    wrap: 72
---
```{r setup}
knitr::opts_chunk$set(out.width = '70%', fig.align = 'center', echo = TRUE, collapse = TRUE, eval = FALSE)
```
# DADA2 Pipeline
## Installation
See https://benjjneb.github.io/dada2/dada-installation.html for more
details. The main R script can be found
[here](https://www.bioconductor.org/packages/devel/bioc/vignettes/dada2/inst/doc/dada2-intro.R).
This pipeline follows the DADA2 ITS Pipeline tutorial,
[v1.18](https://benjjneb.github.io/dada2/ITS_workflow.html), as well as Emily
Farrer's
[code](https://github.com/ecfarrer/LAmarshGradient2/blob/master/BioinformaticsITS.R)
on GitHub ("LAmarshGradient2 ITS code").
::: callout-note
**To carry out this pipeline in an HPC cluster see section: [Cypress:
How to submit this as a SLURM job.]**
:::
### Packages required
```{r, Pre-requisites}
#| eval: true
#| echo: false
#| tidy: true
#| warning: false
# Activate commands as needed.
# change the ref argument to get other versions
# DADA2 package and associated
# if (!require("BiocManager", quietly = TRUE)){
# install.packages("BiocManager", repo="http://cran.rstudio.com/")
# }
# BiocManager::install(version = "3.17")
# BiocManager::install(c("dada2","ShortRead", "Biostrings"))
#
# install.packages("usethis")
# install.packages("devtools")
# install.packages("Rcpp")
# devtools::install_github("benjjneb/dada2")
# if (!require("BiocManager", quietly = TRUE)){ #Another way of installing the latest version of dada2
# install.packages("BiocManager")}
#
#The following initializes usage of Bioconductor devel version
#BiocManager::install(version='devel', ask = FALSE) #BiocManager 3.17 (dada2 1.28.0) or developer version (dada2 1.29.0)
#BiocManager::install(c("dada2", "ShortRead", "BioStrings"))
# Loading packages
library("usethis")
library("devtools")
library("Rcpp")
library("dada2")
library("ShortRead")
library("Biostrings") #This will install other packages and dependencies.
packageVersion("dada2") #checking if it is the latest version
packageVersion("ShortRead")
packageVersion("Biostrings")
```
## File preparation: name clean-up
The DADA2 pipeline can work with unzipped ".fastq"
files. It is good practice to ensure that all your file names have the
same number of fields and characters. This ensures that
scripts treat your files equally and that you don't accidentally merge files
(e.g. *R1.fastq* + *R2.fastq* = *R1.fastq*). Note that DNA extraction
controls and PCR controls are named a little differently than field samples
and are treated as such with a slightly different script. For this I
have prepared the following scripts.

**Samples** There is no need to unzip files, since `cutadapt` and `dada2` can handle
zipped files.
```{bash}
#!/bin/bash
for file in *.fastq.gz
do
  echo "Unzipping"
  gzip -d "$file"
done

# Rename the files
for file in *.fastq
do
  newname=$(echo "$file" | cut -d_ -f1,2,5).fastq
  echo "Renaming $file as $newname"
  mv "$file" "$newname"
done

## Script from HPC workshop 2 3/16/2023
## Updated on 6/7/2023 after troubleshooting with ChatGPT.
## Various iterations offered were not exactly what I needed. -BAR
```
**Notes on the name clean-up for Aponte_8756_23110702 run:**
- Repeated samples were concatenated.
- SPLB_L004_1_R1.fastq
- SPLB_L004_2_R1.fastq
- SPLB_L004_1_R2.fastq
- SPLB_L004_2_R2.fastq
**Controls** Again, there is no need to unzip files, since `cutadapt` and `dada2` can handle
zipped files.
```{bash}
#!/bin/bash
for file in *.fastq.gz
do
  echo "Unzipping"
  gzip -d "$file"
done

# Rename the files
for file in *.fastq
do
  newname=$(echo "$file" | cut -d_ -f1,4 | sed 's/-//g').fastq
  echo "Renaming $file as $newname"
  mv "$file" "$newname"
done

## Script from HPC workshop 2 3/16/2023
## Updated on 6/7/2023 after troubleshooting with ChatGPT.
## Various iterations offered were not exactly what I needed. -BAR
```
When working from the macOS **bash** shell or a Linux **console**, make the shell
scripts executable:
```{bash}
chmod +x FILENAME.sh
```
Put them in your PATH by placing them in your \~/bin directory and adding
the following line to \~/.bashrc:
```{bash}
export PATH=$HOME/bin:$PATH
```
This can also be done through the `miniconda3` command-prompt.
These files live in the same directory as the samples and controls,
respectively. They were executed through the MinGW-64 terminal that
RStudio provides, which emulates a UNIX/POSIX environment on Windows. This
is one alternative on Windows; you can also set up a virtual machine
with [Virtual Box](https://www.virtualbox.org/), install a Linux OS
(e.g. Ubuntu), and proceed the same way. On macOS this is not necessary.
## Repository and file paths
Make various directories beforehand to keep your file output organized.
Start with `raw_sequences`, the directory holding the main raw
`fastq` or `fasta` files. The file names have already been cleaned (see the
sections above). Create `filtN` as a sub-directory of `raw_sequences`.
The final directories are `ASV_tables`, `preprocess`, `qc_reports`, and
`taxonomy`. As you work through this tutorial the files will be saved in
their respective directories.
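For example, this layout can be created in one step from the project root (a minimal sketch; the directory names come from the text above):

```{r}
#| eval: false
# Create the directory skeleton used throughout this pipeline
dirs <- c("raw_sequences/filtN", "ASV_tables", "preprocess", "qc_reports", "taxonomy")
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
```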
For this project, I will be working with two separate sequencing runs:
*Aponte_8450_23052601* and *Aponte_8756_23110702*. Both contain samples.
They had to be split due to the large number of samples and how the laboratory
work was conducted. The document is structured around the two run
numbers.
```{r, Paths}
#| eval: true
#| echo: false
#| tidy: true
#Aponte_8450_23052601
path_1 <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics/Aponte_8450_23052601"
out_dir <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics"
list.files(path_1)
fnFs_seq1 <- sort(list.files(path_1, pattern = "R1.fastq.gz", full.names = TRUE))
fnRs_seq1 <- sort(list.files(path_1, pattern = "R2.fastq.gz", full.names = TRUE))
#Aponte_8756_23110702
path_2 <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics/Aponte_8756_23110702"
list.files(path_2)
fnFs_seq2 <- sort(list.files(path_2, pattern = "R1.fastq.gz", full.names = TRUE))
fnRs_seq2 <- sort(list.files(path_2, pattern = "R2.fastq.gz", full.names = TRUE))
```
# Aponte_8450_23052601
Software and package versions used as of 03/FEB/2024:

- FastQC (v0.12.1-4) -> installed through the [BioArchLinux
  repository](https://github.com/BioArchLinux/Packages)
- MultiQC (v1.19-1) -> installed through the Arch Linux User Repository (AUR)
- Cutadapt (v4.6-1) -> installed through the [BioArchLinux
  repository](https://github.com/BioArchLinux/Packages)
### FastQC reports: raw sequences
Inspect your sequence data before jumping in to cut and trim. You want
to get a sense of how the sequencing run went and what you need to do in
downstream processes. [Why do quality
control?](https://www.bioinformatics.babraham.ac.uk/training/Sequence_QC_Course/Sequencing%20Quality%20Control.pdf)
Assuming you have installed FastQC and MultiQC, go to the directory where
your sequence files (.fastq.gz) are and generate reports. This can be done
through the bash command line on Mac and Linux or the Miniconda3
command prompt.
- [FastQC how to:](https://www.youtube.com/watch?v=9_l_hWESuCQ) Mac
and Linux OS
```{bash}
#!/bin/bash
mkdir -p qc_reports # Change to your preferred directory name.

# Concatenate all forward (R1) and reverse (R2) reads
cat in_directory/*_R1.fastq >> qc_reports/all_R1_rawreads.fastq
cat in_directory/*_R2.fastq >> qc_reports/all_R2_rawreads.fastq

# Execute FastQC on the concatenated files
fastqc qc_reports/all_R1_rawreads.fastq -o qc_reports
fastqc qc_reports/all_R2_rawreads.fastq -o qc_reports
# This can take a while. It generates .html reports and associated compressed files.
```
- Repeat as necessary for each sequencing run.

**Windows (miniconda3)** On Windows you can use the FastQC GUI to select the files
you want to create reports for; FastQC is not fully supported through the
command line there. Make sure to use the `miniconda3` command prompt and that
FastQC is installed properly.
```{bash}
fastqc *.fastq.gz
```
How can we interpret these reports? What kind of sequence data do I
have? Does this look OK? See here:

- Galaxy Training!: [Quality
  Control](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html)
- [EDAMAME
  tutorial](https://github.com/edamame-course/FastQC/blob/master/for_review/2016-06-22_FastQC_tutorial.md)
In the Van Bael lab we produce 16S and ITS amplicon sequence data, for
the most part. Amplicon libraries are considered "low diversity" because
most reads share the same sequence, skewing the nucleotide proportions at
each base position. Hence, when the sequencing platform calls nucleotides
it tends to perform poorly in the initial base pairs.
### MultiQC: raw sequences
This command will search for all FastQC reports and create a summary
report of all of them. You can create a shell script and execute it in a
similar way as `fastqc`.
```{bash}
# In the directory where you have the reports, execute:
multiqc .
# To execute in a subdirectory
multiqc directory/
```
### Identifying primers
Each project and organism type will have its own set of primers and
adapters. Take the time to figure out which ones you used and their
proper bases and lengths. You should receive the data from the
sequencing core demultiplexed, since the barcodes (indexes **i5** and
**i7**) are submitted with the order. Here is a list of the ones most
commonly used in the Van Bael Lab.
- [VBL Culture
primers](https://drive.google.com/open?id=0B9v0CdUUCqU5YVhJck1zT1VTZ28&resourcekey=0-1Nyzv3mGzpJLqDvoo-ls9A&usp=drive_fs)
- [VBL NGS
primers](https://drive.google.com/open?id=1bUY7dy_JNlkpvzcXcW1ImQAMSckexeSj&usp=drive_fs)
```{r, ITS1f_adapt_ITS2r_adapt}
#| eval: true
#| echo: false
#| tidy: true
FWD<-"CACTCTTTCCCTACACGACGCTCTTCCGATCTCTTGGTCATTTAGAGGAAGTAA" # 5'- 3' Forward ITS1f_adapt modified with the Illumina TruSeq adaptor
nchar(FWD) #Number of primer nucleotides.
REV<-"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCTGCGTTCTTCATCGATGC" # 5'- 3' Reverse primer ITS2r_adapt modified with the Illumina TruSeq adaptor
nchar(REV)
```
### Verifying the orientation of the primers
```{r, orientation_primers}
#| eval: true
#| echo: false
#| tidy: true
allOrients <- function(primer) { # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer) # Biostrings works with DNAString objects rather than character vectors
  orients <- c(Forward = dna,
               Complement = Biostrings::complement(dna),
               Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
#FWD2 <- FWD.orients[["Complement"]] #Use if you suspect an orientation mix-up.
#REV2 <- REV.orients[["Complement"]]
```
### Filter and Trim Reads for ambiguous bases (N)
We filter sequences for the presence of ambiguous bases (N) before
cutting primers off. This pre-filtering step improves mapping of short
primer sequences. No other filtering is performed.
```{r, filtering_ambiguous_bases_N}
#| eval: true
#| echo: false
#| tidy: true
fnFs_seq1_filtN <- file.path(path_1, "filtN", basename(fnFs_seq1)) # Put N-filtered forward read files in filtN/ subdirectory
fnRs_seq1_filtN <- file.path(path_1, "filtN", basename(fnRs_seq1)) #Reverse reads
# filterAndTrim(fnFs_seq1, fnFs_seq1_filtN,
# fnRs_seq1, fnRs_seq1_filtN,
# maxN = 0, multithread = TRUE)
```
### Checking for primer hits
Have primers been removed?
```{r, primer_hits}
#| eval: true
#| echo: false
#| tidy: true
set.seed(123)
#Once this has been completed there is no need to run again when working on the script
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs_seq1_filtN[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs_seq1_filtN[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs_seq1_filtN[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs_seq1_filtN[[1]]))
```
We see the reverse complement of the primers present in the FWD.ReverseReads
and the REV.ForwardReads.
### Cutadapt: removal of primers
In previous runs of the pipeline we have encountered a
`Warning: Zero-length sequences detected during dereplication`, or the
function `plotQualityProfile` failing to plot properly. These
problems are due to zero-length sequences. Here we use the cutadapt
tool to discard reads shorter than 20 bp. In later steps, we discard those
shorter than 50 bp.
```{r, cutadapt}
#| eval: true
#| echo: false
#| tidy: true
#Once this has been completed there is no need to run again when working on the script
cutadapt <- "/usr/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R
```
```{r, path_parameters_cuttting}
#| eval: true
#| echo: false
#| tidy: true
path.cut_1 <- file.path(path_1, "cutadapt") #Remember where this "out" directory path leads to.
print(path.cut_1) #Checking if the path is correct.
if(!dir.exists(path.cut_1)) dir.create(path.cut_1)
fnFs_seq1_cut <- file.path(path.cut_1, basename(fnFs_seq1))
fnRs_seq1_cut <- file.path(path.cut_1, basename(fnRs_seq1))
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
```
```{r, Running_cutadapt}
#| eval: true
#| echo: false
#| tidy: true
# Run Cutadapt
#Once this has been completed there is no need to run again when working on the script.
#
# for(i in seq_along(fnFs_seq1)) {
# system2(cutadapt, args = c(R1.flags, R2.flags,
# "-n", 2, # -n 2 removes FWD and REV from reads
# "-m", 20, # -m 20 removes reads shorter than 20 bp
# "-o", fnFs_seq1_cut[i],
# "-p", fnRs_seq1_cut[i], # output files
# fnFs_seq1_filtN[i],
# fnRs_seq1_filtN[i])) #input files
# }
# String for changing file extensions (i.e. .fastq > .fa); this can be
# added in the loop above if desired. (str_replace is from the stringr package.)
# new_extension <- ".fa"
# for (i in seq_along(fnFs_seq1)) {
#   output_file1 <- str_replace(fnFs_seq1_cut[i], "\\.fastq", new_extension)
#   output_file2 <- str_replace(fnRs_seq1_cut[i], "\\.fastq", new_extension)
# }
```
### Re-inspecting whether all primers were removed
```{r, Re-inspect_primer_presence}
#| eval: true
#| echo: false
#| tidy: true
#Once this has been completed there is no need to run again when working on the script
#
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs_seq1_cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs_seq1_cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs_seq1_cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs_seq1_cut[[1]]))
# Forward and reverse fastq filenames have the format:
cutFs_seq1 <- sort(list.files(path.cut_1, pattern = "R1.fastq.gz", full.names = TRUE))
cutRs_seq1 <- sort(list.files(path.cut_1, pattern = "R2.fastq.gz", full.names = TRUE))
#allcutF <- sort(list.files(all.cut, pattern = "R1.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format:
get.sample.name <- function(fname) strsplit(basename(fname), "_R")[[1]][1] # The string in quotes needs to be updated according to your naming convention. If you have multiple underscores in the name, split on the underscore next to the "R", like above, or on any other unique identifier in the character string.
sample.namesF <- unname(sapply(cutFs_seq1, get.sample.name))
sample.namesR <- unname(sapply(cutRs_seq1, get.sample.name))
head(sample.namesF)
head(sample.namesR)
```
**All primers were successfully removed.**
### Inspect the read quality
The `dada2` package provides a way to visualize this with the
`plotQualityProfile()` function. This will plot the quality scores of
reads per sample. You can also create a concatenated file of all the
forward or reverse reads to evaluate them in one plot, although plotting a
concatenated file may take a while or may fail. Another way
of inspecting the quality of the reads is using
[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/),
which has both a GUI option and a command-line approach. Both ways are good.
The FastQC approach works better with concatenated files and outputs a
lot more information. This step is meant to inform you of the quality of the
reads so you can decide how to cut, trim, and truncate your samples. [FastQC in a
Linux environment](https://www.youtube.com/watch?v=5nth7o_-f0Q)
```{r}
#| eval: true
#| echo: false
plotQualityProfile(cutFs_seq1[6:2])
plotQualityProfile(cutRs_seq1[6:2])
```
The quality of the reads improves after removal of N calls and adapter
sequences. The `plotQualityProfile` function only plots a subset of the
reads at a time, which is less useful when you have a large number of
samples to inspect. We will create new FastQC and MultiQC reports to
inspect the quality of the reads and establish which parameters to use
for filtering, trimming, and truncation.
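As an alternative to concatenating files, `plotQualityProfile()` can also aggregate all samples into a single summary panel (this can still be slow for many large files):

```{r}
#| eval: false
# Aggregate quality profiles across all primer-trimmed samples into one panel
plotQualityProfile(cutFs_seq1, aggregate = TRUE)
plotQualityProfile(cutRs_seq1, aggregate = TRUE)
```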
### FastQC reports: mismatches (Ns) filtered and cut
See the previous [FastQC reports: raw sequences] section. The output should
show an improvement in the quality of the reads.
### MultiQC reports: mismatches (Ns) filtered and cut
See the previous [MultiQC: raw sequences] section. The output should include
all cut and trimmed FastQC reports. It will inform the following steps:
trimming and truncating. A sketch of both commands is given below.
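A minimal sketch of re-running both tools on the primer-trimmed reads, assuming `fastqc` and `multiqc` are on your PATH and a `qc_reports` directory exists (run via `system2()`, mirroring how this document calls cutadapt):

```{r}
#| eval: false
# Re-run FastQC on the cutadapt output, then summarize with MultiQC
system2("fastqc", args = c(fnFs_seq1_cut, fnRs_seq1_cut, "-o", "qc_reports"))
system2("multiqc", args = c("qc_reports/", "-o", "qc_reports"))
```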
<!-- ### FIGARO: tool for deciding what parameters to use for filtering, trimming and truncation -->
<!-- This step can be performed on the raw reads as well. Here we focus on -->
<!-- the cut, filtered and trimmed reads. -->
<!-- ```{python} -->
<!-- # from figaro import figaro -->
<!-- # resultTable, forwardCurve, reverseCurve = figaro.runAnalysis( -->
<!-- # sequenceFolder = path.cut, -->
<!-- # ampliconLength = 500, #Maximum expected size of the amplicon -->
<!-- # forwardPrimerLength= 54, -->
<!-- # reversePrimerLength = 54, -->
<!-- # minimumOverlap = 20, -->
<!-- # fileNamingStandard, -->
<!-- # trimParameterDownsample, -->
<!-- # trimParameterPercentile) -->
<!-- ``` -->
## Filter and trim
Results from MultiQC and FIGARO inform parameter selection.
For Aponte_8450_23052601 we see that the forward reads have decent
quality up to 200-230 bp.
<!-- The reverse reads are of lower quality. We will use a `truncLen` of 210 and 200 for forward and reverse reads, respectively. -->
With further thought, we believe that the best parameters for the
truncation of the reads involve the use of `maxEE`. This serves as
a primary quality filter as opposed to a size-selection filter; see this
[post](https://github.com/benjjneb/dada2/issues/232) and [Edgar
and Flyvbjerg 2015](https://doi.org/10.1093/bioinformatics/btv401). We
will use a `maxEE` and `truncQ` of 2 for both forward and reverse reads.
This is a little relaxed, but we are working with poor quality reads. We
will also use a `minLen` of 50 bp and a `maxN` of 0.
The previous parameters were a quality of 2 (a Phred score of 2) in
arguments `truncQ` and `minQ`, which translates to a read length of about
215 bp, which is decent. We are being a little stringent here since the
quality of these libraries has been poor since the PCR stages. The
average ITS amplicon size for this library pool was 385 bp (see the Duke
report and MultiQC report in the repository). Around 230 bp is where the
average read drops below a Phred score of \~25.
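To make the `maxEE` reasoning concrete: the expected number of errors in a read is the sum of its per-base error probabilities, EE = sum(10^(-Q/10)) over the Phred scores Q, and `maxEE = 2` discards reads whose expected errors exceed 2 after truncation (Edgar and Flyvbjerg 2015). A small sketch with a hypothetical quality vector:

```{r}
#| eval: false
# Expected errors (EE) for one read from its Phred scores: EE = sum(10^(-Q/10))
Q <- c(38, 35, 30, 20, 12) # hypothetical per-base quality scores
sum(10^(-Q / 10)) # filterAndTrim() drops the read if this exceeds maxEE
```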
```{r, assign_file_names}
#| eval: true
#| echo: false
#| tidy: true
filtFs_seq1 <- file.path(out_dir, "filtered/filt_8450", basename(cutFs_seq1))
filtRs_seq1 <- file.path(out_dir, "filtered/filt_8450", basename(cutRs_seq1))
names(filtFs_seq1) <- sample.namesF
names(filtRs_seq1) <- sample.namesR
head(filtFs_seq1, 10)
head(filtRs_seq1, 10)
```
```{r, truncating}
#| eval: true
#| echo: false
#| tidy: true
# Truncating Forward and Reverse reads to ~230 bp based on truncQ = 2.
# maxEE = 2 is somewhat relaxed due to the overall low quality of reads
out_seq1 <- filterAndTrim(cutFs_seq1, filtFs_seq1,
cutRs_seq1, filtRs_seq1,
#truncLen=c(210,200), #Truncate reads after truncLen bases. Reads shorter than this are discarded.
truncQ = 2,
#minQ = 2,
maxN = 0,
maxEE = c(2,2),
minLen = 50, #Not necessary when using truncLen
rm.phix = TRUE,
compress = TRUE,
multithread = TRUE) # minLen: Remove reads with length less than minLen. minLen is enforced after trimming and truncation. #enforce min length of 50 bp
#Once it is completed there is no need to run again unless you are changing parameters.
#Saving file
#saveRDS(out_seq1, file.path(out_dir, "preprocess/pre_8450/out_8450.rds"))
#Loading file
out_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/out_8450.rds"))
```
### Dereplication
Dereplication combines all identical reads into one unique sequence with
a corresponding abundance equal to the number of reads with that unique
sequence. It is done because it reduces computation time by eliminating
redundancy -- From DADA2 tutorial, v1.8
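A toy illustration of the idea (not part of the pipeline): identical reads collapse into unique sequences with abundances, which is what `derepFastq()` does, plus bookkeeping of the quality profiles:

```{r}
#| eval: false
# Five reads collapse to two unique sequences with abundances 3 and 2
reads <- c("ACGT", "ACGT", "ACGT", "TTGA", "TTGA")
table(reads)
```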
```{r, dereplication}
#| eval: true
#| echo: false
#| tidy: true
derepFs_seq1 <- derepFastq(filtFs_seq1, n = 1000, verbose = TRUE) #n prevents it from reading more than 1000 reads at the same time. This controls the peak memory requirement so that large fastq files are supported.
derepRs_seq1 <- derepFastq(filtRs_seq1, n = 1000, verbose = TRUE)
#Save file
#saveRDS(derepFs_seq1, file.path(out_dir, "preprocess/pre_8450/derepFs_8450.rds"))
#saveRDS(derepRs_seq1, file.path(out_dir, "preprocess/pre_8450/derepRs_8450.rds"))
#Load file
derepFs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/derepFs_8450.rds"))
derepRs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/derepRs_8450.rds"))
#name the dereplicated reads by the sample names
names(derepFs_seq1) <- sample.namesF
names(derepRs_seq1) <- sample.namesR
```
### Learn error rates from dereplicated reads
```{r, error_rates}
#| eval: true
#| echo: false
#| tidy: true
set.seed(123)
errF_seq1 <- learnErrors(derepFs_seq1, randomize = TRUE, multithread = TRUE) #multithread is set to FALSE in Windows. Unix OS is =TRUE.
errR_seq1 <- learnErrors(derepRs_seq1, randomize = TRUE, multithread = TRUE)
# Save file
#saveRDS(errF_seq1, file.path(out_dir, "preprocess/pre_8450/errF_8450.rds"))
#saveRDS(errR_seq1, file.path(out_dir, "preprocess/pre_8450/errR_8450.rds"))
# Load file
#errF <- readRDS(file.path(out_dir, "preprocess/pre_8450/errF_8450.rds"))
#errR <- readRDS(file.path(out_dir, "preprocess/pre_8450/errR_8450.rds"))
# Plot errors
plotErrors(errF_seq1, nominalQ = TRUE)
plotErrors(errR_seq1, nominalQ = TRUE)
```
### Sample Inference
```{r, inference1}
#| eval: true
#| echo: false
#| tidy: true
dadaFs_seq1 <- dada(derepFs_seq1, err = errF_seq1, multithread = TRUE)
dadaRs_seq1 <- dada(derepRs_seq1, err = errR_seq1, multithread = TRUE)
# Save file
#saveRDS(dadaFs_seq1, file.path(out_dir, "preprocess/pre_8450/dadaFs_8450.rds"))
#saveRDS(dadaRs_seq1, file.path(out_dir, "preprocess/pre_8450/dadaRs_8450.rds"))
# Load file
#dadaFs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/dadaFs_8450.rds"))
#dadaRs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/dadaRs_8450.rds"))
```
### Merge paired reads
```{r, merge}
#| eval: true
#| echo: false
#| tidy: true
mergers_seq1 <- mergePairs(dadaFs_seq1, filtFs_seq1,
dadaRs_seq1, filtRs_seq1,
minOverlap = 20,
maxMismatch = 0,
verbose = TRUE)
# Save file
#saveRDS(mergers_seq1, file.path(out_dir, "preprocess/pre_8450/mergers_8450.rds"))
# Load file
mergers_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/mergers_8450.rds"))
# Inspect the merger data.frame from the first sample
head(mergers_seq1[[1]])
```
### Construct ASV Sequence Table
We can now construct an amplicon sequence variant (ASV) table, a
higher-resolution version of the OTU table produced by traditional
methods.
```{r, ASV}
#| eval: true
#| echo: false
#| tidy: true
seqtab_seq1 <- makeSequenceTable(mergers_seq1)
dim(seqtab_seq1)
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab_seq1)))
#Save file as R object and .csv
#saveRDS(seqtab_seq1, file.path(out_dir, "clean_data/ASV_tables/01-ASV_8450_seqtable_raw.rds")) #Functions to write a single R object to a file, and to restore it.
#write.csv(seqtab_seq1, file.path(out_dir, "clean_data/ASV_tables/01-ASV_8450_seqtable_raw.csv"))
#Open from here in case R crashes
seqtab_seq1 <- readRDS(file.path(out_dir, "clean_data/ASV_tables/01-ASV_8450_seqtable_raw.rds"))
```
### Remove chimeras
The core `dada` method corrects substitution and indel errors, but
chimeras remain.
```{r, chimeras}
#| eval: true
#| echo: false
#| tidy: true
seqtab_seq1_nochim <- removeBimeraDenovo(seqtab_seq1,
method = "consensus",
multithread = FALSE,
verbose = TRUE) #Multithread = FALSE in Windows
#Fraction of reads retained after chimera removal
sum(seqtab_seq1_nochim)/sum(seqtab_seq1)
#Save file
#saveRDS(seqtab_seq1_nochim, file.path(out_dir, "clean_data/ASV_tables/02-ASV_8450_seqtable_nochim_denoise.rds"))
#write.csv(seqtab_seq1_nochim, file.path(out_dir, "clean_data/ASV_tables/02-ASV_8450_seqtable_nochim_denoise.csv")) # Long file name, but it indicates this file has gone through all the steps in the pipeline.
seqtab_seq1_nochim <- readRDS(file.path(out_dir, "clean_data/ASV_tables/02-ASV_8450_seqtable_nochim_denoise.rds"))
```
Inspect distribution of sequence lengths
```{r, eval=TRUE}
table(nchar(getSequences(seqtab_seq1_nochim)))
```
### Track reads through the pipeline
We now inspect the number of reads that made it through each step in
the pipeline to verify everything worked as expected.
```{r, pipeline_tracking}
#| eval: true
#| echo: false
#| tidy: true
getN_seq1 <- function(x) sum(getUniques(x))
track_seq1 <- cbind(out_seq1,
sapply(dadaFs_seq1, getN_seq1),
sapply(dadaRs_seq1, getN_seq1),
sapply(mergers_seq1, getN_seq1),
rowSums(seqtab_seq1_nochim))
# If processing a single sample, remove the sapply calls: e.g. replace
# sapply(dadaFs, getN) with getN(dadaFs)
colnames(track_seq1) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track_seq1) <- sample.namesF
head(track_seq1)
# Save file
#saveRDS(track_seq1, file.path(out_dir, "preprocess/pre_8450/track_8450.rds"))
# Load file
#track <- readRDS(file.path(out_dir, "preprocess/pre_8450/track_8450.rds"))
```
#### Tracking summary
It doesn't look good. We lose a majority of the reads in the filter-and-trim
step; this is a large drop-off in reads. The current parameters,
`truncQ = 2` and `maxEE = 2`, are more stringent than the
`truncLen = c(230, 180)` used in the June 2023 preliminary analyses.
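A quick way to see where the drop-off happens (a sketch, assuming `track_seq1` from the chunk above) is to express each step as a fraction of the input reads:

```{r}
#| eval: false
# Fraction of input reads surviving each step, per sample
round(sweep(track_seq1, 1, track_seq1[, "input"], "/"), 2)
```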
::: callout-warning
<font size="4"> [**Note from the DADA2 ITS
tutorial**](https://benjjneb.github.io/dada2/ITS_workflow.html):
**Considerations for your own data:** This is a great place to do a last
sanity check. Outside of filtering (depending on how stringent you want
to be) there should be no step in which a majority of reads are lost. If a
majority of reads were removed as chimeric, you may need to revisit the
removal of primers, as the ambiguous nucleotides in un-removed primers
interfere with chimera identification. If a majority of reads failed to
merge, the culprit could also be un-removed primers, but could also be
due to biological length variation in the sequenced ITS region that
sometimes extends beyond the total read length, resulting in no overlap.
</font>
:::
## Taxonomy assignment
Congratulations! You've made it to a checkpoint in the pipeline. If you
haven't saved the ASV tables, especially after removing chimeras, go
and do that now. This section can take a big toll on your local machine. It
is best to perform it on the Van Bael Lab Mac or the HPC cluster Cypress.
Download the latest "full" [UNITE
release](https://unite.ut.ee/repository.php). This will serve as your
reference for assigning taxonomy. Use the appropriate database to
assign taxonomy to your data or project!
```{r, taxonomy}
#| eval: true
#| echo: false
#| tidy: true
unite.ref <- file.path(out_dir, "clean_data/taxonomy/sh_general_release_dynamic_s_25.07.2023.fasta") # CHANGE ME to location on your machine
taxa_seq1 <- assignTaxonomy(seqtab_seq1_nochim,
unite.ref,
multithread = TRUE,
tryRC = TRUE) #Multithread = FALSE in Windows. TRUE in Mac/Linux.
#Loading from the files saved. In case it crashes, we start from here.
#seqtab.nochim2 <- readLines(file.path(out_dir, "output.txt"))
#seqtab.nochim2 <- read.csv(file.path(out_dir, "clean_data/ASV_tables", "/ASV_nochim_denoise_filt.csv"))
#seqtab.matrix <- as.matrix(seqtab.nochim2) #assignTaxonomy needs a vector matrix
## unqs <- lapply(fn, getUniques)
# seqtab <- makeSequenceTable(unqs)
# dim(seqtab)
```
Inspecting the taxonomic assignments:
```{r, taxa_inspection}
#| eval: true
#| echo: false
#| tidy: true
taxa.print_seq1 <- taxa_seq1 # Removing sequence rownames for display only
rownames(taxa.print_seq1) <- NULL
head(taxa.print_seq1)
# Save file
#write.csv(taxa.print_seq1, file.path(out_dir, "clean_data/taxonomy/01-assigned_tax_8450.csv"))
```
Done!
You have successfully taken the sequences through the DADA2 pipeline.
You can do the same for your samples. This was a small number of samples,
so your local machine can handle it. When you obtain your sequence
files from the sequencing core it is about 25 GB of data, so it is best to
work on it through the Cypress HPC cluster. Let's move on to that.
# Aponte_8756_23110702
Remember to change the file paths to your files and directories, as
well as to where `cutadapt` is installed on your local machine or HPC
cluster. Use the appropriate database for assigning taxonomy to your
data or project! Notice that `multithread = TRUE` here to take advantage
of the HPC's capacity to run jobs in parallel.
Due to the low quality of the Aponte_8756_23110702 sequences, we will
attempt to merge sequences first and then filter out chimeras. This is a
different approach from the Aponte_8450_23052601 sequences, but it is an
attempt to use these sequences in downstream analyses. We made
several attempts at relaxing the `maxEE` parameters to 5 and 7 for
forward and reverse reads, respectively, but only two samples actually
passed the filter. We will again use a `maxEE` and `truncQ` of 2 for both
forward and reverse reads. This approach goes like this: Derep \>
Merge \> Filter and trim \> Dereplicate \> Learn error rates \> Sample
inference \> Construct ASV Sequence Table \> Remove chimeras \> Track
reads through the pipeline \> Taxonomy assignment.
```{r}
#DADA2 pipeline
#Modified by Bolívar Aponte Rolón for bioinformatic analyses of ITS amplicon sequences
#14/june/2023
# Loading packages
# Activate commands as needed.
# DADA2 package and associated
library("usethis") #Doesn't install on the HPC cluster for some reason.
library("devtools") #Doesn't install on the HPC cluster for some reason.
library("Rcpp")
library("dada2")
library("ShortRead")
library("Biostrings") #This will install other packages and dependencies.
### File Paths
#Aponte_8756_23110702
path_2 <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics/Aponte_8756_23110702"
out_dir <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics"
list.files(path_2)
fnFs_seq2 <- sort(list.files(path_2, pattern = "R1.fastq.gz", full.names = TRUE))
fnRs_seq2 <- sort(list.files(path_2, pattern = "R2.fastq.gz", full.names = TRUE))
# Identifying primers
#CHANGE primers accordingly
FWD<-"CACTCTTTCCCTACACGACGCTCTTCCGATCTCTTGGTCATTTAGAGGAAGTAA"# Forward ITS1f_adapt from IDT
nchar(FWD) #Number of primer nucleotides.
REV<-"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCTGCGTTCTTCATCGATGC"# Reverse primer ITS2r_adapt from IDT
nchar(REV)
### Verifying the orientation of the primers
allOrients <- function(primer) { # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer) # Biostrings works with DNAString objects rather than character vectors
  orients <- c(Forward = dna,
               Complement = Biostrings::complement(dna),
               Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
#FWD2 <- FWD.orients[["Complement"]] #Use if you suspect an orientation mix-up.
#REV2 <- REV.orients[["Complement"]]
```
Contrary to the Aponte_8450 sequences, the Aponte_8756 sequencing run
has a lot of N calls in the first 20-30 bp. We will trim out these first
bases and then filter out any reads with N calls. This is important
because `dada` does not support N calls. This will potentially remove
some of the primers present; we will remove the rest in subsequent steps.
```{r}
### Filter Reads for ambiguous bases (N)
fnFs_seq2_filtN <- file.path(path_2, "filtN", basename(fnFs_seq2)) # Put N-filtered forward read files in filtN/ subdirectory
fnRs_seq2_filtN <- file.path(path_2, "filtN", basename(fnRs_seq2)) #Reverse reads
# filterAndTrim(fnFs_seq2, fnFs_seq2_filtN,
# fnRs_seq2, fnRs_seq2_filtN,
# trimLeft = c(20,20), #Trim 20 bases from the 5' end of each read
# maxN = 0, multithread = TRUE) #multithread = TRUE on Mac OS, FALSE in Windows
```
```{r}
### Checking for primer hits
set.seed(123)
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs_seq2_filtN[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs_seq2_filtN[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs_seq2_filtN[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs_seq2_filtN[[1]]))
### Cutadapt: removal of primers
#Once this has been completed there is no need to run again when working on the script
cutadapt <- "/usr/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R