11-SqueezeMeta.Rmd

# Illumina Metatranscriptomics (SqueezeMeta)

## What is SqueezeMeta?

SqueezeMeta is a fully automatic pipeline for metagenomics/metatranscriptomics. It includes multi-metagenome support that enables co-assembly of related metagenomes and retrieval of individual genomes[1].

## SqueezeMeta Workflow
![workflow](./Figures/SqueezeMeta_workflow.png){width=100%}
Figure 1: Workflow of the three modes of operation of SqueezeMeta: sequential, co-assembly and merged. Starting from metagenomic samples, green, blue and red arrows indicate main steps in sequential, merged and co-assembly modes. All modes create two of the three main results tables: open reading frame (ORF) and contig tables. Co-assembly and merged modes also apply binning and, therefore, they also create the bin table.

## Tutorial Introduction
Over half of adults experience gingivitis, a mild yet treatable form of periodontal disease caused by the overgrowth
of oral microbes. Left untreated, gingivitis can progress to a more severe and irreversible disease, most
commonly chronic periodontitis. While periodontal diseases are associated with a shift in the oral microbiota
composition, it remains unclear how this shift impacts microbiota function early in disease progression. Metatranscriptome
sequencing analysis indicates that during the early stages of transition to gingivitis, a number of
virulence-related transcripts were significantly differentially expressed in individual and across pooled patient
samples[2].

In this tutorial we will analyze metatranscriptomic data from this study to identify gene function that may be associated
with disease progressions. The publication for this data is located here https://www.ncbi.nlm.nih.gov/pmc/articles/18.pdf

![teeth](./Figures/gingivitis.png){width=100%}
Figure 2: Study design and visualization of the progression from health to periodontal disease. On the left, the covered or uncovered teeth depict the study design utilized, in which an acrylic stent (shown in blue) was worn to cover either the entire top or bottom set of teeth during brushing throughout the course of the experiment, 2 weeks. The images on the right illustrate the clinical symptoms associated with gingivitis that were scored by trained dental professionals in this study. An modified gingival index (MGI) score of 0 represents a healthy tooth with no indication of inflammation (shown by increasing redness at the gum) or plaque (tan color on tooth). Healthy periodontia progress through various degrees of gingivitis as depicted by the MGI 1, MGI 2, and MGI 3 panels and can eventually progress to the destructive gum disease, chronic periodontitis, shown on the far right.

Many of the steps have been completed beforehand. However, instructions and code are provided so that you
can build the database and format the metadata that we will use with SqueezeMeta.

## Conda Environment

```{bash, eval=FALSE}
# Do not do for this tutorial

conda create -n squeezemeta -c bioconda -c fpusan squeezemeta sra-tools
```

## Retrieve Databases

SqueezeMeta provides a couple scripts that will allow you to download the NCBI taxonomy tree and RefSeq
reference genomes.

```{bash, eval=FALSE}
# Make a new directory and enter it.

mkdir squeezemeta
cd squeezemeta

# Let's start a screen:
screen -S squeezemeta

# Download from source and format the latest version of the annotation databases for use with SqueezeMeta.
# Do not do this step for this tutorial (It took ~14 hours).

make_databases.pl /home/$USER/squeezemeta/database
```

## Data Download

• This was done already but we will practice downloading data from SRA. Download the metadata from SRA

• Copy the following link into a web browser https://www.ncbi.nlm.nih.gov/bioproject/PRJNA387475

• Select the RNA Seq data with the following SRA IDs and send to the Run selector.

SRR5787581  
SRR5787582  
SRR5787583  
SRR5787584  
SRR5787585  
SRR5787586

• Export the metadata to your computer and copy to the server.
Or copy it from here: /home/data/metagenomics-2310/squeezemeta/SraRunTable.txt

```{bash, eval=FALSE}
# Copy accessions list:
cp /home/data/metagenomics-2310/squeezemeta/sra_ids.txt .

# Activate sra-tools environment:
conda activate sra-tools

# Run while creating metadata:
while read line; do fasterq-dump -e 3 -O ./fastq/ ${line}; done < sra_ids.txt

# While that runs, from the file downloaded, parse the metadata to extract the SRR IDs and sample names. This will be need to create sample_file.txt
awk -F',' '{if ($0 !~"^Run") print $1,$27}' SraRunTable.txt | sort -t_ -k2 -k1 > sample_groups.txt

# Create the sample_file.txt file from the sample_groups.txt file
awk -F' ' '{print $2"\t"$1"_pass.fastq.gz""\t""pair1"}' \
sample_groups.txt > sample_file.txt

# Create a file with just the SRR accession IDs
awk '{print $1}' sample_groups.txt > sra_ids.txt
```

• Run SqueezeMeta. The top command runs SqueezeMeta using an assembly based method for metatranscriptomics. Unfortunately,
the assembly based method my not be appropriate for this dataset. After assembly, less than 50% of the
reads aligned to the assembly. The bottom command uses the sqm_read.pl script to perform a read based method for
metatranscriptomics and may be more suited for this dataset.

```{bash, eval=FALSE}

# Deactivate sra-tools, activate squeezemeta environment
conda deactivate
source activate squeezemeta

# Create a symbolic link to fastq files if download didn't have time to finish
cd /home/$USER/squeezemeta/fastq
ln -s /home/elavelle/squeezemeta/fastq/new/*gz ./

# This is an assembly based method of metatranscriptomics (Finished running in ~4.25 hours)
cd /home/$USER/squeezemeta
SqueezeMeta.pl -m coassembly -p gingivitis_assembly_based -s sample_file.txt \
-f /home/$USER/squeezemeta/fastq -a megahit -map minimap2-sr

# This is read based method of metatranscriptomics (Finished running in ~48 hours)
sqm_reads.pl -p gingivitis_read_based -s sample_file.txt \
-f /home/$USER/squeezemeta/fastq \
--memory 0.1 --num-cpu-threads 12 --euk

# Softlink to pre-run output directory
ln -s /home/elavelle/squeezemeta/gingivitis_2 /home/$USER/squeezemeta/


# Deactivate the environment and launch R
conda deactivate
R
```

## Analysis of SqueezeMeta

• Load SQMtools library and analyze SqueezeMeta data.

• SQMtools manuals https://github.com/jtamames/SqueezeMeta/blob/master/SQMtools_0.5.0.pdf

• SQMtools and R tutorial
https://github.com/jtamames/SqueezeMeta/wiki/Using-R-to-analyze-your-SQM-results

```{R, eval=FALSE}
# Load SQMtools library
library('SQMtools')
# set working directory to directory containing gingivitis_2 folder

setwd('/home/$USER/squeezemeta/')
# Load SqueezeMeta data into R
gingivitis = loadSQM('gingivitis_2')
```
![R_structure](./Figures/R_structure.jpg){width=100%}
```{R, eval=FALSE}
# Access Matrix of genera and their abundances
genus_tax = gingivitis$taxa$genus$abund

# Look at the first 6 genera for all 6 samples
genus_tax[1:6, 1:6]

# Access Matrix of KEGG ids and their abundances
KEGG_table = gingivitis$functions$KEGG$abund

# Look at the first 6 KEGG ids for all 6 samples
KEGG_table[1:6, 1:6]

# What Genera are there?
# Plot the top 15 (N) taxonomies at the genus level
# Replace percent with abund to see raw abundance instead of percentages
# Save plots to file with ggplot2
library('ggplot2')
pdf('genus_abund.pdf', width=10, height=8)
plotTaxonomy(gingivitis, rank='genus', count='percent', N = 15)
dev.off()

# Now, we know something about the microbial diversity in the samples,
# but what are they doing? We should check the functional
# profile(KEGG, COG and PFAM annotations) of the samples:
pdf('KEGG_tpm.pdf', width=14, height=12)
plotFunctions(gingivitis, fun_level ="KEGG", count ="tpm", N = 20)
dev.off()

# Save the count table for later use
write.table(gingivitis$functions$KEGG$abund,"abund.csv",quote=FALSE,sep=",")

# Which functions are more abundant in the project? We use the DESeq2 package to
# analyze which functions are significantly more abundant in
# Load DESeq2 package
library('DESeq2')

# Create the required matrix with the raw abundances, colData (metadata file)
metadata = as.data.frame(c(rep('MGI0',3), rep('MGI2',3)))

# Read in the .csv:
abund <- as.matrix(read.table("abund.csv", header=TRUE, row.names=1, sep=","))

rownames(metadata) = colnames(abund)
colnames(metadata)='condition'

# Verify sample order
all(rownames(metadata)==colnames(abund))

# Convert your data to the format required by DESeq2
dds = DESeqDataSetFromMatrix(countData = abund, colData = metadata,
design = ~ condition)

# Remove low abundant KEGGs:
keep = rowSums(counts(dds)) >= 10
dds = dds[keep,]

# Choose factor levels
dds$condition=factor(dds$condition, levels=c('MGI2','MGI0'))

# Run DESeq2
dds2=DESeq(dds)
results=results(dds2, name='condition_MGI0_vs_MGI2')

# Make and save MA plot
pdf("MA_plot.pdf")
plotMA(results, ylim=c(-2,2))
dev.off()

# Transform data
rld <- rlogTransformation(dds2)

# Create Heatmap based on KEGG expression
sampleDists <- as.matrix(dist(t(assay(rld))))
library(gplots)
pdf("KEGG-heatmap-samples.pdf", w=50, h=50, pointsize=50)
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
col=colorpanel(100, "blue", "white"),
margin=c(30, 30), main="Sample Distance Matrix")
dev.off()

# Principal components analysis
# Can do with built-in DESeq2 function:
pdf("KEGG_pca.pdf")
plotPCA(rld, intgroup=c("condition"))
dev.off()

# Get differential expression results for MGI0 vs MGI2
MGI2_from_MGI0 <- results(dds2, contrast=c("condition", "MGI0", "MGI2"))

# Order by adjusted p-value
MGI2_from_MGI0 <- MGI2_from_MGI0[order(MGI2_from_MGI0$padj), ]

# Merge with normalized count data
resdata <- merge(as.data.frame(MGI2_from_MGI0), as.data.frame(counts(dds2, normalized=TRUE)),
by="row.names", sort=FALSE)
names(resdata)[1] <- "KEGG"
head(resdata, n = 15)

# Write results
write.csv(resdata, file="MGI2_from_MGI0_DE.csv")

# Switch back to squeezemeta's R:
q()
source activate squeezemeta
R

# Now we can subset the SqueezeMeta results and create heatmaps for the most significant KEGGs
top15_sigs = subsetFun(gingivitis, fun = 'K09942|K01775|K11607|K15022|K02647|K05524|K00176|K01267|K03469|K07095|K08987|K11604|K17472|K11645|K16950', fixed = F)
pdf("top_15_sig.pdf")
plotFunctions(top15_sigs, fun_level = "KEGG", count = 'tpm')
dev.off()

# This will throw an error, we'll take a look at my .pdf afterwards.

# To calculate the log-fold change of a KEGG, you can provide a list of vectors.
# Create vectors with sample names by condition (cond):
MGI0.samples = gingivitis$misc$samples[grep('MGI0', gingivitis$misc$samples)]

MGI2.samples = gingivitis$misc$samples[grep('MGI2', gingivitis$misc$samples)]
cond = list(MGI0.samples, MGI2.samples)

# Choose colors, one per condition
colors = c('#006682', '#c26e00') ##006682 = blue , #c26e00 = orange

# Plot log2 fold changes using copy number abundances in the selected KEGG pathway:
# beta-Lactam resistance Ko01501
# The hue of the color depends on the value of the log-fold-change calculated using
# copy numbers: the darkest orange implies a log-fold-change value >= 1.5
#(more abundant in MGI2) and the darkest blue a log-fold-change <= -1.5
#(more abundant in MGI0 samples). See Figure 3 below:
exportPathway(all_sigs, '01501', output_suffix = 'beta_lactam_resis.log2FC',
fold_change_colors = colors, fold_change_groups = cond, count ='copy_number',
max_scale_value = 10)

# Periodontal Pathogens Produce Quorum Sensing Signal Molecules
exportPathway(all_sigs, '02024', output_suffix = 'Quorum_sensing.log2FC',
fold_change_colors = colors, fold_change_groups = cond, count ='copy_number',
max_scale_value = 10)
```

## References
[1] Tamames J, Puente-Sánchez F. SqueezeMeta, A Highly Portable, Fully Automatic Metagenomic Analysis Pipeline. Front
Microbiol. 2019;9:3349. Published 2019 Jan 24. doi:10.3389/fmicb.2018.03349 

[2] Microbiota and Metatranscriptome Changes Accompanying the Onset of Gingivitis Emily M. Nowicki, Raghav Shroff, Jacqueline
A. Singleton, Diane E. Renaud, Debra Wallace, Julie Drury, Jolene Zirnheld, Brock Colleti, Andrew D. Ellington,
Richard J. Lamont, David A. Scott, Marvin Whiteley mBio. 2018 Mar-Apr; 9(2): e00575-18. Published online 2018 Apr 17.
doi: 10.1128/mBio.00575-18 PMCID: PMC5904416.