-
Notifications
You must be signed in to change notification settings - Fork 24
DeSeq2
Use secure copy to transfer the counts file and a sampleTable file. The sample table file is at:
/lustre/haven/proj/UTK0138/apricot_data/sampleTable.txt
We're going to process some data with DESeq2 in R. Open R Studio, and do installs of anything you don't have:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
library( "DESeq2" )
install.packages("pheatmap")
install.packages("RColorBrewer")
install.packages("ggplot2")
library("pheatmap")
library("RColorBrewer")
library("ggplot2")
Set the working directory - this tells DESeq2 where to find input files and were to write output files. You'll need to figure out your own directory on your laptop
setwd("/Users/margaretstaton/Desktop/apricot")
Lets import our counts files:
sampleTable <- read.table("sampleTable.txt", header=TRUE, colClasses=c("character", "character", "factor","factor"))
And make sure DESeq2 likes our formatting:
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = '.', design = ~ time)
Prefiltering very lowly expressed reads, which will make everything else run faster
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
Do the differential expression analysis (we'll learn more about this in the next lab, right now we just want the software to accept all the data so we can explore the data a bit).
dds <- DESeq(dds)
res <- results(dds)
res
Check dispersion plot
plotDispEsts(dds)
Now we actually need to transform the data again. From the vignette: "In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data."
Lets go with rlog.
rld <- rlog(dds, blind=FALSE)
Sample to sample distances
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
Generate a PCA
plotPCA(rld, intgroup=c("tree", "time"))
Fix up the colors and shapes to be more easily understood by using ggplot2:
pcaData <- plotPCA(rld, intgroup=c("tree", "time"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=time, shape=tree)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
Lets go back to investigating our results
summary(res)
For easy access to the genes of interest, let's create a dataframe with only the genes with an adjusted p value less than 0.05.
res_sig <- as.data.frame(res[ which(res$padj < 0.05),])
head(res_sig)
dim(res_sig)
summary(res_sig)
Lets sort the dataframe so we can see the most significant genes:
res_sig <- res_sig[order(res_sig$padj),]
To see the 10 most significant genes, we can do a head and request 10 lines:
head(res_sig, n=10)
You can add gene annotation to your data frame, making it easier to see what each gene is. You'll need to download the annotation file:
Save it into your working directory.
Here's the code:
## Add gene annotation info
# read annotation info file in
annotat <- read.csv("Ppersica_298_v2.1.annotation_info.txt", header = TRUE, comment.char = '', sep='\t')
# keep first row for each unique gene in locusName, removing transcript isoforms
annotat <- annotat[!duplicated(annotat[,2]),]
head(annotat)
# remove '.v2.1' in the gene name to match the locus name in the annotation file
rownames(res_sig) <- gsub('.v2.1','',rownames(res_sig))
# Merge annotation content to DEGs
res.anno <- merge(res_sig, annotat, by.x = "row.names", by.y = "locusName", all.x = TRUE, all.y = FALSE)
## write annotated results to table
write.table(res.anno, file="DEG_sig.txt", sep = "\t", row.names = F)