wget https://raw.githubusercontent.com/mestato/epp622/master/RNA_labs_data/count_data.csv
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
install.packages("colorspace")
require("DESeq2")
Option 1: Use your own sorted bam files from last lab (either from hisat2 mapping or STAR mapping).
Option 2: Copy the sorted bam files from my directory to your home directory.
cd ~
cp -r /data/home/mchen33/RNASeq_lab_2_DESeq ./
This requires internet access, so do this from the head node.
conda install htseq -y
Get an interactive session first:
qrsh
Run command cd ~/RNASeq_lab_2_DESeq
and then create a script file named count_reads.sh
. Put the following content into count_reads.sh
.
#!/bin/bash
## USAGE
## ./count_reads.sh hisat /path/to/the/sorted_bam_file_directory bam_file_ORDER_TYPE
## ./count_reads.sh STAR /path/to/the/sorted_bam_file_directory bam_file_ORDER_TYPE
mkdir counts_$1
for sorted_bam_path in $(find $2 -name *.bam)
do
counts_file=$(echo $sorted_bam_path | grep -o "DRR0161[0-9]*")_$1_ct
echo "The target bam file is: "$sorted_bam_path
echo "==================================================="
htseq-count -f bam \
-t gene \
-i gene_id \
-r $3 \
$sorted_bam_path \
~/RNASeq_lab_2_DESeq/Arabidopsis_thaliana.TAIR10.28.gtf \
| \
grep -v '^__' > ./counts_$1/$counts_file
echo "The count data has been written into: $counts_file"
echo "==================================================="
done
- Why do we need this command line:
grep -v '^__'
?
...
...
...
ATMG01370 0
ATMG01380 0
ATMG01390 272
ATMG01400 0
ATMG01410 0
__no_feature 8798
__ambiguous 4637
__too_low_aQual 1423
__not_aligned 2794
__alignment_not_unique 52464
Change the file mode to make it executable.
chmod u+x count_reads.sh
Run the command line below if you want to get count data from the STAR mapping results.
./count_reads.sh STAR ~/RNASeq_lab_2_DESeq/STAR_alignment_output pos
## needs 9m56.775s
Run the command line below if you want to get count data from the hisat mapping results.
./count_reads.sh hisat2 ~/RNASeq_lab_2_DESeq/hisat2_sorted_bam name
## needs 10m4.509s
- Change directory to
counts_hisat2
ORcounts_STAR
.
echo gene_ID $(ls DRR* | grep -o "DRR0161[0-9]*" | tr "\n" ' ') | tr -s [:blank:] ',' > count_data.csv
paste $(ls DRR* | sort) | awk '{for(i=3;i<=NF;i+=2) $i=""}{print}' | tr -s [:blank:] ',' >> count_data.csv
-
Transfer the file
count_data.csv
to your local computer with filezilla or thescp
command.- run the
scp command
on your local computer:
scp [email protected]:/path/to/count_data.csv /path/to/directory/you/want/to/put/count_data.csv
- run the
Open Rstudio and change working directory to where you store the file count_data.csv
.
install.packages("DESeq2")
library("DESeq2")
countData = read.csv('count_data.csv', header = TRUE, row.names = 1)
colData = read.csv("https://raw.githubusercontent.com/mestato/epp622/master/RNA_labs_data/experimental_info.csv", header = TRUE, row.names = 2)[, c("phenotype", "stress")]
colnames(countData) = paste0(colData$phenotype, '_', colData$stress)
rownames(colData) = paste0(colData$phenotype, '_', colData$stress)
## construct the data that analyzing functions from DESeq2 can recognize.
dds = DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ phenotype + stress)
dds
- Pre-filtering: discard rows that have 0 for all treatments
dim(dds) ## before filtering
dds = dds[rowSums(counts(dds))>1, ]
dim(dds)
dds = DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ phenotype + stress)
dds = dds[rowSums(counts(dds))>1, ]
dds = DESeq(dds)
res = results(dds)
res
Results
log2 fold change (MAP): stress saline vs ABA
Wald test p-value: stress saline vs ABA
DataFrame with 19453 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
AT1G01010 0.9246572 0.04106919 1.1664635 0.03520829 0.97191365 NA
AT1G01020 0.7345512 -0.50228525 1.1997038 -0.41867437 0.67545413 NA
AT1G01030 0.4918562 0.53399011 1.2159565 0.43915231 0.66055118 NA
AT1G01040 4.1881473 -0.88294298 0.7686287 -1.14872497 0.25066941 0.5074783
AT1G01050 12.9203449 0.95553494 0.4842900 1.97306342 0.04848834 0.1769671
... ... ... ... ... ... ...
ATMG01350 0.1237857 -0.6264714 0.9607553 -0.6520613 0.5143615962 NA
ATMG01360 0.4759696 1.1442281 1.2182994 0.9392011 0.3476275109 NA
ATMG01370 0.1996883 -0.4199930 1.0363781 -0.4052508 0.6852931724 NA
ATMG01380 0.1976165 0.3782464 1.0249796 0.3690282 0.7121066974 NA
ATMG01390 235.3293034 0.8171322 0.2399232 3.4058071 0.0006596878 0.00617547
Results explanation:
We choose gene ATMG01390 as an example
-
baseMean: the average of normalized counts across all samples. This represents the intercept of your GLM.
-
log2Foldchange: (stress saline vs ABA): log2(treated/untreated). Here it is log2(saline/ABA)
-
lfcSE: standard error of the log2FoldChange estimate
-
stat: statistic for the hypothesis test. For Wald test, it is log2FoldChange/lfcSE.
-
pvalue: the corresponding p-value from Wald test (or likelihood ratio test)
-
padj: adjusted p value due to multiple comparisons.
-
reasons for NA values:
- This gene has 0 count for all samples
- This row has an extreme count outlier
- Filtered by automatic independent filtering for having low mean count.
Get a summary of the results:
summary(res)
By default, the results()
function display comparison of the last level of the last variable over the first level of this variable.
We can extract results of comparisons between other levels:
results(dds, contrast=c("stress", "mock", "saline"))
We can also extract results of comparisons from the phenotype factor
results(dds, contrast=c("phenotype", "ros1-3", "wildtype"))
MA-plot of the results
plotMA(res)
Plot counts
We select the gene which has the smallest padj value to plot its count at different levels
mostSigGene = rownames(dds)[which.min(res$padj)]
mostSigGene
par(mfcol=c(1,2))
plotCounts(dds, gene=mostSigGene, intgroup="stress")
plotCounts(dds, gene=mostSigGene, intgroup="phenotype")
par(mfcol=c(1,1))
orderedRes = res[order(res$padj), ]
sigOrderedRes = subset(orderedRes, padj < 0.05)
write.csv(as.data.frame(sigOrderedRes), "STRESS_saline_vs_ABA.csv")
The LRT method compares the full model with the reduced model to see if the removed variable (factor) has significant effect on the fitted model.
dds = DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ phenotype + stress)
dds = DESeq(dds, test="LRT", reduced = ~phenotype)
resLRT = results(dds)
resLRT
How many genes are significant in LRT test?
orderedResLRT = resLRT[order(resLRT$padj), ]
sigOrderedResLRT = subset(orderedResLRT, padj<0.05)
dim(sigOrderedResLRT)
How many genes are significant in Wald test?
orderedRes = res[order(res$padj), ]
sigOrderedRes = subset(orderedRes, padj < 0.05)
dim(sigOrderedRes)
Why is the number from LRT much larger than the number from Wald test?
Let's a gene that are significant in LRT but not in Wald test.
LRTsigGenes = rownames(sigOrderedResLRT)
Res_genesSigInLRT = res[LRTsigGenes, ]
notSigWaldGene = rownames(Res_genesSigInLRT)[which.max(Res_genesSigInLRT$padj)]
notSigWaldGene
padj comparison for the selected gene
resLRT[notSigWaldGene, ]
res[notSigWaldGene, ]
Let's plot the counts for this gene
plotCounts(dds, gene=notSigWaldGene, intgroup="stress")
What do you see from the plot?
Let's extract the results of comparisons between dehydration and "ABA".
res2 = results(dds, contrast=c("stress", "dehydration", "ABA"))
res2[notSigWaldGene, ]