Skip to content
yangyifan622 edited this page Oct 1, 2020 · 4 revisions

Table of content:

  1. RNA-Seq

  2. eQTL

Data preprocessing

Step 1 of the MAGNet project is to git clone the MAGNet directory to your local computer. MAGNet is a program written in R. So we have to install the newest version of R and Rstudio according to their website first.

In bash:

git clone https://github.com/mpmorley/iPSC-CM.git    
cd ~/iPSC-CM

In R, set up the working directory and install the latest version of ExpressExtras package by entering the following code.

setwd('~/MAGNet')

install.packages("remotes")
remotes::install_github("Morriseylab/ExpressExtras")

Also, load all needed packages into the working directory.

library(dplyr)
library(readr)
library(ExpressExtras)    
library(sva)
library(limma)
library(edgeR)
library(DESeq2)

Read in the count file and the phenotype file using the main script DiffExpressionAnalysis.R.

exprs = read.csv('https://www.dropbox.com/s/i5dthgl5c5ij5gd/Counts.csv?dl=1', row.names = 1)
pData <- read_csv('phenoData.csv') %>% arrange(sample_name)

Step 2 is to annotate the genes by ENSEMBL ID.

genenames <- GeneAnnotate(as.character(rownames(exprs)),organism = "human")
rownames(genenames)=genenames$ENSEMBL

Step 3 is to create a DGEList object from the table of read counts by incorporating the gene length information for FPKM normalization, compute counts per million (CPM), and filter out the genes with a low number of counts, only keeping the genes that have expressions in >80% of samples.

len= read_table2("GeneLengths.txt")
genes= left_join(genenames, len, by="ENSEMBL")    
dge <- DGEList(counts=exprs[rownames(exprs) %in% genenames$ENSEMBL,], genes=genes)
cpms = cpm(dge)
keep = rowSums(cpms>0) >=.8*dim(dge)[2]
cpms = cpms[keep,]

Step 4 is to normalize and calculate the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) by edgeR.

FPKM = FPKM[keep,]
FPKM <- calcNormFactors(FPKM) 
FPKM <- rpkm(FPKM) 

Batch effect correction

MAGNet was sequenced in 12 batches. PCA analysis shows that there is variation due to batch and other sources in variation within the first PCA component. Batch correction was not sufficient to correct for this unwanted sources of variations.

PCA

We therefore applied surrogate variant analysis using the svaseq function from the SVA package. mod is the model matrix for fitting the data while mod0 is the null model being compared to mod. 24 sva principal components have been used for correcting the batch effect. The final corrected result based on CPMs is stored in CPMS_SVA_corrected.RDS.

  mod <- model.matrix(~age+etiology+race+gender, pData)
  mod0 <- model.matrix(~1,data=pData) 
  cpms=cpms+1 
  unsup_sva = svaseq(as.matrix(cpms),mod,mod0)
  cpms.corr <- svaBatchCor(cpms,mod,mod0,n.sv=24)$corrected
  saveRDS(cpms.corr,file="CPMS_SVA_corrected.RDS")

Similarly, the correction based on FPKM can be obtained using the following code.

  unsup_sva.fpkm = svaseq(as.matrix(FPKM),mod,mod0)
  fpkm.corr <- svaBatchCor(FPKM,mod,mod0,n.sv=24)$corrected
  saveRDS(fpkm.corr,file="FPKM_SVA_corrected.RDS")

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

LIMMA Voom

Voom is used for transforming RNAseq data ready for linear modeling to estimate the mean-variance relationship in the dataset. LIMMA is used for identifying differentially expressed genes between two conditions.

Step 1 is to make a new conjugate variable called disease_race in pData to capture the interaction between the two terms in the linear model.

  pData <- pData %>% mutate(disease_race = paste0(race,'_',etiology))

Step 2 is to define the linear regression model and build the design matrix

  design = model.matrix(~0+pData$age+pData$disease_race+pData$gender + unsup_sva$sv)

Step 3 is to use Voom to estimate the mean-variance relationship in the dataset.

  v <- voom(dge,design,plot=TRUE) 

Step 4 is to build the contrast matrix for each comparison.

  contr.matrix <- makeContrasts(
                  AADCMvsNF = disease_raceAA_DCM-disease_raceAA_NF,
                  CauDCMvsNF = disease_raceCaucasian_DCM-disease_raceCaucasian_NF,
                  CauHCMvsNF = disease_raceCaucasian_HCM-disease_raceCaucasian_NF,
                  CauNFvsAANF=disease_raceCaucasian_NF-disease_raceAA_NF,
                  CauDCMvsAADCM=disease_raceCaucasian_DCM-disease_raceAA_DCM,
                  levels = colnames(design))

Step 5 is to fit the model in LIMMA and smooth out the standard errors by empirical Bayes.

  vfit <- lmFit(v, design)
  vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
  efit <- eBayes(vfit) 

  #Return results 
  topTable(efit,coef=1,n=25,p.value=0.05) 

DESeq2

We also used DESeq2 to estimate the differentially expressed genes in MAGNet. Apart from LIMMA, which used a gene-wise linear model linked by a global empirical bayes to capture the shared information between genes, DESeq2 used a negative binomial model to model the gene counts followed by a dispersion shrinkage.

Step 1 is to load the count data and specify the variables that the counts depend on in design. In our case, the count data depends on the race, etiology, gender, age and the interaction between race and etiology (similar to the race_disease variable in LIMMA).

dds <- DESeqDataSetFromMatrix(countData = dge[["counts"]], colData = pData, design= ~ race + etiology + race:etiology + gender + age )

Step 2 is to perform the differential expression analysis by DESeq2.

 dds <- DESeq(dds)

Step 3 is to exam the results by specifying the comparison groups. For instance, the DE genes between Caucasian and African American in NF (Non-Failure), HCM and PPCM, or the DE genes between DCM, HCM, PPCM and NF in the population of Caucasian and African American.

 results(dds, contrast=list(c("race_Caucasian_vs_AA","etiologyNF.raceCaucasian")))
 results(dds, contrast=list(c("race_Caucasian_vs_AA","etiologyHCM.raceCaucasian")))
 results(dds, contrast=list(c("race_Caucasian_vs_AA","etiologyPPCM.raceCaucasian")))

 results(dds, contrast=list(c("etiology_NF_vs_DCM","etiologyNF.raceCaucasian")))
 results(dds, contrast=list(c("etiology_HCM_vs_DCM","etiologyHCM.raceCaucasian")))
 results(dds, contrast=list(c("etiology_PP_vs_DCM","etiologyNF.raceCaucasian")))

Visualization

MAGNet Browser is an interactive web application built using the R-Shiny framework that makes this data easily accessible to the users while also allowing them to explore the results of the differential expression and eQTL analysis.This interface can be easily run through R Studio using the Shiny package.

Clone this wiki locally