-
Notifications
You must be signed in to change notification settings - Fork 28
Full pipeline for analysing massive datasets
Some datasets are too large to fit entirely into memory.
meffil
can handle these datasets by saving the methylation matrix
in a GDS file
that allows smaller parts of the matrix to be later loaded into memory as needed
(e.g. one row/CpG site at a time).
The first few QC and normalization steps are exactly the same as for smaller datasets.
# Load meffil and set how many cores to use for parallelization
library(meffil)
options(mc.cores=6)
# Read samplesheet
samplesheet <- meffil.create.samplesheet(path_to_idat_files)
# Background and dye bias correction, sexprediction, cell counts estimates
qc.objects <- meffil.qc(samplesheet, cell.type.reference="blood gse35069", verbose=TRUE)
# Load genotypes for comparison with those measured on the microarray
genotypes <- meffil.extract.genotypes(plink.files)
# Generate QC report
qc.summary <- meffil.qc.summary(qc.objects, genotypes=genotypes)
meffil.qc.report(qc.summary, output.file="qc/report.html")
# Remove outlier samples if necessary
qc.objects <- meffil.remove.samples(qc.objects, qc.summary$bad.samples$sample.name)
# Determine number of control probe principal components to use for normalization
print(meffil.plot.pc.fit(qc.objects)$plot)
# Perform quantile normalization
norm.objects <- meffil.normalize.quantiles(qc.objects, number.pcs=10)
There is a slight difference when it comes time to produce the normalized
methylation matrix. Here we tell meffil.normalize.samples
to save the normalized
methylation levels to a GDS file (gds.filename
) rather than retaining them in memory.
gds.filename <- "beta.gds"
meffil.normalize.samples(
norm.objects,
just.beta=T,
remove.poor.signal=T,
cpglist.remove=qc.summary$bad.cpgs$name,
gds.filename=gds.filename,
verbose=T)
Notice also that remove.poor.signal
is set to TRUE
.
This means that we are asking for methylation levels to be set to missing
if the underlying probe signal had a detection p-value that was too high
or the number of beads detected was too low.
Thresholds for these probe signal parameters was set during QC above (see meffil.qc()
).
To generate a normalization report, we will need principal components for the normalized methylation matrix. These can be calculated directly from the normalized methylation matrix in the GDS file.
autosomal.sites <- meffil.get.autosomal.sites("450k")
pcs <- meffil.methylation.pcs(gds.filename, sites=autosomal.sites)
The normalization report is then generated as usual.
norm.summary <- meffil.normalization.summary(norm.objects, pcs=pcs)
meffil.normalization.report(norm.summary, output.file="normalization/report.html")
A subset or even the entire methylation matrix can be loaded into memory
using the meffil.gds.methylation
function.
sites <- c("cg20076442", "cg25953130", "cg04521626", "cg14097568")
samples <- samplesheet$Sample_Name[1:5]
beta <- meffil.gds.methylation(gds.filename, sites=sites, samples=samples)
EWAS can also be performed while loading data for one CpG site at a time.
ewas.ret <- meffil.ewas(
gds.filename,
variable=variable,
covariates=covariates,
sites=autosomal.sites,
isva=F, sva=F, smartsva=T,
n.sv=10,
winsorize.pct=NA, outlier.iqr.factor=3,
verbose=T)
Notice that the command is identical as for a smaller dataset except that, where the methylation matrix would be passed to the function, we instead supply the GDS filename of the normalized methylation matrix.
Generating an EWAS report is similarly identical except that the GDS filename is supplied instead of the methylation matrix.
ewas.summary <- meffil.ewas.summary(
ewas.ret,
gds.filename,
verbose=T)
meffil.ewas.report(
ewas.summary,
output.file="ewas.html",
author="Me",
study="EWAS of a huge dataset")
It is possible to apply an arbitrary function to rows (CpG sites) or columns (samples) of the methylation matrix in the GDS file. In the following example, we calculate the methylation variance of each CpG site.
cpg.var <- meffil.gds.apply(gds.filename, bysite=T, type="double", FUN=var, na.rm=T)
- Installation
- Sample QC
- Functional normalization
- Functional normalizing separate datasets
- Extracting structural variants
- Estimating cellular composition
- Removing chrX and chrY probes
- Running EWAS
- Extracting CpG annotations
- Extracting SNP annotations
- Extracting detection p-values
- Extracting methylated and unmethylated intensities
- Generate normalization report from normalised betas
- Full pipeline for analysing massive datasets
- Common problems
- Citation