Skip to content

Full pipeline for analysing massive datasets

Matthew Suderman edited this page Oct 2, 2021 · 3 revisions

Some datasets are too large to fit entirely into memory. meffil can handle these datasets by saving the methylation matrix in a GDS file that allows smaller parts of the matrix to be later loaded into memory as needed (e.g. one row/CpG site at a time).

The first few QC and normalization steps are exactly the same as for smaller datasets.

# Load meffil and set how many cores to use for parallelization
library(meffil)
options(mc.cores=6)

# Read samplesheet
samplesheet <- meffil.create.samplesheet(path_to_idat_files)

# Background and dye bias correction, sexprediction, cell counts estimates
qc.objects <- meffil.qc(samplesheet, cell.type.reference="blood gse35069", verbose=TRUE)

# Load genotypes for comparison with those measured on the microarray
genotypes <- meffil.extract.genotypes(plink.files)

# Generate QC report
qc.summary <- meffil.qc.summary(qc.objects, genotypes=genotypes)
meffil.qc.report(qc.summary, output.file="qc/report.html")

# Remove outlier samples if necessary
qc.objects <- meffil.remove.samples(qc.objects, qc.summary$bad.samples$sample.name)

# Determine number of control probe principal components to use for normalization
print(meffil.plot.pc.fit(qc.objects)$plot)

# Perform quantile normalization
norm.objects <- meffil.normalize.quantiles(qc.objects, number.pcs=10)

There is a slight difference when it comes time to produce the normalized methylation matrix. Here we tell meffil.normalize.samples to save the normalized methylation levels to a GDS file (gds.filename) rather than retaining them in memory.

gds.filename <- "beta.gds"
meffil.normalize.samples(
    norm.objects,
    just.beta=T,
    remove.poor.signal=T,
    cpglist.remove=qc.summary$bad.cpgs$name,
    gds.filename=gds.filename,
    verbose=T)

Notice also that remove.poor.signal is set to TRUE. This means that we are asking for methylation levels to be set to missing if the underlying probe signal had a detection p-value that was too high or the number of beads detected was too low. Thresholds for these probe signal parameters was set during QC above (see meffil.qc()).

To generate a normalization report, we will need principal components for the normalized methylation matrix. These can be calculated directly from the normalized methylation matrix in the GDS file.

autosomal.sites <- meffil.get.autosomal.sites("450k")
pcs <- meffil.methylation.pcs(gds.filename, sites=autosomal.sites)

The normalization report is then generated as usual.

norm.summary <- meffil.normalization.summary(norm.objects, pcs=pcs)
meffil.normalization.report(norm.summary, output.file="normalization/report.html")

A subset or even the entire methylation matrix can be loaded into memory using the meffil.gds.methylation function.

sites <- c("cg20076442", "cg25953130", "cg04521626", "cg14097568")
samples <- samplesheet$Sample_Name[1:5]
beta <- meffil.gds.methylation(gds.filename, sites=sites, samples=samples)

EWAS can also be performed while loading data for one CpG site at a time.

ewas.ret <- meffil.ewas(
    gds.filename,
    variable=variable, 
    covariates=covariates,
    sites=autosomal.sites,
    isva=F, sva=F, smartsva=T,
    n.sv=10,
    winsorize.pct=NA, outlier.iqr.factor=3,
    verbose=T)

Notice that the command is identical as for a smaller dataset except that, where the methylation matrix would be passed to the function, we instead supply the GDS filename of the normalized methylation matrix.

Generating an EWAS report is similarly identical except that the GDS filename is supplied instead of the methylation matrix.

ewas.summary <- meffil.ewas.summary(
    ewas.ret,
    gds.filename,
    verbose=T)
meffil.ewas.report(
    ewas.summary,
    output.file="ewas.html",
    author="Me",
    study="EWAS of a huge dataset")

It is possible to apply an arbitrary function to rows (CpG sites) or columns (samples) of the methylation matrix in the GDS file. In the following example, we calculate the methylation variance of each CpG site.

cpg.var <- meffil.gds.apply(gds.filename, bysite=T, type="double", FUN=var, na.rm=T)