GitHub - emdann/momicsTools: A collection of resources and knowledge for analysis of scATAC+scRNA multi omics data

A collection of resources on analysis of scATAC+scRNA multi-omic data

scATAC analysis

Picking features

Because Tn5 insertions can take place anywhere in the genome, after alignment for each cell barcode we will have a set of fragments covering different genomic positions. To allow comparison between cells we need to define a common set of genomic features. Once a feature set is defined, we count the number of fragments overlapping each feature for each single-cell, obtaining a (quasi-)binary feature x cell count matrix.

Different pipelines/papers use different strategies to define a feature set.

Binning the genome into equally sized windows (5-10kb) (e.g. SnapATAC). Using a set of genomic bins is a good way to start exploring your data, however when working with a large genome (e.g. human and mouse) this generates a huge matrix with over 200k features, making many analysis steps much slower. In addition, epigenomics studies highlight that the tipical size of regulatory regions is closer to hundreds rather than thousands of bps.
Peak calling: take a pseudo-bulk coverage track of all the cells or a group of cells and identify peaks in this signal. The most used algorithm for peak calling is MACS.
- Peak calling from CellRanger (big caveat: does peak calling for each sample independently)
- Cusanovich2018 approach: starts by binning the genome into fixed-size windows and building a bin x cell binary matrix. Bins that overlap ENCODE-defined blacklist regions are filtered out, and the top 20,000 most commonly used bins are retained. Then, the bins-by-cells binary matrix is normalized and rescaled using the term frequency-inverse document frequency (TF-IDF) transformation. Next, singular value decomposition (SVD) is performed to generate a PCs-by-cells LSI score matrix, which is used to cluster cells. Within each cluster, peak calling is performed on the aggregated scATAC-seq profiles
Using a annotated set of enhancers/regulatory regions in your genome of interest (frequently seen for studies in Drosophila)
Other scATAC-specific methods for feature extraction:
- Scregseg: uses an HMM with Dirichlet-Multinomial emission probabilities to segment the genome according to distinct relative cross-cell accessibility profiles (paper)
- [BROCKMAN]: k-mer based accessibility (paper)

Quality control

Calculating QC metrics (Signac)(ArchR)
- TSS enrichment: we expect an enrichment of fragments around transcription start sites
- Fragment size distribution: we expect to see regular peaks, these correspond to fragments that span a different number of nucleosomes
Filtering/refining peaks: this is dependent on the peak calling algorithms, but you might want to filter some peaks out (also to make the size of the dataset more manageable):
- Remove peaks overlapping with ENCODE blacklist (an R-ready annotation for model organism genomes can be found in Signac)
- Remove peaks that are accessible in less than n% cells in any given cluster (strategy used in Cusanovish2018)
- Filter by peak width: peaks that are exactly as wide as the peak-calling width threshold are usually just a few reads from a few cells. On the other side, keeping very wide peaks might introduce artifacts (bigger peak, more reads)
Filtering cells:
- Basic filtering of cells with very low coverage (these often mess up dimensionality reduction if unfiltered)
- ArchR doublet detection

Dimensionality reduction

TF-IDF + SVD (Latent Semantic Indexing) (Signac) (ArchR)(muon)
Latent Dirichlet Allocation (cisTopic)
PeakVI (paper)(code)

Additional resources:

Blogpost on dimensionality reduction in scATAC-seq

Defining gene-level features

Summarising accessibility over gene promoters/bodies is often useful for data exploration and for qualitative or quantitative comparison with RNA. Different pipelines/papers use different strategies to reduce accessibility signal to a gene x cell matrix.

Counting fragments over gene bodies and promoters (implemented in Signac)
ArchR gene scoring: uses counts over gene bodies and promoter + weighting for distance of regulatory elements around the gene
Cicero gene activity score: based on measuring co-accessibility between peaks
Gene scores from cisTopic: taking the average de-noised accessibility signal from peaks around each gene (example implementation)

Motif analysis

chromVAR: determine variations in chromatin accessibility across peaks containing a set of TF binding motifs

Data structures for multi-omics

MUON data (code)(preprint) - python - extension of AnnData
MultiAssayExperiment object (code) - R/Bioconductor - extension of SummarizedExperiment
Alternative experiment slot in SingleCellExperiment object (see OSCA])

Joint dimensionality reduction

j-SNE and j-UMAP (paper) (code)
Multi-omics factors analysis: (paper) (code) (vignette)
Seurat V4 Weighted Nearest Neighbor analysis (paper)(preprint)(code)(vignette) - extends to > 2 modalities
Schema (paper) (code)
Multigrate
MultiVI (paper)(vignette)
scMVP (paper)(code)
NeurIPS Open Problems challenge results: Multimodal single cell data integration challenge: results and lessons learned (paper)
Cobolt (paper) (code)

Peak-gene matching

DORC analysis (example code)
SEAcells - aggregating at metacell level for regulatory interation analysis (code) (paper)

Publicly available multi-omic datasets

(add newest on top)

Arguelaguet et al. 2022 - 10X multiome data of early mouse organogenesis
Fleck et al. 2021 - Multiome data of human cerebral organoids
Mimitou et al. 2021 - 10X genonics multiome + surface proteins, demonstrated on PBMCs
Dou et al. 2020 9383 mouse retina cells, used as gold standard for diagonal integration algorithm
Trevino et al. 2020 Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution (8981 cells)
Example data from 10X
SHARE-seq data
SNARE-seq data
sciCAR-data

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Table of contents

scATAC analysis

Picking features

Quality control

Dimensionality reduction

Defining gene-level features

Motif analysis

Data structures for multi-omics

Joint dimensionality reduction

Peak-gene matching

Publicly available multi-omic datasets

About

Releases

Packages

emdann/momicsTools

Folders and files

Latest commit

History

Repository files navigation

Table of contents

scATAC analysis

Picking features

Quality control

Dimensionality reduction

Defining gene-level features

Motif analysis

Data structures for multi-omics

Joint dimensionality reduction

Peak-gene matching

Publicly available multi-omic datasets

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages