A collection of resources on analysis of scATAC+scRNA multi-omic data
- scATAC analysis
- Data structures for multi-omics
- Joint dimensionality reduction
- Peak-gene matching
- Publicly available multi-omic datasets
Because Tn5 insertions can take place anywhere in the genome, after alignment for each cell barcode we will have a set of fragments covering different genomic positions. To allow comparison between cells we need to define a common set of genomic features. Once a feature set is defined, we count the number of fragments overlapping each feature for each single-cell, obtaining a (quasi-)binary feature x cell count matrix.
Different pipelines/papers use different strategies to define a feature set.
-
Binning the genome into equally sized windows (5-10kb) (e.g. SnapATAC). Using a set of genomic bins is a good way to start exploring your data, however when working with a large genome (e.g. human and mouse) this generates a huge matrix with over 200k features, making many analysis steps much slower. In addition, epigenomics studies highlight that the tipical size of regulatory regions is closer to hundreds rather than thousands of bps.
-
Peak calling: take a pseudo-bulk coverage track of all the cells or a group of cells and identify peaks in this signal. The most used algorithm for peak calling is MACS.
- Peak calling from CellRanger (big caveat: does peak calling for each sample independently)
- Cusanovich2018 approach: starts by binning the genome into fixed-size windows and building a bin x cell binary matrix. Bins that overlap ENCODE-defined blacklist regions are filtered out, and the top 20,000 most commonly used bins are retained. Then, the bins-by-cells binary matrix is normalized and rescaled using the term frequency-inverse document frequency (TF-IDF) transformation. Next, singular value decomposition (SVD) is performed to generate a PCs-by-cells LSI score matrix, which is used to cluster cells. Within each cluster, peak calling is performed on the aggregated scATAC-seq profiles
-
Using a annotated set of enhancers/regulatory regions in your genome of interest (frequently seen for studies in Drosophila)
-
Other scATAC-specific methods for feature extraction:
-
Calculating QC metrics (Signac)(ArchR)
- TSS enrichment: we expect an enrichment of fragments around transcription start sites
- Fragment size distribution: we expect to see regular peaks, these correspond to fragments that span a different number of nucleosomes
-
Filtering/refining peaks: this is dependent on the peak calling algorithms, but you might want to filter some peaks out (also to make the size of the dataset more manageable):
- Remove peaks overlapping with ENCODE blacklist (an R-ready annotation for model organism genomes can be found in Signac)
- Remove peaks that are accessible in less than n% cells in any given cluster (strategy used in Cusanovish2018)
- Filter by peak width: peaks that are exactly as wide as the peak-calling width threshold are usually just a few reads from a few cells. On the other side, keeping very wide peaks might introduce artifacts (bigger peak, more reads)
-
Filtering cells:
- Basic filtering of cells with very low coverage (these often mess up dimensionality reduction if unfiltered)
- ArchR doublet detection
- TF-IDF + SVD (Latent Semantic Indexing) (Signac) (ArchR)(muon)
- Latent Dirichlet Allocation (cisTopic)
- PeakVI (paper)(code)
Additional resources:
Summarising accessibility over gene promoters/bodies is often useful for data exploration and for qualitative or quantitative comparison with RNA. Different pipelines/papers use different strategies to reduce accessibility signal to a gene x cell matrix.
- Counting fragments over gene bodies and promoters (implemented in Signac)
- ArchR gene scoring: uses counts over gene bodies and promoter + weighting for distance of regulatory elements around the gene
- Cicero gene activity score: based on measuring co-accessibility between peaks
- Gene scores from cisTopic: taking the average de-noised accessibility signal from peaks around each gene (example implementation)
- chromVAR: determine variations in chromatin accessibility across peaks containing a set of TF binding motifs
- MUON data (code)(preprint) - python - extension of AnnData
- MultiAssayExperiment object (code) - R/Bioconductor - extension of SummarizedExperiment
- Alternative experiment slot in SingleCellExperiment object (see OSCA])
- j-SNE and j-UMAP (paper) (code)
- Multi-omics factors analysis: (paper) (code) (vignette)
- Seurat V4 Weighted Nearest Neighbor analysis (paper)(preprint)(code)(vignette) - extends to > 2 modalities
- Schema (paper) (code)
- Multigrate
- MultiVI (paper)(vignette)
- scMVP (paper)(code)
- NeurIPS Open Problems challenge results: Multimodal single cell data integration challenge: results and lessons learned (paper)
- Cobolt (paper) (code)
- DORC analysis (example code)
- SEAcells - aggregating at metacell level for regulatory interation analysis (code) (paper)
(add newest on top)
- Arguelaguet et al. 2022 - 10X multiome data of early mouse organogenesis
- Fleck et al. 2021 - Multiome data of human cerebral organoids
- Mimitou et al. 2021 - 10X genonics multiome + surface proteins, demonstrated on PBMCs
- Dou et al. 2020 9383 mouse retina cells, used as gold standard for diagonal integration algorithm
- Trevino et al. 2020 Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution (8981 cells)
- Example data from 10X
- SHARE-seq data
- SNARE-seq data
- sciCAR-data