SCAPE is a tool for estimating alternative polyadenylation (APA) events from single-end or pair-end scRNA-seq data (10x, Microwell-seq). The scRNA-seq must be generated by protocols based on polyT enrichment. SCAPE utilizes the insert size information of cDNA fragments to estimate polyadenylation sites (pA sites).
SCAPE models each pA sites as a Gaussian distribution, the corresponding isoform of which also has a weight (proportion). The final estimations are the mean and standard deviation of all pA sites, as well as their weights. The following figure shows the basic idea of SCAPE.
SCAPE-APA (A update version, 2024.3.15)
To accommodate the analysis of extensive scRNA-seq datasets, SCAPE-APA has been introduced as a revamped version of SCAPE, incorporating significant modifications. The updates to the package include:
- Grouping similar reads to expedite the estimation process.
- A revised mixture model specifically adapted for binned reads.
- Implementation of a Taichi framework for enhanced computational speed.
- Detection of erroneous alternative polyadenylation sites arising from junction reads.
- Support for installation via PyPI.
ref: SCAPE-APA: a package for estimating alternative polyadenylation events from scRNA-seq data
This can be single-end or pair-end sequencing scRNA-seq data. We only tested on data sequenced on Illumina platform. If special treatments have been taken regarding insert size in the library preparation, it is recommended to adjust the insert size distribution in the source code.
The Gene transfer format (GTF) is a file format used to hold information about gene structure. Since APA is an splicing event in the 3'-UTR, only reads falls in the 3'-UTR is relevant. GTF file is needed to pull out reads that fall into 3'-UTR, which is then used for the analysis.
GTF files for different species can be downloaded from Ensembl database.
The output will be a csv file correpsonding to a pA count matrix. Each row is a pA site and each column is a cell.
There is no need to install anything other than python3. Just download this repository and run the python scripts.
SCAPE is written in Python. It is a pre-requisite to have python3 installed before running SCAPE. SCAPE has two general functions (1) data preprocessing and (2) APA event inference.
$ git clone https://github.com/LuChenLab/SCAPE
$ conda create -n scape python=3.7.3
$ conda source activate scape
$ pip install -r SCAPE/requirements.txt
$ python SCAPE/main.py --help
Usage: python main.py [OPTIONS] COMMAND [ARGS]...
An analysis framework for analysing alternative polyadenylation at single
cell levels. Current version: 1.0.0
Options:
--help Show this message and exit.
Commands:
apamix
prepare
Preprocessing utr and intron region for inferring alternative polyadenylation events
Usage: python main.py prepare [OPTIONS]
Options:
--gtf TEXT The gtf (gz or not, but must be sorted) file for preparing utr and intron region.
[required]
--prefix TEXT The prefix of output bed file.
--help Show this message and exit.
run with bedtools version 2.26.0
# bedtools v2.26.0
bedtools sort -i GRCh38.p12.cr.gtf | bgzip > GRCh38.p12.cr.gtf.gz
python main.py prepare --gtf GRCh38.p12.cr.gtf.gz --prefix GRCh38
Perform the actual alternative polyadenylation events inference. If one wants to change the default paramters such as insert size distribution, maximum number of pA sites etc, it can be done by modifying the file "apamix.py" under folder "apamix".
Usage: python main.py apamix [OPTIONS]
Options:
--bed TEXT The target regions (bed format) used for mix
inference [required]
--bam TEXT The bam file (sorted and indexed) [required]
-o, --out TEXT The output path [required]
--cores INTEGER Num (cores) of region are infering at once
--cb TEXT The cell barcode file, one cb for one line.
[required]
--tag TEXT The cell barcode and UMI tag, for 10X: CB,UB.
[required]
--n_max_apa INTEGER The maximum number of pA sites. Default value is 5.
[required]
--n_min_apa INTEGER The minimum number of pA sites. Default value is 1.
[required]
--max_utr_len INTEGER The maximum length of UTR. Default value is 6000.
--la_dis_arr TEXT The distinct lengths of polyA lengths in the dataset.
Default: np.arange(self.min_LA, self.max_LA, 10).
User counld pass '[10, 30, 50, 70, 90, 110, 130]'.
--pmf_la_dis_arr TEXT The the number of reads for each distinct polyA
length. Default: Unif(min_LA, max_LA). '[309912,
4107929, 802856, 518229, 188316, 263208, 101]'.
--mu_f INTEGER The mean of insert size for illumina library. Default: 300
--sigma_f INTEGER The std of insert size for illumina library. Default: 50
-v, --verbose Verbose mode
--help Show this message and exit.
run
python main.py apamix \
--bed GRCh38_utr.bed \
--bam input.bam \
--out data/ \
--cores 12 \
--cb cellbarcode.tsv
The output of SCAPE can be loaded into a Seurat object in R, which could be easily use for downstream single cell analyses offered by Seurat.
Visit wiki for examples of running SCAPE.
The code for method comparison in our paper can be found here.
If you use SCAPE in your publication, please cite SCAPE by
Zhou et al. SCAPE: A mixture model revealing single-cell polyadenylation diversity and cellular dynamic during cell differentiation and reprogramming. Nucleic Acids Research. 2022, gkac167, https://doi.org/10.1093/nar/gkac167