TE quantification snakemake workflow

Overview

This pipeline quantifies transposable elements (TEs) in single-cell STORM-seq samples. It can also be used with data from other technologies such as SMART-seq and SMART-seq2.

Dependencies

Command Line tools

featureCounts

R libraries

data.table
dplyr
stringr
optparse
ggplot2
scales
ggh4x
scuttle

Steps in the pipeline

Aligned BAM files are passed to featureCounts with the parameters -F SAF -O -B -p --fracOverlap 0.1 -M -s 0 --fraction, as specified in the config.yaml file. These parameters can be changed as required.
The featureCounts output files for all cells are then combined into four matrices: raw counts, counts per million (CPM), raw counts for intergenic and intronic TEs only, and CPM for intergenic and intronic TEs only.
If filtering is set to True in the config.yaml file, a quick scuttle filtering step removes low-quality cells and generates filtered count matrices.
The CPM matrix for intergenic and intronic TEs only is used to calculate the log enrichment of TEs.
The enrichment score is calculated with the following formula.

$$ enrichment\ score = {{{Number\ of\ TE\ subfamilies\ >\ 1cpm} \over {Number\ of\ TEs\ >\ 1cpm}} \over {{Number\ of\ TE\ subfamilies} \over {Number\ of\ TEs}}} $$
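As a worked illustration, the formula can be evaluated with made-up counts. The four numbers below (30, 600, 100 and 10000) are hypothetical and are substituted into the formula exactly as it is written above; they do not come from any real dataset.

$$ enrichment\ score = {{30 \over 600} \over {100 \over 10000}} = {0.05 \over 0.01} = 5 $$

A score above 1 means the ratio among elements above 1 cpm is larger than the background ratio, so its logarithm is positive; a score below 1 gives a negative log enrichment.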

If the log enrichment heatmap option is set to True in the config.yaml file, a heatmap generated with ggplot2 is also saved as a PDF file.
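The counting step at the start of the pipeline can be sketched as a single featureCounts call per cell. The snippet below is only an illustration of how the listed flags fit together: the helper name fc_cmd and the three paths are hypothetical, the location of the extracted SAF file is an assumption, and the workflow itself builds this command from config.yaml.

```shell
#!/usr/bin/env bash
# Sketch only: print the featureCounts command for one cell (dry run).
# fc_cmd and all paths below are hypothetical placeholders.

fc_cmd() {
    local saf="$1" bam="$2" out="$3"
    # -F SAF            annotation file is in SAF format
    # -O                assign reads to every feature they overlap
    # -B                only count read pairs with both ends aligned
    # -p                input is paired-end
    # --fracOverlap 0.1 at least 10% of a read's bases must overlap the feature
    # -M --fraction     count multi-mapping reads, using fractional counts
    # -s 0              unstranded counting
    echo "featureCounts -F SAF -O -B -p --fracOverlap 0.1 -M -s 0 --fraction -a $saf -o $out $bam"
}

# Prints the command instead of running it; paste or pipe it to a shell to execute.
fc_cmd config/hg38_pc_te_chrM.saf aligned/cell_01.bam counts/cell_01.txt
```

Printing the command first makes it easy to check the flags against config.yaml before any counting is run.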

Using the workflow

Clone the repo (https://github.com/AyushSemwal/TE_quantification_snakemake) using:
- HTTPS: https://github.com/AyushSemwal/TE_quantification_snakemake.git
- SSH: git@github.com:AyushSemwal/TE_quantification_snakemake.git
Extract hg38_pc_te_chrM.saf.tar.gz and intergenic_intronic_tes.txt.tar.gz in the config folder.
Edit the config.yaml file in the config folder to set aligned_bam_dir (the directory containing the aligned BAM files), output_dir (the directory where all output files and subdirectories will be stored) and the other parameters as required. The parameter names are meant to be intuitive, and each one is also explained by a comment next to it.
Populate the samples.tsv file so that the first column contains the cell names (or sample names for bulk samples) and the second column contains the BAM file names. Do not add a header to this file.
If Slurm is available, submit the job by running sbatch bin/workflow_sbatch.sh from the parent directory; otherwise, run snakemake --use-conda --cores {num_cores}.
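The setup steps above can be sketched as a short shell session. This is a minimal sketch under stated assumptions: the archives are assumed to sit in config/, the helper make_samples_tsv is hypothetical and not part of the repository, and it assumes each cell name is simply the BAM file name without its .bam suffix. Commands that touch the network or start the workflow are left as comments.

```shell
#!/usr/bin/env bash
# Sketch of the setup steps; paths and the helper below are illustrative assumptions.

# 1. Clone and enter the repository (HTTPS shown):
#    git clone https://github.com/AyushSemwal/TE_quantification_snakemake.git
#    cd TE_quantification_snakemake
# 2. Extract the bundled annotation files (assumes the archives are in config/):
#    tar -xzf config/hg38_pc_te_chrM.saf.tar.gz -C config
#    tar -xzf config/intergenic_intronic_tes.txt.tar.gz -C config

# 3. Print a header-less, tab-separated table: cell name, BAM file name.
#    Assumption: cell name = BAM file name with the .bam suffix removed.
make_samples_tsv() {
    local bam_dir="$1" bam file
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue      # skip if the glob matched nothing
        file=$(basename "$bam")
        printf '%s\t%s\n' "${file%.bam}" "$file"
    done
}

# Usage (hypothetical BAM directory):
#    make_samples_tsv /path/to/aligned_bams > samples.tsv
# 4. Run locally, for example with 8 cores:
#    snakemake --use-conda --cores 8
```

If your cell names differ from the BAM file names, edit the first column of the generated table by hand before starting the workflow.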
