Updated version of PAQR, which was previously available in the PAQR_KAPAC joint repository.

zavolanlab/PAQR2

PAQR

PAQR is a tool (implemented as a Snakemake workflow) that quantifies transcript 3' ends (poly(A) sites) from standard RNA-seq data. As input it requires alignment files in BAM format (along with their corresponding ".bai" indices) and a BED file with coordinates of known "tandem" poly(A) sites (i.e. poly(A) sites that belong to the same gene). It returns a table of quantified tandem poly(A) sites.

For more information, please refer to the original PAQR publication.
The repository mentioned in the publication is accessible here. Please be aware that the original repository is no longer maintained; the repository you're currently looking at contains the most up-to-date version of PAQR.

Compatible input data:
By default, paired-end sequencing with read 1 in reverse orientation and read 2 in forward orientation is assumed. If your data is unstranded, you have to specify this in the config.yaml.
Single-end data with the reads in sense orientation is processed properly too, but PAQR does not support single-end data in reverse orientation.

Installation

1. Clone the repository

Go to the desired directory/folder on your file system, then clone/get the repository and move into the respective directory with:

git clone git@github.com:zavolanlab/PAQR2.git
cd PAQR2

2. Conda installation

Workflow dependencies can be conveniently installed with the Conda package manager. We recommend that you install Miniconda for your system (Linux). Be sure to select the Python 3 option.

3. Dependencies installation

For improved reproducibility and reusability, each step of the workflow runs either in its own Singularity container or in its own Conda virtual environment. As a consequence, running this workflow has very few individual dependencies. If you want to make use of container execution, please install Singularity in privileged mode, depending on your system*. You may have to ask an authorized person (e.g., a systems administrator) to do that. This will almost certainly be required if you want to run the workflow on a high-performance computing (HPC) cluster.

After installing Singularity, or should you choose not to use containerization but only conda environments, install the remaining dependencies with:

conda env create -f install/environment.yml

*If you have a Linux machine and root privileges (e.g., if you plan to run the workflow on your own computer), you can instead execute the following command to include Singularity in the Conda environment:

conda env create -f install/environment.root.yml

4. Activate environment

Activate the Conda environment with:

conda activate paqr2

5. Testing the execution

This repository contains a small test dataset so that users can test their installation of PAQR. To start the test run (using Conda environments), navigate to the root of the cloned repository, make sure the conda environment paqr2 is activated, and execute the following command:

bash execute/run_local_conda_test.sh

Preparations

1. Create tandem poly(A) sites file

For poly(A) site quantification and calculation of UTR length changes, PAQR requires a reference of known "tandem" poly(A) sites in BED format with additional columns. This file can be conveniently created with the tandem PAS pipeline, which uses the PolyASite atlas as a global reference of poly(A) sites. Only poly(A) sites on terminal exons that do not overlap with exons of other transcripts are selected. Different files can be created for stranded and unstranded RNA-seq data analysis. The columns of the tandem PAS file are as follows:

| Column | Value | Comments |
|---|---|---|
| 1 | chromosome | Ensembl naming scheme (no leading "chr") |
| 2 | start | start of poly(A) site cluster (or single site) |
| 3 | end | end of poly(A) site cluster (or single site) |
| 4 | ID | identifier in the format "chr:site:strand", where site is the representative site of the cluster (or the single nucleotide position of the single site) |
| 5 | score | e.g. tpm |
| 6 | strand | + or - |
| 7 | PAS rank | rank of the poly(A) site among its siblings in current transcript relative to 5' end |
| 8 | number of tandem PAS | total number of poly(A) sites present in transcript |
| 9 | exon info | in the format "transcript_ID:total_exons:current_exon:start:stop". Ensembl transcript ID, the number of exons belonging to that transcript, the rank of the considered exon (equals the number of exons if only terminal exons are considered), start and stop coordinates of the exon |
| 10 | gene ID | Ensembl gene ID |
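
To make the column layout concrete, here is a minimal Python sketch that parses one line of such a file. It assumes the file is tab-separated and has no header line; the class name `TandemPas`, the function `parse_tandem_pas_line`, and the example record are hypothetical and not part of PAQR.

```python
from dataclasses import dataclass


@dataclass
class TandemPas:
    """One record of the tandem PAS file (columns 1-10 as listed above)."""
    chrom: str
    start: int
    end: int
    site_id: str      # "chr:site:strand"
    score: float
    strand: str
    rank: int         # rank of the site among its siblings
    n_sites: int      # total number of tandem sites
    exon_info: str    # "transcript_ID:total_exons:current_exon:start:stop"
    gene_id: str

    @property
    def representative_site(self) -> int:
        # The middle field of "chr:site:strand" is the representative position.
        return int(self.site_id.rsplit(":", 2)[1])

    @property
    def transcript_id(self) -> str:
        # The first field of the exon info is the transcript ID.
        return self.exon_info.split(":")[0]


def parse_tandem_pas_line(line: str) -> TandemPas:
    """Parse a single line; assumes tab-separated fields (an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    chrom, start, end, site_id, score, strand, rank, n_sites, exon, gene = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return TandemPas(chrom, int(start), int(end), site_id, float(score),
                     strand, int(rank), int(n_sites), exon, gene)


# Made-up example record, not taken from a real poly(A) site atlas.
example = "\t".join([
    "1", "1000", "1010", "1:1005:+", "2.5", "+", "1", "2",
    "ENST00000000001:5:5:900:1500", "ENSG00000000001",
])
record = parse_tandem_pas_line(example)
print(record.representative_site, record.transcript_id)
```

A check like this can catch a wrong column count or a malformed strand field before the file is handed to the workflow.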

2. Ensure sufficient quality of your input samples

For PAQR to work correctly, it is crucial that the input RNA-seq samples are of good quality. We therefore strongly advise you to run a TIN-score calculation on your samples before using them in PAQR. As a rule of thumb, the median TIN score across all transcripts in a sample should be at least 70 for PAQR to give reliable results.
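
The rule of thumb above can be expressed as a small check. The following Python sketch assumes you already have per-transcript TIN scores for each sample as plain lists of numbers; the function name `passes_tin_check` and the scores are hypothetical, and how you obtain the scores depends on the TIN tool you use.

```python
from statistics import median

MIN_MEDIAN_TIN = 70.0  # rule-of-thumb threshold from the text above


def passes_tin_check(tin_scores, threshold=MIN_MEDIAN_TIN):
    """Return (median_tin, ok) for a list of per-transcript TIN scores."""
    if not tin_scores:
        raise ValueError("no TIN scores given")
    med = median(tin_scores)
    return med, med >= threshold


# Hypothetical per-transcript TIN scores for two samples.
samples = {
    "sample_A": [82.1, 75.4, 68.9, 90.3, 77.0],
    "sample_B": [55.2, 61.8, 70.5, 48.0, 66.3],
}
for name, scores in samples.items():
    med, ok = passes_tin_check(scores)
    status = "ok" if ok else "below threshold"
    print(f"{name}: median TIN = {med:.1f} -> {status}")
```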

3. Configure the input parameters

The file configs/config.yaml contains all information about parameter values, data locations, file names and so on. During a run, all steps of the PAQR pipeline retrieve their parameter values from this file. It follows the YAML syntax (find more information about YAML and its syntax here), making it easy to read and edit. The main principles are:

  • everything that comes after a # symbol is considered a comment and will not be interpreted
  • parameters are given as key-value pairs, with the key being the name and the value being the value of the parameter

Some entries require editing, while most can be left unchanged. The comments explain the meaning of the individual parameters. If you need to change path names, please make sure to use relative instead of absolute path names.
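
As a quick sanity check for the relative-path advice, the following Python sketch flags absolute paths in a configuration that has already been loaded into a flat dictionary (for instance with a YAML parser). The keys shown are hypothetical and do not correspond to the actual entries of configs/config.yaml.

```python
from pathlib import PurePosixPath


def find_absolute_paths(config):
    """Return the keys whose string values are absolute POSIX paths."""
    return [
        key for key, value in config.items()
        if isinstance(value, str) and PurePosixPath(value).is_absolute()
    ]


# Hypothetical configuration entries, for illustration only.
config = {
    "samples": "configs/samples.tsv",
    "tandem_pas": "/data/ref/tandem_pas.bed",
    "threads": 4,
}
print(find_absolute_paths(config))
```

Any key reported here would need to be rewritten as a path relative to the execution directory.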

4. Prepare a "samples.tsv"

This file contains the names (column header "ID"), conditions (column header "condition") and paths (column header "bam", relative to the execution directory) of all your input BAM files. For an example, see configs/samples.tsv.

NOTE: PAQR requires each .bam file and its corresponding .bam.bai file to be placed in the same directory. It also expects the basenames of the two files to be the same. Thus, only .bam file paths have to be given in the samples table, and the corresponding .bai file path is inferred from there.
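
The following Python sketch shows one way to read such a samples table and derive the index path for each BAM file. It assumes the index is named by appending ".bai" to the BAM path (based on the ".bam.bai" wording above); the function `read_samples` and the example rows are hypothetical and do not reproduce PAQR's own code.

```python
import csv
import io

REQUIRED_COLUMNS = ("ID", "condition", "bam")


def read_samples(handle):
    """Read a tab-separated samples table and infer each .bai path."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"samples table lacks column(s): {', '.join(missing)}")
    samples = []
    for row in reader:
        bam = row["bam"]
        samples.append({
            "ID": row["ID"],
            "condition": row["condition"],
            "bam": bam,
            "bai": bam + ".bai",  # assumption: index sits next to the BAM file
        })
    return samples


# Made-up table contents, for illustration only.
example_table = (
    "ID\tcondition\tbam\n"
    "ctrl_1\tcontrol\tdata/ctrl_1.bam\n"
    "kd_1\tknockdown\tdata/kd_1.bam\n"
)
for sample in read_samples(io.StringIO(example_table)):
    print(sample["ID"], sample["bam"], sample["bai"])
```

For a real file, pass an open file handle (for example `open("configs/samples.tsv", newline="")`) instead of the in-memory string.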

Execution

Create a new directory for your analysis within this directory and cd into it. Make sure you have the conda environment paqr2 activated. For your convenience, the directory execute contains bash scripts that can be used to start local and slurm runs, using either singularity or conda.

For example, you could run the example config configs/config.yml locally with singularity with:

bash snakemake_local_run_singularity_containers.sh configs/config.yml

Pipeline steps

[Figure: rule graph of the workflow]

Outputs

For each sample separately:

  • wiggle files of read coverages
  • UTR extensions made if known PAS downstream of annotated exon

All samples represented in one table:

  • tables of tandem PAS positions (tsv; columns: coordinate, relative position within terminal exon)
  • table of PAS relative usage (tsv; columns: chromosome, start, end, PAS ID, score, strand, PAS index on current exon, number of PAS on current exon, exon, gene, relative usage for each sample)
  • table of tandem PAS expression (tsv; columns same as above, tpm instead of relative usage)
  • table of "singular" PAS, where PAQR could not detect any usage of the PAS's tandem "siblings" (tsv; columns same as above)
  • table of weighted average exon lengths (tsv; columns: exon, relative exon length for each sample)
  • CDF plot of weighted average exon lengths (pdf)
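
The list above does not spell out how relative usage relates to expression. One plausible reading, shown in the Python sketch below, is that each site's relative usage is its share of the summed expression of all tandem sites on the same exon. This is an assumption for illustration, not PAQR's documented computation; PAQR may use a different scale (e.g. percentages), and the function name and values are hypothetical.

```python
def relative_usage(tpm_by_site):
    """Return each site's fraction of the total expression on one exon."""
    total = sum(tpm_by_site.values())
    if total == 0:
        # No expression on this exon: report zero usage for every site.
        return {site: 0.0 for site in tpm_by_site}
    return {site: tpm / total for site, tpm in tpm_by_site.items()}


# Made-up expression values (tpm) for two tandem sites on one exon.
exon_sites = {"1:1005:+": 30.0, "1:1400:+": 10.0}
usage = relative_usage(exon_sites)
for site, fraction in usage.items():
    print(site, fraction)
```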

About

If you're using PAQR in your research, please cite
Gruber, A.J., Schmidt, R., Ghosh, S. et al. Discovery of physiological and cancer-related regulators of 3′ UTR processing with KAPAC. Genome Biol 19, 44 (2018). https://doi.org/10.1186/s13059-018-1415-3
