A quality control pipeline for PanTools input data

bejo-dionnez/pantools-qc-pipeline

pantools-qc-pipeline

Pipeline for quality control of PanTools input data.

This pipeline can be used to obtain genome, annotation and protein data statistics, filter genome and annotation data, extract protein sequences and create functional annotations.

Requirements: Snakemake, Mamba.

Cloning this repository

To clone this repository, run:

git clone https://github.com/PanUtils/pantools-qc-pipeline
cd pantools-qc-pipeline

Install Snakemake and Mamba

If you don't have Mamba, install it using:

conda install -n base -c conda-forge mamba

Then, create a Snakemake environment using:

conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake

The environment can then be activated and verified with:

conda activate snakemake
snakemake --help

Specify config settings

By default, the pipeline uses the provided test data set as raw input data. This can be changed by updating the input paths in the provided config.yaml. Filtering parameters, output paths and the scratch directory can also be altered there.

Input data

Two input directories are required: one with genomic FASTA files and one with matching annotations. All FASTA files must end in .fna and all annotation files in .gff. If this is not the case, the genome and annotation file extensions can be changed using:

for file in <genomes>/*.fa*; do mv -- "$file" "${file%.fa*}.fna"; done
for file in <annotations>/*.gff3; do mv -- "$file" "${file%.gff3}.gff"; done
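
The same renaming can be sketched in Python for cases where a glob such as *.fa* is too broad (it would also match index files like .fai). This is a minimal sketch, not part of the pipeline: the function name normalise_extensions and its parameters are hypothetical, and it only looks at the final suffix of each file name.

```python
from pathlib import Path


def normalise_extensions(directory, old_suffixes, new_suffix):
    """Rename files whose final suffix is in old_suffixes to new_suffix.

    Hypothetical helper, not part of the pipeline. Returns the new paths,
    sorted by name. Files with any other suffix are left untouched.
    """
    renamed = []
    # Materialise the listing first so renames do not disturb iteration.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix in old_suffixes:
            target = path.with_suffix(new_suffix)
            if target.exists():
                # Never silently overwrite an existing genome or annotation.
                raise FileExistsError(f"refusing to overwrite {target}")
            path.rename(target)
            renamed.append(target)
    return sorted(renamed)
```

For example, normalise_extensions("genomes", {".fa", ".fasta"}, ".fna") would handle the genome directory and normalise_extensions("annotations", {".gff3"}, ".gff") the annotation directory, where both directory names are placeholders.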

By default, the pipeline assumes that the genome and annotation files match alphabetically, meaning that the sorted genome files line up one-to-one with the sorted annotation files. If this is not the case, a TSV file with the file names or paths of the matching files needs to be provided in the config. For example:

genome                  annotation
genome1.fna             annotation1.gff
/path/to/genome2.fna    /path/to/second_annotation.gff
...                     ...
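
A table like this can be drafted with a short script and then corrected by hand. The sketch below writes the default alphabetical pairing to a TSV file. It assumes the header row and two-column layout shown in the example above are what the pipeline expects; the function name write_pairing_table is hypothetical.

```python
import csv
from pathlib import Path


def write_pairing_table(genome_dir, annotation_dir, out_tsv):
    """Pair sorted .fna and .gff files and write them to a TSV file.

    Hypothetical helper: the column names follow the example table,
    which is assumed to be the format the pipeline reads.
    """
    genomes = sorted(Path(genome_dir).glob("*.fna"))
    annotations = sorted(Path(annotation_dir).glob("*.gff"))
    if len(genomes) != len(annotations):
        raise ValueError(
            f"{len(genomes)} genomes but {len(annotations)} annotations"
        )
    pairs = list(zip(genomes, annotations))
    with open(out_tsv, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["genome", "annotation"])
        for genome, annotation in pairs:
            writer.writerow([str(genome), str(annotation)])
    return pairs
```

Reviewing the written file makes mismatched pairs visible before the pipeline runs; rows can then be reordered or edited manually.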

Run the pipeline

The pipeline can be run with:

snakemake [rule] --use-conda --cores <threads> [--configfile <config>]

Where <threads> is the number of threads to run on, and <config> is a custom config file. If no config is provided, the pipeline will run on a small yeast test dataset. The possible rules are discussed below. If no rule is provided, the pipeline will create everything except for the raw statistics.
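
When the pipeline is launched from a script, the command line can be assembled programmatically. The sketch below only uses the options shown in the command above (--use-conda, --cores, --configfile); the function name build_snakemake_command is hypothetical.

```python
def build_snakemake_command(threads, rule=None, configfile=None):
    """Return the snakemake invocation as an argument list.

    Hypothetical helper. rule and configfile are optional, mirroring
    the square brackets in the command shown above.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    command = ["snakemake"]
    if rule is not None:
        command.append(rule)
    command += ["--use-conda", "--cores", str(threads)]
    if configfile is not None:
        command += ["--configfile", str(configfile)]
    return command
```

The returned list can be passed to subprocess.run(command, check=True) from inside the activated snakemake environment.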

Rules

raw_statistics

Provide statistics of raw genome and annotation data, and extracted protein sequences of the raw data. These statistics can be used to set the filtering parameters for the other rules.
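
To give an idea of the kind of numbers that help when choosing a length threshold, the sketch below computes a few common sequence length statistics from FASTA text. It is an illustration only: which statistics the pipeline actually reports is not stated here, and the function names read_fasta and length_statistics are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_statistics(lengths):
    """Summarise sequence lengths: count, total, shortest, longest, N50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    # N50: length of the sequence at which the running sum of lengths,
    # taken from longest to shortest, first reaches half of the total.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(ordered),
        "total": total,
        "shortest": ordered[-1] if ordered else 0,
        "longest": ordered[0] if ordered else 0,
        "n50": n50,
    }
```

With a file handle, the two functions combine as length_statistics(len(seq) for _, seq in read_fasta(handle)).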

filter

Filter the genomic FASTA files based on sequence length. Remove features from the annotation files that do not match sequences in the gff, then filter the annotations on longest isoform and ORF size of the CDS. Provide statistics of the filtered data.
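
The first two steps can be illustrated with a small sketch. It is not the pipeline's implementation. It assumes one reading of the feature filter, namely that a feature is dropped when its sequence ID (GFF column 1) does not belong to a sequence that survived the length filter. The names filter_by_length, filter_gff_lines and min_length are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def filter_by_length(records, min_length):
    """Keep (seq_id, sequence) pairs with at least min_length bases."""
    return [(seq_id, seq) for seq_id, seq in records if len(seq) >= min_length]


def filter_gff_lines(gff_lines, kept_ids):
    """Keep comment lines and features located on a retained sequence.

    GFF is tab-separated and the first column holds the sequence ID.
    """
    kept = []
    for line in gff_lines:
        if not line.strip():
            continue  # drop blank lines
        if line.startswith("#"):
            kept.append(line)  # directives and comments pass through
            continue
        seq_id = line.split("\t", 1)[0]
        if seq_id in kept_ids:
            kept.append(line)
    return kept
```

The set of retained IDs comes straight from the first step, for example kept_ids = {seq_id for seq_id, _ in filter_by_length(records, min_length)}. The longest isoform and ORF size filters need the parent/child structure of the features and are not covered here.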

proteins

Extract protein sequences from the filtered genomes with CDS features in the filtered annotations. Provide statistics of the protein fasta file contents.

functions

Create functional annotations from extracted protein sequences of the filtered data using InterProScan.
