Prokaryome is a WDL-based tool designed to streamline microbiome data processing for assembling, polishing, annotating, and visualizing prokaryotic genomes. It simplifies the analysis from raw sequencing reads to high-quality annotated genomes, making it accessible to both novice and experienced researchers. The pipeline begins with quality control (QC), which assesses the sequencing reads and filters out low-quality ones. This is followed by de novo assembly to construct a draft genome, which undergoes four rounds of polishing to improve accuracy and remove residual errors. After assembly refinement, the workflow annotates the genome, identifying key features such as coding sequences, tRNAs, and rRNAs. The final step produces visualizations and reports that summarize genome structure and content.
- Create conda environment:
conda create -n prokaryome
conda activate prokaryome
- Configure Conda channels:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
- Install the following tools:
conda install bioconda::bwa
conda install bioconda::samtools
conda install bioconda::pilon
conda install bioconda::ragtag
The remaining tools are pulled from Docker Hub, so there is no need to install them.
- Install Cromwell, the execution engine that compiles and runs WDL workflows:
conda install bioconda::cromwell
Once Cromwell is installed, you can write and run WDL workflows.
- Download Prokaryome
git clone https://github.com/saifeldeen-bio/Prokaryome.git
cd Prokaryome/
sudo mv extractDraft ../path-to/usr/bin
sudo mv Prokaryome-PE.wdl ../path-to/usr/bin
In your home directory run
nano ~/.bashrc
Add the following alias to the bashrc file
alias Prokaryome-PE='cromwell run /usr/bin/Prokaryome-PE.wdl'
Then save and exit, and reload the file with source ~/.bashrc (or open a new terminal) so the alias takes effect.
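If you would rather not edit the file by hand, the alias can be appended from the command line. The sketch below is only an illustration: `add_line_once` is a hypothetical helper (not part of Prokaryome) that appends a line to a file only when that exact line is not already present, so running it twice does not duplicate the alias.

```shell
# Hypothetical helper, not shipped with Prokaryome.
# Appends LINE to FILE unless an identical line already exists.
add_line_once() {
    local file="$1" line="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (uncomment to apply it to your own ~/.bashrc):
# add_line_once "$HOME/.bashrc" "alias Prokaryome-PE='cromwell run /usr/bin/Prokaryome-PE.wdl'"
```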
{
"Prokaryome.raw_reads": [
{
"left": "raw_reads/SRR00000000_1.fastq.gz",
"right": "raw_reads/SRR00000000_2.fastq.gz"
}
],
"Prokaryome.reference": "ref/dmel-all-chromosome-r6.46.fasta",
"Prokaryome.trim_sliding_window": "4:25",
"Prokaryome.trim_read_min_length": "36",
"Prokaryome.trim_adapter_file": "adapters/adapters.fa",
"Prokaryome.trim_head_crop": "0",
"Prokaryome.trim_trailing_crop": "0"
}
- Prokaryome.raw_reads: Your FASTQ files as an array of paired files for paired-end reads.
- Prokaryome.reference: The reference genome in FASTA format, used for assembly quality assessment and draft genome scaffolding.
- Prokaryome.trim_sliding_window: The Trimmomatic sliding-window setting, written as window size:required average quality (for example, 4:25).
- Prokaryome.trim_read_min_length: The minimum read length to retain after trimming in Trimmomatic.
- Prokaryome.trim_adapter_file: The adapter sequences to be removed from your reads. You can add custom sequences to remove. If you do not want to remove any adapters or sequences, provide a blank file.
- Prokaryome.trim_head_crop: Trims a specified number of bases from the start of each read, useful for removing overrepresented sequences at the beginning.
- Prokaryome.trim_trailing_crop: Trims a specified number of bases from the end of each read.
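For more than a handful of samples, writing the inputs file by hand gets tedious. The following is a minimal sketch, assuming the key names and default trimming values from the example inputs file above; `write_inputs` is a hypothetical helper, and it does not escape quotes or backslashes, so it only suits simple paths.

```shell
# Hypothetical helper, not shipped with Prokaryome.
# Usage: write_inputs LEFT_FASTQ RIGHT_FASTQ REFERENCE_FASTA OUTPUT_JSON
# Trimming values are copied from the example inputs file; adjust as needed.
write_inputs() {
    local left="$1" right="$2" reference="$3" out="$4"
    cat > "$out" <<EOF
{
  "Prokaryome.raw_reads": [
    {
      "left": "$left",
      "right": "$right"
    }
  ],
  "Prokaryome.reference": "$reference",
  "Prokaryome.trim_sliding_window": "4:25",
  "Prokaryome.trim_read_min_length": "36",
  "Prokaryome.trim_adapter_file": "adapters/adapters.fa",
  "Prokaryome.trim_head_crop": "0",
  "Prokaryome.trim_trailing_crop": "0"
}
EOF
}

# Example:
# write_inputs raw_reads/sample_1.fastq.gz raw_reads/sample_2.fastq.gz ref/reference.fasta inputs.json
```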
The workflow expects the following directory structure:
project/
├── inputs.json          # Inputs file
├── raw_reads/           # Contains paired-end FASTQ files
│   ├── sample_1.fastq.gz
│   └── sample_2.fastq.gz
├── ref/                 # Contains the reference genome
│   └── reference.fasta
└── adapters/            # Contains the adapter sequences
    └── adapters.fa
Prokaryome-PE -i inputs.json
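Before launching, a quick check of the project layout can catch a missing file early. This is a rough sketch, not part of the workflow: `check_layout` is a hypothetical helper, and the inputs file name is a parameter that defaults to the name used in the command above.

```shell
# Hypothetical helper, not shipped with Prokaryome.
# Usage: check_layout [PROJECT_DIR] [INPUTS_FILE]
# Reports every expected entry that is missing and returns non-zero if any is.
check_layout() {
    local root="${1:-.}" inputs="${2:-inputs.json}" missing=0 path
    for path in "$inputs" raw_reads ref adapters/adapters.fa; do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $root/$path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_layout project/ && (cd project/ && Prokaryome-PE -i inputs.json)
```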
- Quality Control:
  - Tool: FastQC & MultiQC
  - Evaluates the quality of raw sequencing reads.
- Trimming:
  - Tool: Trimmomatic
  - Removes low-quality bases.
- Assembly:
  - Tool: SPAdes
  - Constructs the genome assembly.
- Polishing:
  - Tool: BWA, Samtools, and Pilon
  - Refines the assembly across four rounds to improve accuracy.
- Assembly Quality Assessment:
  - Tool: QUAST
  - Evaluates the quality of the assembly compared to the reference genome.
- Draft Genome Generation:
  - Tool: RagTag & extract_draft.py
  - Generates a draft genome scaffold using the polished assembly and reference.
- Annotation:
  - Tool: Prokka
  - Annotates the genome with genes and functional information.
- Visualization:
  - Tool: GenoVi
  - Visualizes annotated genome features.
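To make the polishing step more concrete, the four rounds can be pictured as a loop in which each round maps the reads to the current draft and lets Pilon correct it. The sketch below is an assumption about the general pattern, not the commands the WDL actually runs: `polish_rounds` and `run` are hypothetical helpers, the file names are illustrative, and by default the script only prints the commands (set DRY_RUN=0 to execute them).

```shell
# Hypothetical sketch of iterative polishing; the real workflow may differ.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: polish_rounds DRAFT_FASTA READS_1 READS_2 [ROUNDS]
polish_rounds() {
    local draft="$1" r1="$2" r2="$3" rounds="${4:-4}" i=1
    while [ "$i" -le "$rounds" ]; do
        run bwa index "$draft"                                   # index the current draft
        run bwa mem -o "round${i}.sam" "$draft" "$r1" "$r2"      # map the paired reads
        run samtools sort -o "round${i}.bam" "round${i}.sam"     # coordinate-sort to BAM
        run samtools index "round${i}.bam"
        run pilon --genome "$draft" --frags "round${i}.bam" --output "round${i}"
        draft="round${i}.fasta"   # Pilon writes <output>.fasta, used as the next draft
        i=$((i + 1))
    done
}

# Example (prints the 20 commands for four rounds):
# polish_rounds contigs.fasta trimmed_1.fastq.gz trimmed_2.fastq.gz
```

Each round feeds the previous round's corrected FASTA back in, which is why the draft variable is updated at the end of the loop body.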
Install the following software as prerequisites:
- Cromwell: Workflow engine for running WDL files.
- BWA: For mapping reads during polishing.
- Samtools: For BAM file handling.
- Pilon: For assembly polishing (not dockerized).
- RagTag: For draft genome generation.
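A short check that these prerequisites are reachable on your PATH can save a failed run. This is a minimal sketch: `check_tools` is a hypothetical helper, and the executable names in the example are assumptions based on the Bioconda packages (for instance, RagTag is usually invoked as ragtag.py).

```shell
# Hypothetical helper, not shipped with Prokaryome.
# Usage: check_tools TOOL [TOOL ...]
# Prints each executable that cannot be found and returns non-zero if any is missing.
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (executable names are assumptions):
# check_tools cromwell bwa samtools pilon ragtag.py
```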
- FastQC reports for individual samples.
- MultiQC summary report consolidating all quality metrics.
- QUAST assembly reports.
- contigs.fasta
  - Contains assembled contigs.
  - Primary output for downstream analysis.
- scaffolds.fasta
  - Contains assembled scaffolds (contigs connected with gaps).
  - Useful if your data supports scaffolding (e.g., paired-end reads).
- assembly_graph.fastg
  - A FASTG file representing the assembly graph.
  - Useful for visualizing and analyzing the assembly structure.
- assembly_graph_with_scaffolds.gfa
  - A GFA format graph that includes scaffold information.
  - Suitable for genome assembly graph viewers.
- contigs.paths
  - Contains the paths of contigs through the assembly graph.
- scaffolds.paths
  - Contains the paths of scaffolds through the assembly graph.
- spades.log
  - A detailed log of the SPAdes run.
- params.txt
  - Lists the parameters used for the SPAdes run.
- input_dataset.yaml
  - Describes the input data provided to SPAdes.
- contigs.stats
  - Provides statistics on the assembled contigs (e.g., length, coverage).
- scaffolds.stats
  - Provides statistics on the assembled scaffolds.
- corrected/ directory
  - Contains reads that were error-corrected during the assembly process.
- misc/ directory
  - Includes intermediate files and additional data used in the assembly process.
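As a quick sanity check on contigs.fasta or scaffolds.fasta before moving downstream, you can count the sequences and total bases. The snippet below is a small sketch using awk; `fasta_summary` is a hypothetical helper, not something the pipeline provides, and it is no substitute for the QUAST report.

```shell
# Hypothetical helper, not shipped with Prokaryome.
# Usage: fasta_summary FILE.fasta
# Prints "<number of sequences> <total bases>" for a FASTA file.
fasta_summary() {
    awk '
        /^>/ { n++; next }                                   # header line: count a sequence
        { gsub(/[ \t\r]/, ""); bases += length($0) }         # sequence line: add its length
        END { print n + 0, bases + 0 }
    ' "$1"
}

# Example:
# fasta_summary contigs.fasta
```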
- Prokka outputs, including GenBank, GFF, and annotation tables.
- Figures and graphical outputs from GenoVi, such as the circular genome map and the counts and frequencies of Clusters of Orthologous Groups of proteins (COGs) subcategories.
- Babraham Bioinformatics, FastQC
- Ewels, Philip, et al. "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32.19 (2016): 3047-3048.
- Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. "Trimmomatic: a flexible trimmer for Illumina sequence data." Bioinformatics 30.15 (2014): 2114-2120.
- Bankevich, Anton, et al. "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing." Journal of computational biology 19.5 (2012): 455-477.
- Jung, Youngmok, and Dongsu Han. "BWA-MEME: BWA-MEM emulated with a machine learning approach." Bioinformatics 38.9 (2022): 2404-2413.
- Danecek, Petr, et al. "Twelve years of SAMtools and BCFtools." Gigascience 10.2 (2021): giab008.
- Walker, Bruce J., et al. "Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement." PloS one 9.11 (2014): e112963.
- Alonge, Michael, et al. "Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing." Genome biology 23.1 (2022): 258.
- Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069.
- Cumsille, Andrés, et al. "GenoVi, an open-source automated circular genome visualizer for bacteria and archaea." PLoS Computational Biology 19.4 (2023): e1010998.