ParaGone

Current version: 1.1.3 (July 2024). See the change_log.md here

Purpose

ParaGone is a Python package (with external dependencies) designed to identify ortholog groups from a set of paralog sequences from multiple taxa. This process has been decribed as both paralogy resolution and orthology inference. For phylogenetic analyses, the correct inference of ortholog groups is critical in groups showing frequent gene or genome duplication, such as many families of land plants in which polyploidy is prevalent. Phylogenetic analysis of unrecognized paralogous gene copies can produce incorrect topologies, as the evolutionary history of gene families interferes with the evolutionary history of species lineages.

ParaGone implements multiple different algorithms to infer ortholog groups. These algorithms are applied to gene trees that contain paralogs, and ortholog groups are inferred based on algorithm-specific parsing of the tree topologies. The pipeline uses algorithms described and implemented by Yang and Smith 2014 here. These approaches have since been adapted for target capture datasets as described in the manuscript here. The original documentation and scripts can be found here.

ParaGone provides a single installable package that comprises many of the original Yang and Smith scripts, together with new pipeline steps and modules. The Yang and Smith scripts used have been updated, modified, and extended. In addition, the ParaGone pipeline includes many new report and logging files, allowing the user to trace the processing of input paralog sequences through the pipeline for all orthology inference algorithms employed. The pipeline can be run with a single command, or via six discreet steps.

Dependencies

Python >=3.6,<=3.9.16 (this maximum version constraint is due to an issue with Treeshrink), along with the Python libraries:
- progressbar2. The conda install can be found here.
- ete3
- biopython 1.79 or later
MAFFT
ClustalOmega
Julia NEW for version 1.1.0, along with the Julia module:
- ArgParse
TAPER. The TAPER script correction_multi.jl should be in your $PATH. NEW for version 1.1.0. TAPER replaces HmmCleaner.pl in ParaGone version 1.1.0
Trimal version 1.4
IQTREE
FastTree. The conda install can be found here.
Treeshrink

Setup

We strongly recommend installing ParaGone using conda with a new environment. This will install ParaGone, all required Python packages, and all required external programs. If you have conda installed and the channels bioconda and conda-forge have already been added, this can be done using the command:

conda create -n paragone paragone

...followed by:

conda activate paragone

For full installation instructions, including details on how to install on Macs with Apple Silicon (M1/M2/M3 chips), please see our wiki page:

https://github.com/chrisjackson-pellicle/ParaGone/wiki/Installation

Once all dependencies are installed you can download a test dataset here, extract the *.tar.gz file, and execute the run_paragone_test_dataset.sh script from the test_dataset directory for a demonstration of ParaGone.

Pipeline Input

Full instructions on running the pipeline, including a step-by-step tutorial using a small test dataset, are available on our wiki:

https://github.com/chrisjackson-pellicle/ParaGone/wiki/Tutorial

Paralog sequences

A folder containing a multi-fasta file for each gene, with paralog sequences. If these have been generated using HybPiper, each fasta file will contains the 'main' contig selected by HybPiper for each sample. Where HybPiper has detected putative paralog contigs, these sequences are also included; in such cases, the main contig has the fasta header suffix .main, whereas putative paralogs have the suffix .0, .1 etc.

Outgroup sequences

Some of the paralogy resolution methods used in this pipeline require an outgroup sequence for each of your genes. These outgroup sequences can be provided in two ways.

Designating one or more taxa in your HybPiper paralog files as outgroups, via the --internal_outgroup parameter. For example, if your paralog fasta files contain sequences from the taxa 79686 and 79689, you could designate these sequences as outgroups using --internal_outgroup 79686 --internal_outgroup 79689.
Providing a fasta file (e.g. outgroups.fasta) containing 'external' outgroup sequences via the option --external_outgroups_file outgroups.fasta. The sequences in the file should have the same fasta header formatting and gene names as your paralog fasta files. For example, if you have used the Angiosperms353 target file for your HybPiper analysis, and you wish to use sequences from Sesame as your outgroup, your outgroups.fasta file might contain the following:
```
>sesame-6995
gtgggatatgaacaaaatccattgagcttgtattactgtta...
>sesame-4757
ctggtgcgtcgagcacttctcttgagcagcaacaatggcgg...
>sesame-6933
gaagtagatgctgtggtggtggaagcattcgacatatgcac...

...etc
```

Again, note that the gene identifier following the dash in the fasta headers (e.g. 6995 for header >sesame-6995) needs to correspond to a gene identifier in your paralog fasta files.

It's fine if your outgroups.fasta file contains additional sequences that you don't want to use. When running the pipeline (see below) you can optionally provide one or more taxon names using the parameter --external_outgroup, e.g. --external_outgroup <taxon_id1> --external_outgroup <taxon_id2>, and only these taxa will be included as outgroups. If this option isn't provided, all taxa/sequences in the outgroups.fasta file will be used.

NOTE: at a minimum, you must provide either 'internal' ingroups via the --internal_outgroup parameter, OR a file of 'external' outgroup sequences via the --external_outgroups_file parameter. You can also provide both 'internal' and 'external' outgroups.

Pipeline Output

IMPORTANT: Unless you use the flag --keep_intermediate_files when running the command paragone full_pipeline (if running the pipeline with a single command) or paragone final_alignments (if running the pipeline in stages), most intermediate files and folders generated by the pipeline will be deleted. The information below assumes you have used the --keep_intermediate_files flag.

Running the pipeline will produce up to 28 output folders. The data in these folders can be useful for tracking and debugging. However, if you're just after the aligned .fasta sequences for each of your target genes (as output by each of the paralogy resolution methods), the main output folders of interest are probably:

23_MO_final_alignments
24_MI_final_alignments
25_RT_final_alignments
26_MO_final_alignments_trimmed
27_MI_final_alignments_trimmed
28_RT_final_alignments_trimmed

For a full description of ParaGone output, see the wiki.

Name		Name	Last commit message	Last commit date
Latest commit History 203 Commits
paragone		paragone
LICENSE.txt		LICENSE.txt
README.md		README.md
change_log.md		change_log.md
setup.py		setup.py
test_dataset.tar.gz		test_dataset.tar.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ParaGone

Purpose

Dependencies

Setup

Pipeline Input

Paralog sequences

Outgroup sequences

Pipeline Output

About

Releases 5

Packages

Languages

License

chrisjackson-pellicle/ParaGone

Folders and files

Latest commit

History

Repository files navigation

ParaGone

Purpose

Dependencies

Setup

Pipeline Input

Paralog sequences

Outgroup sequences

Pipeline Output

About

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 0

Languages

Packages