Pygenprop is a Python library for programmatic exploration and usage of the EBI Genome Properties database.
At its core, the library contains seven major components:
- An object model for representing the Genome Properties database as an in-memory rooted directed acyclic graph
- A parser for Genome Properties database release flat files
- A parser for Genome Properties assignment long-form files
- A parser for InterProScan TSV files
- A results class that is used to assign genome properties to one or more organisms and compare assignments across multiple organisms
- An extended results class that is used to explore the InterProScan annotations and protein sequences that support genome properties assignments across multiple organisms
- Code for generating Micromeda files
Pygenprop is compatible with Python 3.6 or higher (3.5 may work, but it is not tested). Requirements can be found in environment.yaml.
pip install pygenprop
conda install -c conda-forge -c lbergstrand pygenprop
cd /path/to/pygenprop_source_dir
pip install .
Before Pygenprop can assign genome properties to an organism, it first has to gather information from the Genome Properties database. The easiest way to do so is to parse a Genome Properties database release file. This file is called genomeProperties.txt and is located in the flatfiles folder of the EBI Genome Properties GitHub repository. For each release of Genome Properties, a genomeProperties.txt file is generated from the description files of all public genome properties.
genomeProperties.txt files can be downloaded from the URLs in the compatibility table below using a web browser or UNIX commands such as wget or curl. They can also be streamed directly into Jupyter notebooks using the requests Python library. Code for streaming the database into a Jupyter notebook can be found here.
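As a rough illustration of fetching a release file from Python, the sketch below builds the raw GitHub URL for a given branch or tag and downloads the file into memory. It uses the standard library's urllib instead of requests, and the helper names (`flat_file_url`, `open_flat_file`, `load_properties_tree`) are hypothetical, not part of Pygenprop. It also assumes that `parse_genome_properties_flat_file` accepts any file-like object that yields lines, and that the file decodes as UTF-8.

```python
import io
import urllib.request

# URL pattern taken from the compatibility table.
BASE_URL = "https://raw.githubusercontent.com/ebi-pf-team/genome-properties"


def flat_file_url(ref="master"):
    """Build the genomeProperties.txt URL for a branch or tag such as 'rel2.0'."""
    return f"{BASE_URL}/{ref}/flatfiles/genomeProperties.txt"


def open_flat_file(ref="master", timeout=30):
    """Download genomeProperties.txt and return it as an in-memory text file."""
    with urllib.request.urlopen(flat_file_url(ref), timeout=timeout) as response:
        text = response.read().decode("utf-8")  # assumption: UTF-8 content
    return io.StringIO(text)


def load_properties_tree(ref="master"):
    """Parse the downloaded file with Pygenprop (needs network and pygenprop)."""
    # Imported lazily so the URL helpers work without Pygenprop installed.
    from pygenprop.database_file_parser import parse_genome_properties_flat_file

    return parse_genome_properties_flat_file(open_flat_file(ref))
```

Calling `load_properties_tree("rel2.0")` would then return the properties tree for release 2.0, provided the assumptions above hold.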
Pygenprop will be continually updated to take into account changes in the schema of the Genome Properties database. Below is a compatibility table that maps between Genome Properties and Pygenprop releases.
Genome Properties Release | genomeProperties.txt URL | Compatible Pygenprop Release |
---|---|---|
1.1 | https://raw.githubusercontent.com/ebi-pf-team/genome-properties/rel1.1/flatfiles/genomeProperties.txt | >= 0.6 |
2.0 | https://raw.githubusercontent.com/ebi-pf-team/genome-properties/rel2.0/flatfiles/genomeProperties.txt | >= 0.6 |
Latest | https://raw.githubusercontent.com/ebi-pf-team/genome-properties/master/flatfiles/genomeProperties.txt | >= 0.6 |
The ./data folder of the EBI Genome Properties GitHub repository contains a series of folders with information about both public and non-public genome properties. Each folder contains both a description (DESC) file and a status (status) file. The status file records whether a property is public (public: 0 means that a property is not public). One can use these status files to find non-public properties. The description files for these non-public properties can be parsed using the same parser as used for genomeProperties.txt. Each genome property object that results from parsing a description file has an attribute called public, which can be set to True or False to designate a property as public or not.
property_one.public = False
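One way to locate non-public properties is to scan the status files for a public flag of 0. The following is a minimal sketch, assuming each property folder sits directly under the data folder, that the status file is named `status`, and that it contains a line of the form `public:` followed by a digit. The function name `find_non_public_properties` is hypothetical and not part of Pygenprop.

```python
import re
from pathlib import Path

# Assumed status-file line format: "public: 0" or "public: 1".
PUBLIC_FLAG = re.compile(r"^\s*public:\s*(\d+)", re.MULTILINE)


def find_non_public_properties(data_dir):
    """Return sorted names of property folders whose status file has public: 0."""
    non_public = []
    for status_path in Path(data_dir).glob("*/status"):
        match = PUBLIC_FLAG.search(status_path.read_text())
        if match and int(match.group(1)) == 0:
            non_public.append(status_path.parent.name)
    return sorted(non_public)
```

The DESC file in each returned folder could then be opened and passed to the same parser used for genomeProperties.txt.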
Pygenprop can assign genome properties to an organism from InterProScan annotation TSV files, Genome Properties long-form assignment files (created by the Genome Properties Perl library) or a list of InterPro consortium signature accessions downloaded into a Jupyter Notebook. Pre-calculated InterProScan results for UniProt proteomes and taxonomies can be downloaded (in signature accession list format) from the beta version of the InterPro website.
InterProScan generates InterProScan annotation TSV files via domain annotation of an organism's proteins. Details and install instructions for InterProScan5 can be found here. For convenience, a Docker container for installing and running InterProScan5 can be found here.
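For orientation, the signature accessions that drive assignments sit in the fifth column of an InterProScan TSV file. The sketch below collects them with the standard library only. It is an illustration of the file layout, not Pygenprop's own parser, and the function name `signature_accessions` is hypothetical.

```python
import csv


def signature_accessions(tsv_handle):
    """Collect unique signature accessions (column 5) from an InterProScan TSV."""
    accessions = set()
    # QUOTE_NONE keeps quote characters in descriptions from confusing the reader.
    for row in csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 5 and row[4]:
            accessions.add(row[4])
    return accessions
```

It can be called on any open TSV handle, for example `with open('E_coli_K12.tsv') as handle: signature_accessions(handle)`.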
Pygenprop can be used to extract protein sequences that provide evidence for an organism possessing a genome property. To use this feature, the organism's proteome FASTA files that were annotated by InterProScan must be opened and passed to Pygenprop. See the workflow below for more details on using this feature.
Pygenprop can generate Micromeda files, a new SQLite3-based pathway annotation storage format that allows for the simultaneous transfer of multiple organisms' Genome Properties assignments and supporting information. Examples of supporting information include the InterProScan annotations and protein sequences that support assignments. These files allow for the transfer of complete Genome Properties datasets between researchers and software applications.
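Because Micromeda files are SQLite3 databases, they can be inspected with Python's built-in sqlite3 module. The sketch below lists whatever tables a file contains along with their row counts. It makes no assumption about the Micromeda schema itself, and the function name `list_tables` is hypothetical.

```python
import sqlite3


def list_tables(micromeda_path):
    """Return a dict mapping each table name in a SQLite file to its row count."""
    connection = sqlite3.connect(micromeda_path)
    try:
        names = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```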
A typical use case for Pygenprop involves a researcher seeking to compute and compare Genome Properties between organisms of interest. For example, a researcher may have discovered a novel bacterium and want to compare its functional capabilities to those of other bacteria within the same genus. The researcher could start the analysis by opening a Jupyter Notebook and directly importing pre-calculated InterProScan annotations for novel and reference genomes within the same genus. Below is example code for comparing virulence genome properties of E. coli K12 and O157:H7.
An interactive Jupyter Notebook with an extended version of this workflow, with outputs for each step, can be found here. Full API documentation is available here.
from sqlalchemy import create_engine
from pygenprop.results import GenomePropertiesResults, GenomePropertiesResultsWithMatches, \
    load_assignment_caches_from_database, load_assignment_caches_from_database_with_matches
from pygenprop.database_file_parser import parse_genome_properties_flat_file
from pygenprop.assignment_file_parser import parse_interproscan_file, \
    parse_interproscan_file_and_fasta_file
# Compare Properties and Steps Across Organisms
# =============================================
# Parse the flatfile database
with open('properties.txt') as file:
    tree = parse_genome_properties_flat_file(file)
# Parse InterProScan files
with open('E_coli_K12.tsv') as ipr5_file_one:
    cache_1 = parse_interproscan_file(ipr5_file_one)
with open('E_coli_O157_H7.tsv') as ipr5_file_two:
    cache_2 = parse_interproscan_file(ipr5_file_two)
# Create results comparison object
results = GenomePropertiesResults(cache_1, cache_2,
                                  properties_tree=tree)
# Get properties with differing assignments
differing_results = results.differing_property_results
# Get property by identifier
virulence = tree['GenProp0074']
# Iterate to get the identifiers of
# child properties of virulence
types_of_vir = [genprop.id for genprop in virulence.children]
# Get assignments for virulence properties
virulence_assignments = results.get_results(*types_of_vir,
                                            steps=False)
# Get percentages of virulence steps assigned
# YES, NO, and PARTIAL per organism
virulence_summary = results.get_results_summary(*types_of_vir,
                                                steps=True,
                                                normalize=True)
# Analyze InterProScan Annotations and Protein Sequences
# That Support Genome Properties Across Organisms
# ==================================================
# Parse InterProScan files and FASTA files
with open('./E_coli_K12.tsv') as ipr5_file_one:
    with open('./E_coli_K12.faa') as fasta_file_one:
        extended_cache_one = parse_interproscan_file_and_fasta_file(ipr5_file_one, fasta_file_one)
with open('./E_coli_O157_H7.tsv') as ipr5_file_two:
    with open('./E_coli_O157_H7.faa') as fasta_file_two:
        extended_cache_two = parse_interproscan_file_and_fasta_file(ipr5_file_two, fasta_file_two)
# Create results comparison object with InterProScan match information
# and protein sequences
extended_results = GenomePropertiesResultsWithMatches(extended_cache_one,
                                                      extended_cache_two,
                                                      properties_tree=tree)
# Get lowest E-value matches for each Type III Secretion System component for E_coli_O157_H7.
extended_results.get_property_matches('GenProp0052', sample='E_coli_O157_H7', top=True)
# Get all matches for step 22 of Type III Secretion for E. coli K12.
extended_results.get_step_matches('GenProp0052', 22, top=False, sample='E_coli_K12')
# Write FASTA file containing the sequences of the lowest E-value matches for
# Type III Secretion System component 22 across both organisms.
with open('type_3_step_22_top.faa', 'w') as out_put_fasta_file:
    extended_results.write_supporting_proteins_for_step_fasta(out_put_fasta_file,
                                                              'GenProp0052',
                                                              22, top=True)
# Create a SQLAlchemy engine object for writing a Micromeda file.
engine_proteins = create_engine('sqlite:///ecoli_compare.micro')
# Write the results to the file.
extended_results.to_assignment_database(engine_proteins)
# Load results from a Micromeda file with proteins sequences.
assignment_caches_with_proteins = load_assignment_caches_from_database_with_matches(engine_proteins)
results_reconstituted_with_proteins = GenomePropertiesResultsWithMatches(*assignment_caches_with_proteins,
                                                                         properties_tree=tree)
The command-line interface of Pygenprop is used primarily for generating and working with Micromeda files. It provides four sub-commands and is installed when Pygenprop is installed.
usage: pygenprop [-h] {build,merge,info,preprocess} ...
A command-line interface for generating and manipulating Micromeda pathway annotation files.
positional arguments:
{build,merge,info,preprocess}
Available Sub-commands
build Generate a Micromeda file containing pathway annotations for one or more genomes. Supporting InterProScan and protein sequence information can also be optionally incorporated.
merge Merge multiple Micromeda files into a single output Micromeda file.
info Summarize the contents of a Micromeda file.
preprocess Replace FASTA header accessions with a numeric identifiers.
optional arguments:
-h, --help show this help message and exit
The build command is used to generate Micromeda files. It requires a copy of genomeProperties.txt. InterProScan TSV files are used as input.
pygenprop build -d ./genomeProperties.txt -i *.tsv -o ecoli_genomes_properties.micro
The build command has a -p flag that is used to add protein sequences to the output Micromeda file. With this flag active, Pygenprop searches the FASTA files that were scanned by InterProScan for proteins that support genome property steps and adds them to the output Micromeda file. The FASTA files must be in the same directory as the InterProScan files and share the same basename (i.e., the filename without its file extension).
data/
├── ecoli_one.faa
├── ecoli_one.tsv
├── ecoli_two.faa
├── ecoli_two.tsv
For the above directory structure the following shell command would be used to generate a Micromeda file that integrates protein sequences:
pygenprop build -d ./genomeProperties.txt -i *.tsv -o ecoli_genomes_properties.micro -p
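Before running a build with -p, it can help to confirm that every InterProScan file has a matching FASTA file. The sketch below is one way to check this, assuming the .tsv and .faa extensions used in the directory listing above. The function name `missing_fasta_files` is hypothetical, and which FASTA extensions the command actually accepts is not stated here.

```python
from pathlib import Path


def missing_fasta_files(directory, tsv_suffix=".tsv", fasta_suffix=".faa"):
    """Return names of InterProScan TSV files that lack a same-named FASTA file."""
    return sorted(
        tsv.name
        for tsv in Path(directory).glob(f"*{tsv_suffix}")
        if not tsv.with_suffix(fasta_suffix).exists()
    )
```

An empty list from `missing_fasta_files("data")` indicates that each TSV file has a FASTA file with the same basename.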
The merge command is used to merge multiple Micromeda files into a single output Micromeda file. It also requires a copy of genomeProperties.txt.
pygenprop merge -d ./genomeProperties.txt -i *.micro -o merged_ecoli_genomes_properties.micro
The info command is used to get a summary of a Micromeda file's contents.
pygenprop info -i merged_ecoli_genomes_properties.micro
The Micromeda file contains the following:
Samples: 2
Property Assignments: 2572
Step Assignments: 4644
InterProScan Matches: 2843
Protein Sequences: 1887
Documentation can be found on Read the Docs.
Please report issues to the issues page.
Apache License 2.0