Reducing study bias in gene prioritisation through simulation
- A Linux or Mac operating system.
- At least 1 GB of free disk space.
- Python version 2.7.
- Python packages: numpy (tested with version 1.11.0), pandas (tested with version 0.19.0), scipy (tested with version 0.18.1), sklearn (tested with version 0.0).
First download the PhenoRank repository.
The PhenoRank Python package should then be built by navigating to the main directory and entering the following two commands:
python setup.py build
python setup.py install
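Before building, it can be useful to confirm that the packages listed above are importable. The following is a minimal sketch and not part of PhenoRank; the helper name `check_packages` is hypothetical.

```python
import importlib


def check_packages(names):
    """Return {name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to a marker.
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    required = ["numpy", "pandas", "scipy", "sklearn"]
    for name, version in sorted(check_packages(required).items()):
        status = version if version is not None else "MISSING"
        print("%s: %s" % (name, status))
```

Any package reported as MISSING should be installed before running the build commands.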
PhenoRank can then be run using run_PhenoRank.py, as described below. An implementation of the PRINCE algorithm can also be run using run_PRINCE.py.
PhenoRank is run using the run_PhenoRank.py script contained in the main directory. To run PhenoRank, specify either 1) an OMIM ID for the query disease or 2) a set of phenotype terms describing the query disease.
Running PhenoRank using an OMIM term specifying the query disease:
python run_PhenoRank.py -o results.tsv -d OMIM:606070
Running PhenoRank using a set of phenotype terms describing the query disease:
python run_PhenoRank.py -o results.tsv -p 'HP:0007354;HP:0002460;HP:0001739'
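When running PhenoRank for many queries, the command line can be assembled from Python. This is a minimal sketch using only the `-o`, `-d` and `-p` options shown above; the helper `build_phenorank_command` is hypothetical, and rejecting a call that gives both or neither query type is an assumption of this sketch rather than documented behaviour of the script.

```python
def build_phenorank_command(output_file, omim_id=None, phenotype_ids=None):
    """Build the argument list for one run_PhenoRank.py call."""
    # Assumption: exactly one way of describing the query disease is given.
    if (omim_id is None) == (phenotype_ids is None):
        raise ValueError("specify exactly one of omim_id or phenotype_ids")
    command = ["python", "run_PhenoRank.py", "-o", output_file]
    if omim_id is not None:
        command += ["-d", omim_id]
    else:
        # Phenotype terms are passed as one semi-colon separated string.
        command += ["-p", ";".join(phenotype_ids)]
    return command


terms = ["HP:0007354", "HP:0002460", "HP:0001739"]
print(" ".join(build_phenorank_command("results.tsv", phenotype_ids=terms)))
```

The returned list can be passed to `subprocess.check_call` from the main directory. Passing a list rather than a shell string avoids having to quote the semi-colons.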
PhenoRank can also be run with additional parameters, as described below.
The following parameters are accepted by run_PhenoRank.py:
-o: Required. Name of the file to which the results are written.
-d, --omim_id: OMIM term specifying the query disease. Either this or --phenotype_ids should be specified.
-p, --phenotype_ids: Set of phenotype terms describing the query disease. Should be semi-colon separated. Either this or --omim_id should be specified.
Number of simulated diseases to use. Default is 1,000.
Restart probability to use in the RWR algorithm. Default is 0.1.
Number of iterations of the RWR algorithm to complete. Default is 20.
Can be used in benchmarking PhenoRank. If an Ensembl gene ID is specified here, then the association between this gene and the query disease is removed from the data used by PhenoRank before gene prioritisation. Can only be used when the query disease is specified using an OMIM term.
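To illustrate what the restart probability and the number of iterations control, here is a generic random walk with restart (RWR) on a small toy network. It is a sketch of the general technique and not PhenoRank's implementation: the column normalisation, the function name and the toy network are assumptions.

```python
def random_walk_with_restart(adjacency, seed_scores, restart=0.1, iterations=20):
    """Generic RWR: p <- (1 - restart) * W * p + restart * p0."""
    n = len(adjacency)
    # Column sums are used to normalise the adjacency matrix, so that each
    # node passes its score on to its neighbours in equal shares.
    col_sums = [float(sum(adjacency[i][j] for i in range(n))) for j in range(n)]
    total = float(sum(seed_scores))
    p0 = [s / total for s in seed_scores]
    p = list(p0)
    for _ in range(iterations):
        spread = [
            sum(adjacency[i][j] / col_sums[j] * p[j]
                for j in range(n) if col_sums[j] > 0)
            for i in range(n)
        ]
        # With probability `restart` the walker jumps back to the seeds.
        p = [(1.0 - restart) * spread[i] + restart * p0[i] for i in range(n)]
    return p


# Toy path network 0 - 1 - 2, with all of the starting score on node 0.
toy_network = [[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]]
print(random_walk_with_restart(toy_network, [1.0, 0.0, 0.0]))
```

A higher restart probability keeps scores closer to the seed nodes, while a lower one lets them spread further across the network.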
The results file generated by PhenoRank contains these columns:
- GENE: Ensembl gene identifier.
- SCORE UNRANKED UNPROP: Phenotypic-relevance score of each gene (before propagation across the PPI network) for the query disease.
- SCORE UNRANKED PROP: Phenotypic-relevance score of each gene (after propagation across the PPI network) for the query disease.
- PVALUE: P-value for each gene. Generated by comparing the phenotypic-relevance scores (after propagation across the PPI network) for the query disease against the phenotypic-relevance scores for each simulated disease.
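The idea behind the PVALUE column can be sketched as an empirical P-value: the fraction of simulated diseases for which a gene scores at least as highly as it does for the query disease. The exact formula used by PhenoRank is not given here, so the add-one correction and the function name below are assumptions.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated scores that are >= the observed score."""
    exceed = sum(1 for score in simulated if score >= observed)
    # Add-one correction (an assumption of this sketch) so that the
    # estimate is never exactly zero with a finite number of simulations.
    return (exceed + 1.0) / (len(simulated) + 1.0)


# One gene: score for the query disease versus four simulated diseases.
print(empirical_p_value(0.9, [0.1, 0.5, 0.95, 0.2]))
```

With 1,000 simulated diseases, the smallest P-value obtainable under this correction is 1/1001.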
PRINCE is run using the run_PRINCE.py script contained in the main directory. To run PRINCE, specify an OMIM ID for the query disease.
Running PRINCE using an OMIM term specifying the query disease:
python run_PRINCE.py -o results.tsv -d OMIM:606070
PRINCE can also be run with additional parameters, as described below.
The following parameters are accepted by run_PRINCE.py:
-o: Required. Name of the file to which the results are written.
-d: Required. OMIM term specifying the query disease.
Alpha parameter value to use. Default is 0.5.
Number of iterations to complete. Default is 20.
C parameter value to use. Default is -15.
Can be used in benchmarking PRINCE. If an Ensembl gene ID is specified here, then the association between this gene and the query disease is removed from the data used by PRINCE before gene prioritisation. Can only be used when the query disease is specified using an OMIM term.
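The roles of the alpha, iteration and C parameters can be illustrated with a small sketch of PRINCE-style propagation. It follows the general form of the published algorithm, F <- alpha * W' * F + (1 - alpha) * Y, with W' the degree-normalised network. It is not the code in run_PRINCE.py: the function names, the toy network and the logistic offset `d` (set here to log(9999)) are assumptions of this sketch.

```python
import math


def logistic_prior(similarity, c=-15.0, d=math.log(9999)):
    """Map a disease similarity in [0, 1] to a prior; d is an assumed offset."""
    return 1.0 / (1.0 + math.exp(c * similarity + d))


def prince_propagate(adjacency, prior, alpha=0.5, iterations=20):
    """Iterate F <- alpha * W' * F + (1 - alpha) * Y on a small network."""
    n = len(adjacency)
    degree = [float(sum(row)) for row in adjacency]

    def weight(i, j):
        # Symmetric normalisation: W'[i][j] = W[i][j] / sqrt(d_i * d_j).
        if degree[i] == 0 or degree[j] == 0:
            return 0.0
        return adjacency[i][j] / math.sqrt(degree[i] * degree[j])

    scores = list(prior)
    for _ in range(iterations):
        scores = [
            alpha * sum(weight(i, j) * scores[j] for j in range(n))
            + (1.0 - alpha) * prior[i]
            for i in range(n)
        ]
    return scores


# Toy path network 0 - 1 - 2, with prior information on gene 0 only.
toy_network = [[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]]
print(prince_propagate(toy_network, [1.0, 0.0, 0.0]))
```

A larger alpha gives more weight to scores flowing in from network neighbours, and a smaller alpha keeps the result closer to the prior Y.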
The results file generated by PRINCE contains these columns:
- GENE: Ensembl gene identifier.
- SCORE: Gene score.
- Y: Phenotypic-relevance score of each gene (before propagation across the PPI network).
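Once a results file has been written, the top-scoring genes can be extracted with the standard library. This sketch assumes the file is tab-separated with a header row naming the columns listed above, which the .tsv extension suggests but which is not stated explicitly. The function name and the gene identifiers in the example are placeholders.

```python
import csv
import io


def top_genes(handle, score_column="SCORE", n=10):
    """Return the n highest-scoring (gene, score) pairs from a results file."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return [(row["GENE"], float(row[score_column])) for row in rows[:n]]


# In-memory stand-in for a results file; the gene IDs are placeholders.
example = io.StringIO(
    u"GENE\tSCORE\tY\n"
    u"ENSG_A\t0.2\t0.0\n"
    u"ENSG_B\t0.9\t1.0\n"
    u"ENSG_C\t0.5\t0.0\n"
)
print(top_genes(example, n=2))
```

For a real run, open results.tsv and pass the file handle in place of the in-memory example.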