converting to conda
PedroMTQ committed Feb 14, 2022
1 parent 58b4d20 commit e0350d2
Showing 74 changed files with 931 additions and 43,089 deletions.
130 changes: 130 additions & 0 deletions .gitignore
@@ -0,0 +1,130 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
**/__pycache__

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
1 change: 1 addition & 0 deletions Images/__init__.py
@@ -0,0 +1 @@

File renamed without changes
6 changes: 6 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,6 @@
recursive-include Workflows/ *
include Images/*
recursive-include Resources *
include LICENSE
include README.md
recursive-include drax *
64 changes: 32 additions & 32 deletions README.md
@@ -15,19 +15,17 @@ It was built to aid in the mapping of metabolism related entities, for example t
6. *optional* Export files in data to `DRAX/Resources/metacyc/`
7. *optional* `DRAX/Resources/metacyc/` should contain: `compounds.dat`, `proteins.dat`, `reactions.dat`, `gene-links.dat`, and `genes.dat`.


**Using Metacyc is technically optional (as are all databases), but since it contains high-quality information, using it is usually desirable.**

## Using DRAX



You can run the code below to test the execution:

-python DRAX --example
+drax --example

A typical run would look like:

-python DRAX -i input.tsv
+drax -i input.tsv



@@ -36,7 +34,7 @@ To avoid overloading these database websites, a 10 seconds pause between request
DRAX accepts the following parameters:


-python DRAX -i input_path -o output_folder -db metacyc,kegg,hmdb,rhea,uniprot,pubchem -pt 10
+drax -i input_path -o output_folder -db metacyc,kegg,hmdb,rhea,uniprot,pubchem -pt 10

Mandatory arguments: --input_path / -i
Optional arguments: --output_folder / -o
@@ -65,29 +63,31 @@ For more information on the workflows go to the respective [folder](Workflows/)
The input file should be a tab separated file that looks something like this:


-| ID | ID type | entity type | search mode |
-|-----------------------------|------------|-------------|-------------|
-| HS08548 | metacyc | gene | |
-| HMDBP00087 | hmdb | gene | gp |
-| hsa:150763 | kegg | gene | global |
-| P19367 | uniprot | gene | |
-| 2.7.1.1 | enzyme_ec | protein | pr |
-| 2.7.1.2 | kegg | protein | pg |
-| 2.7.1.3 | metacyc | protein | prc |
-| HMDBP00609 | hmdb | protein | |
-| K00844 | kegg_ko | protein | |
-| P19367 | uniprot | protein | prc,pg |
-| PROTOHEMEFERROCHELAT-RXN | metacyc | reaction | |
-| 14073 | hmdb | reaction | |
-| R02887 | kegg | reaction | rpg,rc |
-| 10000 | rhea | reaction | rp |
-| CPD-520 | metacyc | compound | |
-| 27531 | chebi | compound | cp |
-| 937 | chemspider | compound | cprg |
-| HMDB0000538 | hmdb | compound | c |
-| C00093 | kegg | compound | |
-| XLYOFNOQVPJJNP-UHFFFAOYSA-N | inchi_key | compound | |
-| water | synonyms | compound | cr |
+| ID | ID type | entity type | search mode |
+|-----------------------------|-----------|-------------|-------------|
+| HS08548 | metacyc | gene | |
+| HMDBP00087 | hmdb | gene | gp |
+| hsa:150763 | kegg | gene | global |
+| P19367 | uniprot | gene | |
+| 2.7.1.1 | enzyme_ec | protein | pr |
+| 2.7.1.2 | kegg | protein | pg |
+| 2.7.1.3 | metacyc | protein | prc |
+| FERREDOXIN-MONOMER | metacyc | protein | prc |
+| HMDBP00609 | hmdb | protein | |
+| K00844 | kegg_ko | protein | |
+| P19367 | uniprot | protein | prc,pg |
+| PROTOHEMEFERROCHELAT-RXN | metacyc | reaction | |
+| 14073 | hmdb | reaction | |
+| R02887 | kegg | reaction | rpg,rc |
+| 10000 | rhea | reaction | rp |
+| CPD-520 | metacyc | compound | |
+| 27531 | chebi | compound | cp |
+| 962 | pubchem | compound | cprg |
+| HMDB0000538 | hmdb | compound | c |
+| C00093 | kegg | compound | |
+| XLYOFNOQVPJJNP-UHFFFAOYSA-N | inchi_key | compound | |
+| h1H20A | inchi | compound | |
+| water | synonyms | compound | cr |
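
As a minimal sketch of how such an input file could be produced programmatically, the Python snippet below serializes a few rows from the table into tab-separated text. The helper name `to_tsv` and the `ROWS` list are illustrative and not part of DRAX. The sketch assumes one seed ID per line, the four columns in the order shown above, and no header row; whether DRAX expects a header is not stated here.

```python
import csv
import io

# Illustrative rows taken from the table above:
# ID, ID type, entity type, search mode (the search mode may be empty).
ROWS = [
    ("hsa:150763", "kegg", "gene", "global"),
    ("2.7.1.1", "enzyme_ec", "protein", "pr"),
    ("P19367", "uniprot", "protein", "prc,pg"),
    ("C00093", "kegg", "compound", ""),
]


def to_tsv(rows):
    """Serialize rows as tab-separated text, one seed ID per line.

    Assumption: no header row is written.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(ROWS)
```

Writing `tsv_text` to a file such as `input.tsv` would give something that can be passed to `drax -i input.tsv`. Multiple search modes stay in a single column, separated by commas, as in the `prc,pg` row.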

Each column is described below:

@@ -153,7 +153,7 @@ Several IDs are allowed per biological instance:

### Output

-DRAX outputs 5 tsv files: `Compounds.tsv`,`Reactions.tsv`,`Proteins.tsv`,`Genes.tsv`, and `Graph.tsv`
+DRAX outputs 5 files: four tsv files (`Compounds.tsv`, `Reactions.tsv`, `Proteins.tsv`, `Genes.tsv`) and `Graph.sif`.
Each of the first 4 files contains multiple instances (e.g., compounds), each with a tab-separated list of identifiers or other relevant information.
Specifically, all instances contain an `internal_id`, which can be used for cross-linking in graph-based approaches (e.g., `manuscript.py`), and often a list of identifiers and synonyms.
In the case of reactions, proteins and genes, cross-linking is available in the form of `<instance>_connected`. For example, if the user searches for all reactions of a set of proteins, then the retrieved proteins would have a list of `reactions_connected:<reaction internal_id>` depicting which reactions this protein is connected to. The same would apply for other search modes or search starting points.
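
To illustrate how these files might be consumed, here is a minimal Python sketch that groups the fields of one output line by key. It assumes each tab-separated field has the form `key:value` (as in `internal_id:270` and `reactions_connected:<reaction internal_id>`); the function name `parse_entity_line` and the sample line are hypothetical, and the real files may contain fields this sketch does not anticipate.

```python
from collections import defaultdict


def parse_entity_line(line):
    """Group the `key:value` fields of one tab-separated line by key."""
    fields = defaultdict(list)
    for field in line.rstrip("\n").split("\t"):
        if not field:
            continue
        # Split on the first colon only, since values may contain colons
        # (e.g. a KEGG gene ID such as hsa:150763).
        key, _, value = field.partition(":")
        fields[key].append(value)
    return dict(fields)


# Hypothetical sample line, built from the IDs mentioned in this README.
record = parse_entity_line(
    "internal_id:270\tenzyme_ec:2.7.8.26\treactions_connected:25550"
)
```

Here `record["reactions_connected"]` holds the internal IDs of the reactions linked to the protein, which can then be looked up in `Reactions.tsv`.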
@@ -170,11 +170,11 @@ Using the example above as an example (with input the enzyme EC 2.7.8.26), the o
As can be seen, the protein (i.e., `internal_id:270`) shown above is connected to the reaction `25550`, which in turn is described as the following interaction between compounds: 10310 + 6731 => 21252 + 24415 + 8385. These compounds are then listed in `Compounds.tsv` as shown above. For visualization purposes, only a small excerpt is shown above.


-The `Graph.tsv` file contains edges between nodes (i.e. entities). For example since the protein with the internal id 270 is connected to the reaction with the internal id 25550, then there will be an edge between the **source** node 270 and the **target** node 25550. The third column in this file contains the type of interaction, which in this case would be from protein to reaction, i.e., **pr**.
+The `Graph.sif` file contains edges between nodes (i.e., entities). For example, since the protein with the internal id 270 is connected to the reaction with the internal id 25550, there will be an edge between the **source** node 270 and the **target** node 25550. The third column in this file contains the type of interaction, which in this case would be from protein to reaction, i.e., **pr**.
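
A minimal sketch of reading these edges in Python follows. It assumes tab-separated lines with the column order described above (source, target, interaction type); the function names `read_edges` and `neighbours` are illustrative and not part of DRAX. Note that the common Cytoscape SIF convention places the interaction type in the middle column, so the column order should be checked against an actual output file.

```python
from collections import defaultdict


def read_edges(lines):
    """Parse lines into (source, target, interaction) tuples.

    Assumption: columns are tab-separated, in the order
    source, target, interaction type.
    """
    edges = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        source, target, interaction = parts[:3]
        edges.append((source, target, interaction))
    return edges


def neighbours(edges, interaction):
    """Map each source node to its targets for one interaction type."""
    adjacency = defaultdict(set)
    for source, target, kind in edges:
        if kind == interaction:
            adjacency[source].add(target)
    return dict(adjacency)


# The protein-to-reaction edge from the example above.
edges = read_edges(["270\t25550\tpr\n", "\n"])
protein_to_reaction = neighbours(edges, "pr")
```

With a real file, the lines could come from `open("Graph.sif")`, and the resulting adjacency mapping can be combined with the `internal_id` values in the four tsv files.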

### Example run

-![example_run](Figures/example.png)
+![example_run](Images/example.png)

The example contains two inputs: the first input line contains the KEGG gene edh:EcDH1\_1436 with the search mode "gp", and the second input line contains the ChEBI compound ID 17968 with the search mode "crp". DRAX starts by searching for information regarding the seed gene KEGG ID edh:EcDH1\_1436, parsing the result, creating a gene entity, and retrieving the connected protein IDs (here, the UniProt IDs P76458 and P23673). Since the search mode is "gp", DRAX will do a new web query and search for the protein IDs in the available databases. DRAX now retrieves information on these two proteins, creates two protein entities (one for each UniProt ID), and stops here. The connections between the gene seed entity and the protein entities constitute direct connections.
In the second seed input, DRAX receives the ChEBI compound ID 17968, which it then cross-links to other databases through a ChEBI SQL database. This cross-linking connects the ChEBI ID 17968 to the metacyc ID BUTYRIC\_ACID, KEGG ID C00246 and HMDB ID HMDB0039. DRAX then searches for information on these three compound IDs on each respective database, retrieving also information on the reactions these compounds are involved in. The information from each database is then merged internally into one single compound entity. Since the search mode is "crp", DRAX starts a new round of data retrieval, querying each database for each reaction connected to the previous compound entity. During this reaction data retrieval, DRAX also searches for information on all other compounds involved in these reactions (not shown in figure, but, e.g., for the reaction "Butanoyl-CoA + Acetate <=> Butanoic acid + Acetyl-CoA", DRAX will not search for information on the seed compound "Butanoic acid" since it was previously searched, but it will search for information on the "Butanoyl-CoA", "Acetate", and "Acetyl-CoA" compounds). During this reaction search, DRAX finds information on the proteins linked to these reactions, and, since the search mode is "crp", it then searches for information on these proteins. Among these proteins, there are two proteins (i.e., P76458 and P23673) which have been previously searched and so DRAX does not repeat the same web query, it simply connects the reaction entities to these already existing protein entities. Having finished reading the inputs, DRAX then outputs all this information, linking entities in a graph-based manner (notice how the two seed input IDs are connected in the output graph).