Protein Domain Stability Predictor

The goal of this project is to explore how different architectures, datasets, and dataset features affect the accuracy of fitness and stability predictions for protein domains.

Currently, three architectures have been tried:

An RNN autoencoder to represent a sequence, with a downstream fitness predictor.
ESM2 embeddings with a downstream stability prediction module.
AMPLIFY embeddings with a downstream prediction module.
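
To make the "embeddings with a downstream prediction module" pattern concrete, here is a minimal sketch in plain Python. It assumes per-residue embeddings are mean-pooled into one vector per domain and passed to a linear head; the pooling strategy, the function names `mean_pool` and `linear_head`, and the toy numbers are all illustrative assumptions, not the package's actual implementation.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length vector per domain."""
    if not residue_embeddings:
        raise ValueError("expected at least one residue embedding")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


def linear_head(pooled: List[float], weights: List[float], bias: float) -> float:
    """Stand-in for a trained downstream module: a single linear layer."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Toy input: 3 residues, 2 embedding dimensions (real embeddings are far wider).
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy_embeddings)
# Hypothetical weights; in practice these would be learned during training.
score = linear_head(pooled, weights=[0.5, -0.25], bias=0.1)
```

The point of the split is that the upstream model only has to produce a representation, so the same downstream head design can be reused across ESM2 and AMPLIFY embeddings.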

Environment

Set up the environment for this package with conda by running: conda env create -f environment.yml

Usage

Prepare your dataset and update the config.py file with the correct paths and parameters. Ensure you add the details of your dataset to the DATASETS dictionary.
Run the main script with: python src
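
As a rough illustration of the first step, a DATASETS entry might look like the sketch below. The schema is not documented here, so every key name, the file path, and the `missing_keys` helper are hypothetical; check the existing entries in the config for the real field names.

```python
# Hypothetical DATASETS entry: all keys and paths are assumptions for illustration.
DATASETS = {
    "example_dataset": {
        "path": "data/example_dataset.csv",   # assumed location of the data file
        "sequence_column": "aa_seq",          # assumed column holding sequences
        "label_column": "stability",          # assumed column holding the target
        "family_file": None,                  # assumed: only needed for homologous-aware splits
    },
}

# Keys this sketch treats as mandatory (again, an assumption).
REQUIRED_KEYS = {"path", "sequence_column", "label_column"}


def missing_keys(datasets):
    """Return {dataset_name: sorted missing keys} for incomplete entries."""
    problems = {}
    for name, entry in datasets.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems[name] = sorted(missing)
    return problems
```

A check like `missing_keys(DATASETS)` before training can catch an incomplete entry early rather than partway through a run.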

Configuration

All configuration, such as data paths, model parameters, and training parameters, can be found and updated in the config.py file.

Flags

--device - Overrides automatic device selection, to run the package on CPU, MPS, or CUDA.
--embeddings - Can be used to
--splits - Chooses how to split the data: homologous-aware keeps homologous domains in the same split (requires a domain family file), while random assigns domains entirely at random.
--tune - Chooses whether to tune hyperparameters: if left blank, the hyperparameters in the config are used; grid-search iterates over every possible combination of values within the config ranges; random-search tries random values within the config ranges.
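
The idea behind the homologous-aware option can be sketched as a group-wise split: whole families, rather than individual domains, are assigned to a split. The function below is an illustrative assumption (its name, the two-way train/test layout, and the greedy fill are not taken from the package), shown only to clarify why a domain family file is needed.

```python
import random
from collections import defaultdict


def homology_aware_split(domain_to_family, test_fraction=0.2, seed=0):
    """Assign whole families to train or test so no family spans both."""
    families = defaultdict(list)
    for domain, family in domain_to_family.items():
        families[family].append(domain)

    # Shuffle family IDs (not domains) so that families stay intact.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    target = test_fraction * len(domain_to_family)
    train, test = [], []
    for family in family_ids:
        # Fill the test split family by family until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(families[family])
    return train, test


# Toy mapping of domain IDs to family IDs (hypothetical identifiers).
toy_mapping = {"d1": "A", "d2": "A", "d3": "B", "d4": "C", "d5": "C", "d6": "D"}
train_ids, test_ids = homology_aware_split(toy_mapping, test_fraction=0.3, seed=1)
```

Because families are kept whole, the realised test fraction is only approximate, and a single large family can overshoot the target. The trade-off is that test performance is not inflated by near-duplicate sequences seen during training.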

For example, python src --splits homologous-aware --tune random-search would run on the automatically determined best-available device, split the data so that homologous domains are in the same split, and search for the optimal hyperparameters randomly.
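
For readers who want to see how such flags are typically declared, here is a minimal argparse sketch. It is an assumption about the general shape, not a copy of __main__.py: the lowercase device spellings, the defaults, and the `build_parser` name are hypothetical, and --embeddings is left out because its accepted values are not described above.

```python
import argparse


def build_parser():
    """Declare the documented flags; defaults here are illustrative assumptions."""
    parser = argparse.ArgumentParser(prog="python src")
    parser.add_argument(
        "--device",
        choices=["cpu", "mps", "cuda"],  # assumed spellings
        default=None,                    # None = pick the best available device
    )
    parser.add_argument(
        "--splits",
        choices=["homologous-aware", "random"],
        default="random",                # assumed default
    )
    parser.add_argument(
        "--tune",
        choices=["grid-search", "random-search"],
        default=None,                    # None = use hyperparameters from the config
    )
    return parser


# Parse the example command line from the paragraph above.
args = build_parser().parse_args(
    ["--splits", "homologous-aware", "--tune", "random-search"]
)
```

Using `choices` means a misspelled option fails immediately with a usage message instead of silently falling back to a default.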

Modules

__main__.py - The entry point, which reads flags and executes functions based on the user's selection.
config_loader.py - Reads the values in the config file and parses them for the rest of the package to use.
datasets.py - Defines the dataset class to store data in, and contains a function to set up datasets from config.
helpers.py - Contains small functions used by other modules.
inference.py - Contains functions to run trained models to get predictions from datasets.
models.py - Contains model classes, and a function to set them up from config.
preprocessing.py - Contains functions for processing data before it is used for training or inference.
embeddings.py - Contains functions for fetching and using protein language models to generate upstream embeddings.
splits.py - Contains functions to read homology files and split data based on config.
training.py - Contains functions to train models on datasets.
tuning.py - Contains functions to find optimal hyperparameters.
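
The difference between the two --tune strategies can be sketched as follows. This assumes each hyperparameter range in the config is a list of discrete candidate values; the function names, the `ranges` dictionary, and its keys are hypothetical and not taken from tuning.py.

```python
import itertools
import random


def grid_candidates(ranges):
    """Yield every combination of values: exhaustive grid search."""
    keys = list(ranges)
    for values in itertools.product(*(ranges[key] for key in keys)):
        yield dict(zip(keys, values))


def random_candidates(ranges, n_trials, seed=0):
    """Yield n_trials independently sampled combinations: random search."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {key: rng.choice(values) for key, values in ranges.items()}


# Hypothetical ranges; real names and values would come from the config.
ranges = {"learning_rate": [1e-4, 1e-3], "hidden_size": [64, 128, 256]}

grid = list(grid_candidates(ranges))                   # 2 * 3 = 6 combinations
sampled = list(random_candidates(ranges, n_trials=4))  # 4 random draws
```

Grid search cost grows multiplicatively with each added hyperparameter, whereas random search has a fixed trial budget, which is why it is often preferred when the ranges are large.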

Ronnie-Crawford/protein-stability-prediction

Folders and files

Latest commit

History

Repository files navigation

Protein Domain Stability Predictor

Environment

Usage

Configuration

Flags

Modules

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages