Recurrent Neural Network for Understanding Sars-Cov-2 Protein Stability

Allison Li, Jacob Evarts

University of Washington

Introduction

Our aim is to build a recurrent neural network that takes Sars-cov-2 genomic data and reads codons of the spike protein as ‘words’ in a sentence, similar to natural language processing. By doing this across coronavirus samples and strains, the model should be able to learn not only which amino acids, but which codons typically comprise the protein. Finding patterns in the codon sequence could provide information on protein stability and codon transcription preferences by coronavirus.

Installations

Miniconda and mamba are required to run the workflow. After installing Miniconda, install mamba using:

conda install mamba -n base -c conda-forge

Creating an Environment

Create an environment to test this Nextstrain workflow.

mamba env create -n rnn-environment -f envs/config.yaml

Activate the environment to use the workflow.

conda activate rnn-environment

Getting started with own input files

To run the network on your own virus dataset, you will need to provide all the metadata (GenBank Accession Numbers) to your sequences in the form of a csv file as the input to our data parsing workflow. Download the virus metadata off of the BV-BRC Database. Rename the resulting CSV file as 'all-sequences-metadata.csv' and drop into the 'data' folder. Drop your reference sequence fasta and GenBank file into the 'data' folder and name it as 'reference-sequence.fasta' and 'reference-sequence.gb' respectively.

Run workflow for Mac Users

python csv-splitting-script.py; python accession-grabbing-script.py; python full-sequence-fasta-retrieval-script; ./alignment_script.sh; ./gene-parse-script.sh; python codon-parsing-script.py

Run workflow for Windows Users

python csv-splitting-script.py; python accession-grabbing-script.py;python full-sequence-fasta-retrieval-script; dos2unix alignment_script.sh; ./alignment_script.sh; dos2unix gene-parse-script.sh; ./gene-parse-script.sh; python codon-parsing-script.py

Alternatively, if you already have the list of accessions, you can forgo some of the above steps for an expedited pipeline.

Run expedited workflow for Mac Users

./alignment_script.sh; ./gene-parse-script.sh; python codon-parsing-script.py

Run expedited workflow for Windows Users

dos2unix alignment_script.sh; ./alignment_script.sh; dos2unix gene-parse-script.sh; ./gene-parse-script.sh; python codon-parsing-script.py

Data Curation

All sequence data is from Genbank. We filtered for human coronavirus genomes that are at least 80% sequenced in the given gene of interest.

Model Architecture and Pre-Training

Model Design

We built a very simple recurrent neural network architecture that involves a RNN layer, a ReLU activation function, a fully connected layer to map to the number of features that we have, and a softmax layer to get the probabilities. The type of RNN layer, hidden layer size, and number of layers for the RNN layer can be tuned as hyperparameters.

Pre-training Techniques

We have two different types of model to represent each pre-training technique that we used to assess pre-training effects on model performance. The masked language modeling approach randomly masks a given percentage of the input sequence (15 percent by default). The model takes the masked input sequence, and predicts for the original unmasked sequence.

In the autoregressive model, for every codon in the sequence, the model tries to predict for the next codon in the sequence. The input of the model would be the original sequence, and the predicted output would be the sequence shifted by one. This is similar to N-grams.

Running the model

To run a hyperparameter grid search of the models:

python main.py

Paper

Experiments ran with this model and subsequent results can be viewed within this final report linked here.

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
config		config
data		data
images		images
loader		loader
model		model
models		models
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Recurrent Neural Network for Understanding Sars-Cov-2 Protein Stability

Introduction

Installations

Creating an Environment

Getting started with own input files

Data Curation

Model Architecture and Pre-Training

Model Design

Pre-training Techniques

Running the model

Paper

About

Releases

Packages

Contributors 2

Languages

License

allison-li-1016/527-virus-recurrent-learning-model

Folders and files

Latest commit

History

Repository files navigation

Recurrent Neural Network for Understanding Sars-Cov-2 Protein Stability

Introduction

Installations

Creating an Environment

Getting started with own input files

Data Curation

Model Architecture and Pre-Training

Model Design

Pre-training Techniques

Running the model

Paper

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages