README

Introduction

In this project, a naive Bayes model, a support vector machine, and a neural network are implemented to recognize DNA-binding sites in protein sequences. Leave-one-out cross-validation and a t-test indicate that the support vector machine performs best, while the naive Bayes model performs worst.

Data:

Data Collection

The list of proteins was obtained from the article, which includes 56 different types of DNA-protein complexes. They are listed below:

All the PDB files and interaction files were downloaded from NPIDB.

Data Preprocessing

Generate the residue sequence of the protein from the PDB file

Extract the residue sequence from the PDB file into a list that represents one protein chain
Convert each residue name to its FASTA one-letter code
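
The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes standard fixed-column PDB `ATOM` records, and the names `THREE_TO_ONE` and `chain_residues` are hypothetical. Mapping non-standard residues to `X` is also an assumption.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_residues(pdb_lines, chain_id):
    """Return [(residue_number, one_letter_code), ...] for one chain.

    Hypothetical helper: reads fixed-column ATOM records and keeps the
    first atom seen for each residue, in file order.
    """
    residues = []
    seen = set()
    for line in pdb_lines:
        # Skip HETATM, TER, and other records, plus truncated lines.
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        if line[21] != chain_id:           # column 22: chain identifier
            continue
        res_key = line[22:27]              # residue number + insertion code
        if res_key in seen:
            continue
        seen.add(res_key)
        res_name = line[17:20].strip()     # columns 18-20: residue name
        # Assumption: unknown or modified residues become "X".
        residues.append((int(line[22:26]), THREE_TO_ONE.get(res_name, "X")))
    return residues
```

Calling `chain_residues(open(path).read().splitlines(), "A")` on a PDB file would return the numbered one-letter sequence of chain A; keeping the residue numbers makes it possible to match residues against the interaction file later.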

Generate the label sequence of the protein from the PDB file and the interaction file

The interaction file records every residue whose distance to the DNA molecule is less than 3.5 Angstroms.

Check the interaction file against the corresponding chain sequence of the specific protein
Filter the interaction file to remove unrelated records
Set the label of each residue listed in the interaction file to 1 and all other labels to 0
For each protein sequence, add four fake residues at the beginning and four fake residues at the end. In this way, every real residue in the sequence has the same number of neighboring residues for feature extraction (see the figure below).
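
The labeling and padding steps above might look like the sketch below. The format of the interaction file is not described here, so the sketch assumes it has already been parsed and filtered into a collection of residue numbers for one chain. The function names and the `-` pad symbol are hypothetical.

```python
PAD = "-"         # assumed symbol for a fake residue
HALF_WINDOW = 4   # four neighbors on each side of the target residue


def label_chain(residue_numbers, binding_numbers):
    """Return 1 for residues listed as DNA-contacting, otherwise 0.

    residue_numbers: residue numbers of the chain, in sequence order.
    binding_numbers: residue numbers taken from the (already filtered)
    interaction records for the same chain.
    """
    binding = set(binding_numbers)
    return [1 if number in binding else 0 for number in residue_numbers]


def pad_sequence(sequence, pad=PAD, n=HALF_WINDOW):
    """Add n fake residues to each end of a residue sequence."""
    return [pad] * n + list(sequence) + [pad] * n
```

Only the residue sequence is padded; the label list keeps one entry per real residue, so the fake residues never become training instances themselves.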

Features

After obtaining the dataset, the next step is to choose the features that describe each instance in the models. Two types of features are considered:

The neighboring residue sequence
The electrostatic potential of the residues in the neighboring residue sequence

Neighboring residue sequence

The neighboring residue sequence has length 9. The first four residues are those before the target residue, the middle residue is the target residue itself, and the last four residues are those after it (see the figure below).
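
As a rough sketch of this windowing, the function below slides a 9-residue window over a sequence that has already been padded with four fake residues at each end. The function name and the `-` pad symbol are assumptions made for illustration.

```python
def neighbor_windows(padded_sequence, half_window=4):
    """Return one window of 2 * half_window + 1 residues per real residue.

    padded_sequence must already carry half_window fake residues at each
    end, so every real residue has a full set of neighbors.
    """
    padded = list(padded_sequence)
    size = 2 * half_window + 1
    n_real = len(padded) - 2 * half_window
    # Window i starts at padded index i and is centered on real residue i.
    return ["".join(padded[i:i + size]) for i in range(n_real)]
```

For the padded sequence `----ACDEFGHIK----`, the first window is `----ACDEF` (centered on `A`) and the fifth is `ACDEFGHIK` (centered on `F`).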

Electrostatic Potential

There are many ways to calculate the electrostatic potential. A simple way is to assign each residue a discrete value of -1, 0, or 1 according to its type (see the table below).

Electrostatic potential (discrete value)	Residues
Positive (1)	Arg, Lys, His
Negative (-1)	Asp, Glu
Neutral (0)	All others

As with the neighboring residue sequence, the model uses the electrostatic potential of all 9 residues in the window around each target residue.
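
A minimal sketch of this encoding is shown below, assuming one-letter residue codes, with positive residues mapped to 1 and negative residues to -1. The names are hypothetical, and treating fake (padding) residues as neutral is an assumption.

```python
POSITIVE = set("RKH")   # Arg, Lys, His
NEGATIVE = set("DE")    # Asp, Glu


def residue_potential(residue):
    """Map a one-letter residue code to a discrete potential of 1, -1, or 0."""
    if residue in POSITIVE:
        return 1
    if residue in NEGATIVE:
        return -1
    return 0  # all other residues; padding symbols also land here (assumed)


def window_potentials(window):
    """Encode a 9-residue window as a list of 9 discrete potentials."""
    return [residue_potential(residue) for residue in window]
```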

Position-specific scoring matrix (PSSM)

For the Support Vector Machine (SVM) and the Neural Network (NN), the features of each residue, which are residue sequences, must be converted into floating-point numbers. Some researchers have indicated that representing the neighboring residue sequence with a position-specific scoring matrix (PSSM) is a promising approach. A PSSM gives the probability of each residue appearing at each position (see the figure below).

After building the PSSM from the training set, we use it to convert the neighboring residue sequence of each residue into a sequence of numbers. This sequence of numbers is the feature vector for an instance.
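
One way to read this procedure is sketched below: count how often each residue occurs at each window position in the training set, normalize the counts to probabilities, and replace every residue in a window with the probability stored for it at that position. This follows the probability-based description above rather than the log-odds scores often used for PSSMs. The function names, the `ALPHABET` string, the pseudocount, and the mapping of unknown symbols to `X` are all assumptions.

```python
from collections import Counter

# 20 standard residues plus "X" (unknown) and "-" (fake residue); assumed.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX-"


def _symbol(residue):
    """Map any symbol outside ALPHABET to "X" (assumption)."""
    return residue if residue in ALPHABET else "X"


def build_pssm(windows, pseudocount=1.0):
    """Return one {residue: probability} dict per window position."""
    length = len(windows[0])
    total = len(windows) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(_symbol(window[pos]) for window in windows)
        pssm.append({r: (counts[r] + pseudocount) / total for r in ALPHABET})
    return pssm


def encode_window(window, pssm):
    """Convert a residue window into a list of position-specific probabilities."""
    return [pssm[pos][_symbol(residue)] for pos, residue in enumerate(window)]
```

The pseudocount keeps residues that never appear at a position in the training set from receiving a probability of exactly zero; setting it to 0 gives the raw frequencies.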

Techniques

Naive Bayes model (Reference)
Support Vector Machine (Reference)
Artificial Neural Network (Reference)
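
To illustrate how the first technique can consume the neighboring residue sequence directly, here is a minimal categorical naive Bayes sketch with leave-one-out evaluation. It is not the project's implementation: the function names, the Laplace smoothing constant `alpha`, and the alphabet size of 22 are assumptions, and the loop leaves out one instance at a time, which may differ from how the project splits its data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(windows, labels):
    """Count class frequencies and per-position residue frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][position][residue] -> count
    feature_counts = {label: defaultdict(Counter) for label in class_counts}
    for window, label in zip(windows, labels):
        for pos, residue in enumerate(window):
            feature_counts[label][pos][residue] += 1
    return class_counts, feature_counts


def predict(model, window, alphabet_size=22, alpha=1.0):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for pos, residue in enumerate(window):
            count = feature_counts[label][pos][residue]
            # Laplace-smoothed likelihood of this residue at this position.
            score += math.log((count + alpha) / (n + alpha * alphabet_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_accuracy(windows, labels):
    """Train on all instances but one, test on the held-out one, repeat."""
    correct = 0
    for i in range(len(windows)):
        model = train_naive_bayes(windows[:i] + windows[i + 1:],
                                  labels[:i] + labels[i + 1:])
        correct += predict(model, windows[i]) == labels[i]
    return correct / len(windows)
```

Per-fold scores collected this way for two models could then be compared with a paired t-test, in the spirit of the comparison described in the introduction.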
