In this project, a naive Bayes model, a support vector machine, and a neural network are implemented to recognize DNA-binding sites on protein sequences. Leave-one-out cross-validation and t-tests indicate that the support vector machine performs best, while the naive Bayes model performs worst.
The list of proteins was obtained from the article, which includes 56 different types of DNA-protein complexes. They are listed below:
All PDB files and interaction files were downloaded from NPIDB.
- Extract the residue sequence from each PDB file into a list that represents one protein chain
- Convert each residue name to its FASTA one-letter code
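The two steps above can be sketched as follows. This is a minimal illustration, not the project's actual parser: it assumes standard fixed-column PDB `ATOM` records, and the function names `extract_chain` and `chain_to_fasta` as well as the `X` symbol for unknown residues are hypothetical.

```python
# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def extract_chain(pdb_lines, chain_id):
    """Return the three-letter residue names of one chain, in file order.

    Assumes fixed-column ATOM records: residue name in columns 18-20,
    chain ID in column 22, residue number plus insertion code in 23-27.
    """
    residues = []
    seen = set()
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        if line[21] != chain_id:
            continue
        key = line[22:27]  # residue number + insertion code
        if key in seen:    # skip the remaining atoms of a residue already recorded
            continue
        seen.add(key)
        residues.append(line[17:20].strip())
    return residues


def chain_to_fasta(residue_names, unknown="X"):
    """Convert three-letter residue names to a one-letter sequence string."""
    return "".join(THREE_TO_ONE.get(name.upper(), unknown) for name in residue_names)
```

Non-standard residues fall back to the assumed `X` symbol here; whether the project drops or keeps such residues is not stated in the text.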
The interaction file records all residues whose distance to the DNA molecule is smaller than 3.5 Angstrom.
- Match the interaction file against the corresponding chain sequence of each protein
- Filter the interaction file to remove unrelated records
- Label each residue listed in the interaction file as 1 and every other residue as 0
- For each protein sequence, add four fake residues at the beginning and four fake residues at the end. In this way, each real residue in the sequence has the same number of neighboring residues for feature extraction (see the figure below).
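The labeling and padding steps could look like the sketch below. The interaction file format is not described here, so the example assumes the binding residues have already been reduced to 0-based positions in the chain; the `-` pad symbol and the function names are hypothetical.

```python
PAD = "-"   # assumed placeholder symbol for a fake residue
FLANK = 4   # four fake residues on each side, as described above


def label_residues(sequence, binding_positions):
    """Return one label per residue: 1 if it contacts DNA, otherwise 0."""
    binding = set(binding_positions)
    return [1 if i in binding else 0 for i in range(len(sequence))]


def pad_sequence(sequence, flank=FLANK, pad=PAD):
    """Add `flank` fake residues to both ends of the sequence."""
    return pad * flank + sequence + pad * flank


labels = label_residues("MKRLA", [1, 2])
padded = pad_sequence("MKRLA")
```

Only the real residues receive labels; the fake residues exist solely to give the first and last residues a full set of neighbors.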
After obtaining the dataset, the next step is to choose the features that represent each instance in the models. Two types of features are considered:
- the neighboring residue sequence.
- the electrostatic potential of the neighboring residues.
Neighboring residue sequence
The neighboring residue sequence has length 9. The first four residues are those before the target residue, the middle residue is the target residue itself, and the last four residues are those after the target residue (see the figure below).
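A minimal sketch of the window extraction, assuming the sequence is a one-letter string and that fake residues are written as `-` (a hypothetical choice); `neighbor_windows` is an illustrative name, not a function from the project.

```python
def neighbor_windows(sequence, flank=4, pad="-"):
    """Return one window of length 2 * flank + 1 per real residue.

    The target residue sits in the middle of its window (index `flank`).
    """
    padded = pad * flank + sequence + pad * flank
    size = 2 * flank + 1
    return [padded[i:i + size] for i in range(len(sequence))]


windows = neighbor_windows("MKRLAG")
# windows[0] is the window for the first residue, M: "----MKRLA"
```

Because the padding is applied first, the number of windows equals the number of real residues, so windows and labels line up one to one.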
Electrostatic Potential
There are many ways to calculate the electrostatic potential. A simple way is to assign each residue a discrete value of -1, 0, or 1 according to its type (see the table below).
Electrostatic potential | Discrete value | Residues |
---|---|---|
Positive | 1 | Arg, Lys, His |
Negative | -1 | Asp, Glu |
Neutral | 0 | All others |
As with the residue sequence, the electrostatic potential of the same 9 neighboring residues is used for each residue in the model.
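The table translates directly into a lookup. The sketch below assumes one-letter residue codes and treats any other symbol, including a fake padding residue, as neutral; the function names are illustrative.

```python
POSITIVE = set("RKH")  # Arg, Lys, His
NEGATIVE = set("DE")   # Asp, Glu


def residue_charge(residue):
    """Map a one-letter residue code to its discrete potential: 1, -1, or 0."""
    if residue in POSITIVE:
        return 1
    if residue in NEGATIVE:
        return -1
    return 0  # all other residues, and (by assumption) fake padding residues


def window_charges(window):
    """Return the discrete potential of every residue in a 9-residue window."""
    return [residue_charge(r) for r in window]


charges = window_charges("--MKRDEAH")
```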
Position-specific scoring matrix (PSSM)
The Support Vector Machine (SVM) and the Neural Network (NN) require numeric input, so the residue-sequence features of each residue must be converted into floating-point numbers. Some researchers have indicated that representing the neighboring residue sequence with a position-specific scoring matrix (PSSM) is a promising approach. The PSSM gives the probability of each residue appearing at a specific position (see the figure below).
After building the PSSM for the training set, we use this matrix to convert the neighboring residue sequence of each residue into a number sequence. This number sequence is the feature vector of an instance.
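One way to read this description is a per-position frequency table, sketched below. This is an assumption about the construction, not the project's confirmed method: the matrix here is estimated from all training windows with a pseudocount, whereas PSSMs are often expressed as log-odds scores against background frequencies. The alphabet, the `-` pad symbol, and the function names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the assumed pad symbol


def build_pssm(windows, alphabet=ALPHABET, pseudocount=1.0):
    """Estimate, for each window position, the probability of each residue."""
    length = len(windows[0])
    pssm = []
    for pos in range(length):
        counts = Counter(w[pos] for w in windows)
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        pssm.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return pssm


def encode_window(window, pssm):
    """Replace each residue in the window with its probability at that position."""
    # Symbols outside the alphabet get 0.0 (an assumption for this sketch).
    return [pssm[pos].get(res, 0.0) for pos, res in enumerate(window)]


training_windows = ["----MKRLA", "---MKRLAG", "--MKRLAG-"]
pssm = build_pssm(training_windows)
features = encode_window("----MKRLA", pssm)  # 9 floats, one per position
```

The matrix is built from the training set only, so each fold of the leave-one-out procedure would need its own matrix to avoid leaking information from the held-out protein.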