A graph neural network model for predicting Cas1 protein.
The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genome editing tool. Cas proteins, including Cas1, play vital roles in acquiring spacer sequences and integrating foreign nucleic acids. In this study, we first gathered and analyzed a comprehensive collection of CRISPR-associated (Cas) proteins, ranging from Cas1 to Cas14. Specifically, we focused on Cas1 and converted these proteins into the simplified molecular-input line-entry system (SMILES) format to construct graph data representing atom and bond features. Next, two GNN models were designed using the directed message passing neural network (DMPNN) framework, and these models were trained on two carefully curated Cas1 graph datasets. Subsequently, the performance of these models on both the training data and newly designed datasets was evaluated, and then compared with a widely used non-deep learning method. Finally, the established models were used to identify new Cas1 proteins within the Ensembl database. Our models demonstrated their effectiveness in identifying previously unknown Cas1 proteins, highlighting their robustness and practical utility. In conclusion, our models serve as a valuable auxiliary tool for Cas1 protein identification, and contribute to the innovative application of SMILES encoding in the study of biomacromolecules.
Cas_data contains all the Cas proteins we collected, from Cas1 to Cas14; Cas_data_TSNE contains the t-SNE analysis based on Cas_data; Cas1_data contains all the training data used in this project. Bacteria protein URL lists the links from which the bacterial proteins can be downloaded.
- Cas_data: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/Cas_data
- Cas_data_TSNE: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/Cas_data_TSNE
- Cas1_data: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/Cas1_data
- fasta2smiles: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/fasta2smiles
- Bacteria protein URL: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/bacteria_protein_URLs
data_for_pre_cas1 contains the dataset predicted by model 1 together with the predicted results; data_for_pre_cas1_use_model2 contains the dataset predicted by model 2 together with the predicted results.
- data_for_pre_cas1: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/data_for_pre_cas1
- data_for_pre_cas1_use_model2: https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/data_for_pre_cas1_use_model2
- Model 1: Cas1_with_nocas_smiles_checkpoints_2023-5-8
  https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/Cas1_with_nocas_smiles_checkpoints_2024-5-8
- Model 2: Cas1_with_nocas_smiles_checkpoints_2023-6-26
  https://github.com/chengaoxiang1985/Cas1_pre/tree/main/data/Cas1_with_nocas_smiles_checkpoints_2024-6-26
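The authors' own conversion code lives in the fasta2smiles directory and is not reproduced here. As a rough sketch of the idea only, the snippet below parses FASTA text with the standard library and converts one sequence to SMILES with RDKit's `Chem.MolFromFASTA` and `Chem.MolToSmiles`. The helper names `read_fasta` and `sequence_to_smiles` are hypothetical, and whether the repository uses these particular RDKit calls is an assumption.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated under the most recent header.
            records[header] += line
    return records


def sequence_to_smiles(sequence: str) -> Optional[str]:
    """Convert an amino-acid sequence to a SMILES string with RDKit.

    RDKit is imported lazily so the FASTA parser can be used without it.
    Returns None when RDKit does not return a molecule.
    """
    from rdkit import Chem  # requires rdkit to be installed

    mol = Chem.MolFromFASTA(sequence)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)
```

With RDKit installed, `sequence_to_smiles(seq)` can be applied to each value returned by `read_fasta(open("proteins.fasta"))`, where `proteins.fasta` is a placeholder file name.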
# Train a new model
import os
import sys

# Select the GPU to use; set to "-1" to run on the CPU instead.
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Path to a local chemprop checkout; adjust to your own environment.
sys.path.append(r"/data/home/chengaoxiang/chemprop")
import chemprop

if __name__ == '__main__':
    # train_data_path, dataset_type and save_model_dir must be defined before this point.
    arguments = [
        '--data_path', train_data_path,
        '--dataset_type', dataset_type,
        '--save_dir', save_model_dir,
        '--num_folds', '3',
        '--hidden_size', '1200',
        '--depth', '3',
        '--dropout', '0.3',
        '--ensemble_size', '5',
        '--ffn_num_layers', '3',
        '--num_workers', '8',
        '--batch_size', '80',
        '--epochs', '30',
    ]
    args = chemprop.args.TrainArgs().parse_args(arguments)
    # print(args)
    # Start training with cross-validation.
    mean_score, std_score = chemprop.train.cross_validate(
        args=args, train_func=chemprop.train.run_training)
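The training script reads `train_data_path` as a CSV file. chemprop expects a header row, with the SMILES string in the first column and the target value(s) in the following columns. The sketch below writes such a file with the standard `csv` module. It is illustrative only: the column names `smiles` and `is_cas1`, the helper name `write_training_csv`, and the use of 0/1 labels for non-Cas1 versus Cas1 are assumptions, not the authors' exact layout.

```python
import csv
from typing import Iterable, Tuple


def write_training_csv(path: str, rows: Iterable[Tuple[str, int]]) -> int:
    """Write (smiles, label) pairs to a CSV file with a header row.

    The column names are illustrative choices (assumption).
    Returns the number of data rows written.
    """
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "is_cas1"])
        for smiles, label in rows:
            if label not in (0, 1):
                raise ValueError(f"label must be 0 or 1, got {label!r}")
            writer.writerow([smiles, label])
            count += 1
    return count
```

For example, `write_training_csv("train.csv", [(smiles_a, 1), (smiles_b, 0)])` would produce a two-row training file, where `train.csv` and the two SMILES variables are placeholders.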
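# Predict with trained models
import os
import sys

# Select the GPU to use; set to "-1" to run on the CPU instead.
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Make a local chemprop checkout importable; adjust to your own layout.
sys.path.append(os.path.join(os.getcwd(), 'chemprop'))
sys.path.append(os.path.dirname(os.getcwd()))
import chemprop

print('Program starts......')
# predict_data_path, predict_result_path and save_model_dir must be defined before this point.
predict_arguments = [
    '--test_path', predict_data_path,
    '--preds_path', predict_result_path,
    '--checkpoint_dir', save_model_dir,
]
predict_args = chemprop.args.PredictArgs().parse_args(predict_arguments)
# Load the checkpoints under save_model_dir and write predictions to predict_result_path.
preds = chemprop.train.make_predictions(args=predict_args)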
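The predictions are written to `predict_result_path` as a CSV file. One possible way to pull out the likely Cas1 candidates is sketched below with the standard `csv` module. The column names, the helper name `select_positive`, and the 0.5 cut-off are assumptions for illustration; the score column is named after the target column of your own training file, and the authors' selection criterion may differ.

```python
import csv
from typing import List, Tuple


def select_positive(preds_path: str,
                    score_column: str,
                    threshold: float = 0.5,
                    smiles_column: str = "smiles") -> List[Tuple[str, float]]:
    """Return (smiles, score) pairs with score >= threshold, highest first.

    The threshold and column names are illustrative assumptions.
    Rows whose score is not numeric are skipped.
    """
    hits: List[Tuple[str, float]] = []
    with open(preds_path, newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                score = float(row[score_column])
            except (TypeError, ValueError):
                continue  # non-numeric or missing prediction
            if score >= threshold:
                hits.append((row[smiles_column], score))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits
```

Raising `threshold` trades recall for precision, so it is worth checking the chosen value against sequences with known labels before trusting the candidates.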
1. Install all packages listed on this page.
2. Download or clone the repository.
3. Follow the examples on this page to train a new model or use our trained models.
(Note that some paths in the code files need to be adjusted to match your own environment.)
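The training and prediction examples on this page refer to several variables that they do not define. A minimal sketch of how these could be set is shown below. Every file and directory name here is a placeholder, and `"classification"` is an assumption based on the Cas1 versus non-Cas1 task; replace all values with your own.

```python
import os

# Placeholder root directory; point this at your own data location.
data_root = os.path.join(os.getcwd(), "data")

# Variables used by the training example (file names are hypothetical).
train_data_path = os.path.join(data_root, "train.csv")
dataset_type = "classification"  # assumption: binary Cas1 / non-Cas1 labels
save_model_dir = os.path.join(data_root, "checkpoints")

# Variables used by the prediction example (file names are hypothetical).
predict_data_path = os.path.join(data_root, "to_predict.csv")
predict_result_path = os.path.join(data_root, "predictions.csv")
```

To use one of the released models instead of a newly trained one, `save_model_dir` would point at the downloaded checkpoint directory.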
- Python 3.9.13
- PyTorch 1.12.1
- chemprop
- rdkit
- numpy
- csv (part of the Python standard library)
- tqdm
- Intel® Xeon(R) Gold 6248R CPU @ 3.00GHz × 4 (Intel, Santa Clara, CA, USA)
- Ubuntu 18.04 (Linux version 4.15.0-112-generic, GCC 7.5.0-3ubuntu1~18.04) and 754 GB RAM
- Four NVIDIA Tesla V100S PCIe GPUs (32GB each, CUDA Version: 11.6)
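Before running the examples, it can help to confirm that the listed packages are importable. The check below is a small sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the import name `torch` is used for PyTorch. It does not verify package versions or GPU availability.

```python
import importlib.util
import sys
from typing import Iterable, List

# Import names of the third-party packages listed above.
REQUIRED = ["torch", "chemprop", "rdkit", "numpy", "tqdm"]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```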
This research is supported by the Key Research Project of Zhejiang Lab (No. 117005-AC2106/001).