You can generate a similar app to the one described at https://github.com/dataprofessor/bioactivity-prediction-app
Around 60% of the notebook code, especially the key method for calculating molecular descriptors, is borrowed from https://github.com/dataprofessor, as acknowledged inside the notebooks.
Around 40% of the notebook code is written from scratch, especially the local version. In case you need a reference, the following may be useful:
The paper describing how to calculate molecular descriptors: https://peerj.com/articles/2322/
The ChEMBL database, from which we obtain the public data: https://www.ebi.ac.uk/chembl/
scikit-learn, used to build the regression model: https://scikit-learn.org/stable/
Windows, Mac, or Linux are all fine as long as you have Jupyter Notebook installed. If you have never used Jupyter Notebooks, here is a great article to get started: https://link.springer.com/protocol/10.1007/978-1-0716-0150-1_3.
Windows: download the ZIP file directly and extract it.
Mac or any Unix-like terminal:
git clone https://github.com/quantaosun/QSAR-COVID-19.git
Then use Jupyter Notebook to run the notebooks in order.
Although the model fits the training set well, its performance on the test set is not good.
[Figure: two plots, one with 100% of the data as the training set, the other with 80% as the training set and 20% tested]
As of January 9th, 2022, the pandemic has lasted almost two years. Alongside the unprecedented number of infections, there has been a large increase in related research data, such as chemical compounds that could potentially inhibit the virus. At the time of writing, around 14,355 small-molecule bioactivities have been recorded in the public ChEMBL database.
A fundamental question, then, is: can we build a QSAR model for all these data? This is precisely what this project tries to do. Not all bioactivities are comparable to each other, so this QSAR model will not be perfect, but it will give you a sense of the progress in developing small-molecule inhibitors against COVID-19.
The QSAR model takes molecular descriptors as independent variables and bioactivities as dependent variables, and is fitted with a random forest regression model. It can be built from either public ChEMBL bioactivity data or your own local bioactivity data, for SARS-CoV-2 or, if you change the target, for any other target.
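To make the setup above concrete, here is a minimal sketch of a random forest regression on a descriptor matrix, with 20% of the molecules held out for testing. It is not the notebooks' actual code: the descriptor matrix and activity values are synthetic stand-ins, and the hyperparameters (`n_estimators=100`, the random seeds) are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 "molecules", each described by
# 50 binary fingerprint-like descriptors, with a made-up activity value that
# depends on a few of the bits plus noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = 5.0 + X[:, 0] + 0.5 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(0, 0.3, size=200)

# Descriptors are the independent variables, activities the dependent variable.
# Keep 80% for training and hold out 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2).
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"R^2 on training set: {r2_train:.2f}")
print(f"R^2 on held-out test set: {r2_test:.2f}")
```

Comparing the two scores is a quick way to see how well a model that fits its training data carries over to molecules it has not seen.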
Run 1_public.ipynb, 2_build_public, 3_build_public, and 5_build_all in sequence to build a QSAR model, then run 3_external_prediction and 5_external_prediction to predict unknown molecules.
Before you can do the external prediction, create a file called "unknow.txt" containing all the SMILES you want to predict or validate.
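As a rough illustration, the input file for external prediction could be written like this. The one-SMILES-per-line layout is an assumption made by analogy with "structures.txt"; the three molecules are arbitrary examples, not known inhibitors.

```python
from pathlib import Path

# Arbitrary example molecules (assumption: for illustration only).
smiles_to_predict = [
    "CCO",                    # ethanol
    "c1ccccc1",               # benzene
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
]

# Write one SMILES string per line to the file name used by the notebooks.
Path("unknow.txt").write_text("\n".join(smiles_to_predict) + "\n")

# Read the file back to confirm the layout.
lines = Path("unknow.txt").read_text().splitlines()
print(len(lines), "molecules written")
```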
Run 1_local.ipynb, 3_build_local, and 5_build_all in sequence to build a QSAR model, then run 3_external_prediction and 5_external_prediction to predict unknown molecules.
Before you can run the local version, do the following:
- Prepare a text file that contains the SMILES strings of all the molecules. You can obtain them from ChemDraw or by any other means you prefer. Note that there are several variants of SMILES; what I used here is the conventional one. Put one string per line and save the file as "structures.txt"; see the attached example.
- Prepare another text file, "bioactivity.txt", that contains all the bioactivities, one per line, in the same order as the molecules in the first file; see the attached example.
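Because the two files are matched purely by line number, it can help to check them before running the notebooks. The sketch below is not part of the repository: `load_pairs` is a hypothetical helper, and the demo values are invented. It pairs each SMILES string with the activity on the same line and fails loudly if the line counts differ.

```python
import tempfile
from pathlib import Path


def load_pairs(structures_path, activities_path):
    """Pair each SMILES string with the activity value on the same line."""
    smiles = [s.strip() for s in Path(structures_path).read_text().splitlines() if s.strip()]
    values = [v.strip() for v in Path(activities_path).read_text().splitlines() if v.strip()]
    if len(smiles) != len(values):
        raise ValueError(
            f"{len(smiles)} structures but {len(values)} bioactivities"
        )
    # float() also raises ValueError if an activity is not a number.
    return [(s, float(v)) for s, v in zip(smiles, values)]


# Demo with throwaway files so no real data is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    structures = Path(tmp) / "structures.txt"
    activities = Path(tmp) / "bioactivity.txt"
    structures.write_text("CCO\nc1ccccc1\n")
    activities.write_text("1200\n350.5\n")
    pairs = load_pairs(structures, activities)

print(pairs)
```

With your own data, the call would simply be `load_pairs("structures.txt", "bioactivity.txt")`.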
Before you can do the external prediction, create a file called "unknow.txt" containing all the SMILES you want to predict or validate.
For the sake of this example, I only used 139 molecules with IC50 values, but there are thousands of other bioactivities available, so check them out yourself and see if you can improve the model performance. Alternatively, you can use Kd, Ki, or any other measure you like as the bioactivity; just remember that you can't build a QSAR model using different types.
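One simple way to respect that restriction is to filter the records down to a single bioactivity type before building the model. The sketch below uses invented records; the field names (`standard_type`, `standard_value`, `canonical_smiles`) are assumed to mirror a ChEMBL activity export and may differ from what the notebooks actually use.

```python
from collections import Counter

# Invented records for illustration only (assumption: ChEMBL-like field names).
records = [
    {"canonical_smiles": "CCO", "standard_type": "IC50", "standard_value": 1200.0},
    {"canonical_smiles": "c1ccccc1", "standard_type": "Ki", "standard_value": 85.0},
    {"canonical_smiles": "CCN", "standard_type": "IC50", "standard_value": 430.0},
    {"canonical_smiles": "CCC", "standard_type": "Kd", "standard_value": 15.0},
]

# See how many records there are of each bioactivity type.
print(Counter(r["standard_type"] for r in records))

# Keep only one type so every value in the training set is comparable.
chosen_type = "IC50"
training_records = [r for r in records if r["standard_type"] == chosen_type]

# Guard against accidentally mixing types.
assert len({r["standard_type"] for r in training_records}) == 1
print(len(training_records), "records kept for", chosen_type)
```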