This is the repository for MolScribe, an image-to-graph model that translates a molecular image to its chemical structure. Try our demo on HuggingFace first!
@article{qian2022robust,
title={Robust Molecular Image Recognition: A Graph Generation Approach},
author={Qian, Yujie and Tu, Zhengkai and Guo, Jiang and Coley, Connor W and Barzilay, Regina},
journal={arXiv preprint arXiv:2205.14311},
year={2022}
}
Run the following commands to install the package and its dependencies:
git clone git@github.com:thomas0809/MolScribe.git
cd MolScribe
python setup.py install
Download the MolScribe checkpoint from HuggingFace Hub and predict molecular structures:
import torch
from molscribe import MolScribe
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download("yujieq/MolScribe", "swin_base_char_aux_1m.pth")
model = MolScribe(ckpt_path, device=torch.device('cpu'))
smiles, molblock = model.predict_image_file('assets/example.png')
Alternatively, manually download the checkpoint and instantiate MolScribe with the local path.
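As a minimal sketch of the local-path option, the snippet below checks that the checkpoint file exists before passing it to `MolScribe`. The helper names `resolve_checkpoint` and `load_local_model` and the `ckpts/swin_base_char_aux_1m.pth` location are assumptions made for illustration; only the `MolScribe(...)` constructor and `predict_image_file` calls come from the example above.

```python
from pathlib import Path


def resolve_checkpoint(path_str):
    """Return the checkpoint as a Path, or raise if the file is missing."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    return path


def load_local_model(ckpt_path, device_name="cpu"):
    """Instantiate MolScribe from a local checkpoint (hypothetical helper)."""
    # Imports are deferred so the path check works without the heavy dependencies.
    import torch
    from molscribe import MolScribe

    ckpt = resolve_checkpoint(ckpt_path)
    return MolScribe(str(ckpt), device=torch.device(device_name))


if __name__ == "__main__":
    # Assumed local path; point this at wherever the checkpoint was saved.
    model = load_local_model("ckpts/swin_base_char_aux_1m.pth")
    smiles, molblock = model.predict_image_file("assets/example.png")
    print(smiles)
```

Failing early on a missing file gives a clearer error than one raised from inside model loading.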
For development or reproducing the experiments, follow the instructions below.
Install the required packages
pip install -r requirements.txt
For training or evaluation, please download the corresponding datasets to data/.
Training data:
Datasets | Description |
---|---|
USPTO (Download) | Downloaded from USPTO, Grant Red Book. |
PubChem (Download) | Molecules are downloaded from PubChem, and images are dynamically rendered during training. |
Benchmarks:
Category | Datasets | Description |
---|---|---|
Synthetic (Download) | Indigo, ChemDraw | Images are rendered by Indigo and ChemDraw. |
Realistic (Download) | CLEF, UOB, USPTO, Staker, ACS | CLEF, UOB, and USPTO are downloaded from https://github.com/Kohulan/OCSR_Review. Staker is downloaded from https://drive.google.com/drive/folders/16OjPwQ7bQ486VhdX4DWpfYzRsTGgJkSu. ACS is a new dataset that we collected ourselves. |
Perturbed (Download) | CLEF, UOB, USPTO, Staker | Downloaded from https://github.com/bayer-science-for-a-better-life/Img2Mol/ |
Our model checkpoints can be downloaded from Dropbox or HuggingFace Hub.
Model architecture:
- Encoder: Swin Transformer, Swin-B.
- Decoder: Transformer, 6 layers, hidden_size=256, attn_heads=8.
- Input size: 384x384
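For reference, the hyperparameters above can be collected in a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical and are not taken from the MolScribe code. The derived per-head dimension follows from the standard multi-head attention convention of splitting the hidden size evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values come from the architecture list; field names are hypothetical.
    encoder: str = "swin_base"
    decoder_layers: int = 6
    hidden_size: int = 256
    attn_heads: int = 8
    input_size: int = 384

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits hidden_size evenly across the heads.
        if self.hidden_size % self.attn_heads != 0:
            raise ValueError("hidden_size must be divisible by attn_heads")
        return self.hidden_size // self.attn_heads


config = ModelConfig()
print(config.head_dim)  # 32
```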
Download the model checkpoint to reproduce our experiments:
mkdir -p ckpts
wget -P ckpts https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_200k.pth
python predict.py --model_path ckpts/swin_base_char_aux_200k.pth --image_path assets/example.png
The MolScribe prediction interface is in molscribe/interface.py. See the Python script predict.py or the Jupyter notebook notebook/predict.ipynb for example usage.
bash scripts/eval_uspto_joint_chartok.sh
The script uses one GPU and a batch size of 64 by default. If more GPUs are available, update NUM_GPUS_PER_NODE and BATCH_SIZE for faster evaluation.
bash scripts/train_uspto_joint_chartok.sh
The script uses four GPUs and a batch size of 256 by default. Training takes about one day on four A100 GPUs.
During training, we use a modified version of Indigo (included in molscribe/indigo/).
We implement a standalone evaluation script, evaluate.py. Example usage:
python evaluate.py \
--gold_file data/real/acs.csv \
--pred_file output/uspto/swin_base_char_aux_200k/prediction_acs.csv \
--pred_field post_SMILES
The predictions should be saved in a CSV file with two columns: image_id for the index (must match the gold file) and SMILES for the predicted SMILES. If the prediction column has a different name, specify it with --pred_field.
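A prediction file in this layout can be produced with Python's standard `csv` module. The sketch below is an assumption about one convenient way to do it, not part of the repository: `write_predictions` is a hypothetical helper, and the two example rows are made up for illustration.

```python
import csv
import io


def write_predictions(rows, handle, pred_field="SMILES"):
    """Write (image_id, smiles) pairs as CSV with image_id and prediction columns."""
    writer = csv.DictWriter(handle, fieldnames=["image_id", pred_field])
    writer.writeheader()
    for image_id, smiles in rows:
        writer.writerow({"image_id": image_id, pred_field: smiles})


# Made-up example predictions, for illustration only.
predictions = [(0, "CCO"), (1, "c1ccccc1")]
buffer = io.StringIO()
write_predictions(predictions, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`. If `pred_field` is changed from its default, pass the same name to `--pred_field` when running the evaluation.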
The result contains three scores:
- canon_smiles: our main metric, exact matching accuracy.
- graph: graph exact matching accuracy, ignoring tetrahedral chirality.
- chiral: exact matching accuracy on chiral molecules.
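To make "exact matching accuracy" concrete, here is a simplified sketch of the computation. It is not the logic of evaluate.py: the `canonicalize` argument defaults to plain whitespace stripping as a stand-in for real SMILES canonicalization (which would normally be done with a cheminformatics toolkit), and counting a missing prediction as incorrect is an assumption. The function name and the example data are hypothetical.

```python
def exact_match_accuracy(gold, pred, canonicalize=str.strip):
    """Fraction of gold entries whose canonicalized prediction matches exactly.

    gold and pred map image_id -> SMILES. A missing prediction counts as wrong
    (an assumption of this sketch).
    """
    if not gold:
        return 0.0
    correct = 0
    for image_id, gold_smiles in gold.items():
        pred_smiles = pred.get(image_id)
        if pred_smiles is not None and canonicalize(pred_smiles) == canonicalize(gold_smiles):
            correct += 1
    return correct / len(gold)


# Made-up data: one match, one differently written benzene, one missing prediction.
gold = {0: "CCO", 1: "c1ccccc1", 2: "CC(=O)O"}
pred = {0: "CCO", 1: "C1=CC=CC=C1"}
score = exact_match_accuracy(gold, pred)
print(score)
```

With the stand-in canonicalizer, the aromatic and Kekulé forms of benzene are counted as a mismatch even though they denote the same molecule, which is why a real canonicalization function should be supplied before comparing SMILES strings.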