diff --git a/README.md b/README.md
index 39d47bc..67e325f 100644
--- a/README.md
+++ b/README.md
@@ -42,30 +42,30 @@ ACEGEN provides tutorials for integrating custom models and custom scoring funct
 ---
-## Table of Contents
-- [Installation](#installation)
-    - [Conda environment and required dependencies](#conda-environment-and-required-dependencies)
-    - [Optional dependencies](#optional-dependencies)
-    - [Install ACEGEN](#install-acegen)
-- [Generating libraries of molecules](#generating-libraries-of-molecules)
-    - [Running training scripts to generate compound libraries](#running-training-scripts-to-generate-compound-libraries)
-    - [Alternative usage](#alternative-usage)
-- [Advanced usage](#advanced-usage)
-    - [Optimization of Hyperparameters in the Configuration Files](#optimization-of-hyperparameters-in-the-configuration-files)
-    - [Changing the scoring function](#changing-the-scoring-function)
-    - [Changing the policy prior](#changing-the-policy-prior)
-        - [Available models](#available-models)
-        - [Integration of custom models](#integration-of-custom-models)
-- [Results on the MolOpt benchmark](#results-on-the-molopt-benchmark)
-- [De Novo generation example: docking in the 5-HT2A](#de-novo-generation-example-docking-in-the-5-ht2a)
-- [Scaffold constrained generation example: BACE1 docking with AHC algorithm](#scaffold-constrained-generation-example-bace1-docking-with-ahc-algorithm)
-- [Citation](#citation)
+## Table of Contentsx
+1. [**Installation**](#1-Installation)
+    - [1.1. Conda environment and required dependencies](#11-conda-environment-and-required-dependencies)
+    - [1.2. Optional dependencies](#12-optional-dependencies)
+    - [1.3. Install ACEGEN](#13-install-acegen)
+2. [**Generating libraries of molecules**](#2-generating-libraries-of-molecules)
+    - [2.1. Running training scripts to generate compound libraries](#21-running-training-scripts-to-generate-compound-libraries)
+    - [2.2. Alternative usage](#22-alternative-usage)
+3. [**Advanced usage**](#3-advanced-usage)
+    - [3.1. Optimization of Hyperparameters in the Configuration Files](#31-optimization-of-hyperparameters-in-the-configuration-files)
+    - [3.2. Changing the scoring function](#32-changing-the-scoring-function)
+    - [3.3. Changing the policy prior](#33-changing-the-policy-prior)
+        - [3.3.1. Available models](#331-available-models)
+        - [3.3.2. Integration of custom models](#332-integration-of-custom-models)
+4. [**Results on the MolOpt benchmark**](#4-results-on-the-molopt-benchmark)
+5. [**De Novo generation example: docking in the 5-HT2A**](#5-de-novo-generation-example-docking-in-the-5-ht2a)
+6. [**Scaffold constrained generation example: BACE1 docking with AHC algorithm**](#6-scaffold-constrained-generation-example-bace1-docking-with-ahc-algorithm)
+7. [**Citation**](#7-citation)
 ---
-## Installation
+## 1. Installation
-### Conda environment and required dependencies
+### 1.1. Conda environment and required dependencies
 To create the conda / mamba environment, run
@@ -79,7 +79,7 @@ To install the required dependencies run the following commands. Replace `cu121`
 pip3 install torchrl
-### Optional dependencies
+### 1.2. Optional dependencies
 Unless you intend to define your own custom scoring functions, install MolScore by running
@@ -92,7 +92,7 @@ To use the scaffold decoration and fragment linking, install promptsmiles by run
 To learn how to configure constrained molecule generation with ACEGEN and promptsmiles, please refer to this [tutorial](tutorials/using_promptsmiles.md).
-### Install ACEGEN
+### 1.3. Install ACEGEN
 To install ACEGEN, run (use `pip install -e ./` for develop mode)
@@ -102,7 +102,7 @@ To install ACEGEN, run (use `pip install -e ./` for develop mode)
 ---
-## Generating libraries of molecules
+## 2. Generating libraries of molecules
 ACEGEN has multiple RL algorithms available, each in a different directory within the `acegen-open/scripts` directory.
 Each RL algorithm has three different generative modes of execution: de novo, scaffold decoration, and fragment linking.
@@ -112,7 +112,7 @@ While the default values in the configuration files are considered sensible, a d
 To customize the model architecture, refer to the [Changing the model architecture](##Changing the model architecture) section.
 To customize the scoring function, refer to the [Changing the scoring function](##Changing the scoring function) section.
-### Running training scripts to generate compoud libraries
+### 2.1. Running training scripts to generate compoud libraries
 To run the training scripts for denovo generation, run the following commands:
@@ -145,7 +145,7 @@ To run the training scripts for fragment linking, run the following commands (re
 python scripts/dpo/dpo.py --config-name config_linking
 python scripts/hill_climb/hill_climb.py --config-name config_linking
-### Alternative usage
+### 2.2. Alternative usage
 Scripts are also available as executables after installation, but both the path and name of the config must be specified. For example,
@@ -157,9 +157,9 @@ YAML config parameters can also be specified on the command line. For example,
 ---
-## Advanced usage
+## 3. Advanced usage
-### Optimization of Hyperparameters in the Configuration Files
+### 3.1. Optimization of hyperparameters in the configuration files
 The hyperparameters in the configuration files have sensible default values. However, the optimal choice of hyperparameters depends on various factors, including the scoring function and the network architecture. Therefore, it is very useful to have a way to automatically explore the space of hyperparameters.
@@ -170,7 +170,7 @@ To learn how to perform hyperparameter sweeps to find the best configuration for

-### Changing the scoring function
+### 3.2. Changing the scoring function
 To change the scoring function, the easiest option is to adjust the `molscore` parameters in the configuration files. Modifying these parameters allows to switch betwewn different scoring modes and scoring objecitves.
 Please refer to the `molscore` section in the configuration [tutorial](tutorials/breaking_down_configuration_files.md) for a more detailed explaination. Additionally, refer to the [tutorials](https://github.com/MorganCThomas/MolScore/tree/main/tutorials) in the MolScore repository.
@@ -178,9 +178,9 @@ Please refer to the `molscore` section in the configuration [tutorial](tutorials
 Alternatively, users can define their own custom scoring functions and use them in the ACEGEN scripts by following the instructions in this other [tutorial](tutorials/adding_custom_scoring_function.md).
-### Changing the policy prior
+### 3.3. Changing the policy prior
-#### Available models
+#### 3.3.1. Available models
 We provide a variety of default priors that can be selected in the configuration file. These include:
@@ -219,18 +219,15 @@ We provide a variety of default priors that can be selected in the configuration
 - number of parameters: 5,965,760
 - to select set the field `model` to `llama2` in any configuration file
-#### Integration of custom models
+#### 3.3.2. Integration of custom models
-Users can also combine their own custom models with ACEGEN.
-
-A detailed guide on integrating custom models can be found in this [tutorial](tutorials/adding_custom_model.md).
+Users can also combine their own custom models with ACEGEN. A detailed guide on integrating custom models can be found in this [tutorial](tutorials/adding_custom_model.md).
 ---
-## Results on the [MolOpt](https://arxiv.org/pdf/2206.12411.pdf) benchmark
+## 4. Results on the [MolOpt](https://arxiv.org/pdf/2206.12411.pdf) benchmark
-Algorithm comparison for the Area Under the Curve (AUC) of the top 100 molecules on MolOpt benchmark scoring functions.
-Each algorithm ran 5 times with different seeds, and results were averaged.
+Algorithm comparison for the Area Under the Curve (AUC) of the top 100 molecules on MolOpt benchmark scoring functions. Each algorithm ran 5 times with different seeds, and results were averaged.
 The default values for each algorithm are those in our de novo configuration files.
 Additionally, for Reinvent we also tested the configuration proposed in the MolOpt paper.
@@ -273,19 +270,19 @@ Additionally, for Reinvent we also tested the configuration proposed in the MolO
 ---
-## De Novo generation example: docking in the 5-HT2A
+## 5. De Novo generation example: docking in the 5-HT2A
 ![Alt Text](./acegen/images/acagen_de_novo.png)
 ---
-## Scaffold constrained generation example: BACE1 docking with AHC algorithm
+## 6. Scaffold constrained generation example: BACE1 docking with AHC algorithm
 ![Alt Text](./acegen/images/acegen_decorative.png)
 ---
-## Citation
+## 7. Citation
 If you use ACEGEN in your work, please refer to this BibTeX entry to cite it:
diff --git a/tests/test_tokenizers.py b/tests/test_tokenizers.py
index cca706c..1e2165b 100644
--- a/tests/test_tokenizers.py
+++ b/tests/test_tokenizers.py
@@ -1,4 +1,7 @@
 import pytest
+import warnings
+from functools import partial
+from rdkit.Chem import AllChem as Chem
 from acegen.vocabulary.tokenizers import (
     AISTokenizer,
@@ -8,6 +11,7 @@
     SMILESTokenizerChEMBL,
     SMILESTokenizerEnamine,
     SMILESTokenizerGuacaMol,
+    SmiZipTokenizer,
 )
 try:
@@ -51,8 +55,293 @@
     "CCN(CC)CC",  # Triethylamine (C6H15N)
     "CC(=O)OC(C)C",  # Diethyl carbonate (C7H14O3)
     "CC(C)C",  # Isobutane (C4H10)
-    "CC1=CC=CC=C1",  # Toluene (C7H8)
+    "Cc1ccccc1",  # Toluene (C7H8)
 ]
+ngrams = [
+    "(=O)",
+    "cc",
+    "[C@@H]",
+    "CC",
+    "[C@H]",
+    "(C",
+    "c1ccc",
+    "c2ccc",
+    ")c",
+    "\t",
+    "\n",
+    "C(=O)",
+    "c1",
+    "(C)",
+    "c3ccc",
+    "c2",
+    "O)",
+    "c(",
+    "C(F)(F)F",
+    "[nH]",
+    "C(=O)N",
+    "=C",
+    "CCC",
+    "c2ccccc2",
+    "[N+](=O)[O-])",
+    "(N",
+    "[C@",
+    "c1ccccc1",
+    "c3",
+    "OC",
+    "(Cl)c",
+    "2)",
+    " ",
+    "CCN",
+    "COc1cc",
+    "#",
+    "3)",
+    "%",
+    "(O)",
+    "NC(=O)[C@H](C",
+    "(",
+    ")",
+    "nc",
+    "+",
+    ")cc1",
+    "-",
+    ".",
+    "/",
+    "0",
+    "1",
+    "2",
+    "3",
+    "4",
+    "5",
+    "6",
+    "7",
+    "8",
+    "9",
+    ":",
+    "S(=O)(=O)",
+    "(C)C)",
+    "=",
+    "c4ccc",
+    "(F)c",
+    "@",
+    "A",
+    "B",
+    "C",
+    "C1",
+    "n1",
+    "F",
+    "(-",
+    "H",
+    "I",
+    "C(=O)N[C@@H](C",
+    "K",
+    "L",
+    "M",
+    "N",
+    "O",
+    "P",
+    "c3ccccc3",
+    "R",
+    "S",
+    "T",
+    "CN",
+    "CCNC(=N)N)",
+    "c1c",
+    "X",
+    ")C",
+    "Z",
+    "[",
+    "\\",
+    "]",
+    "O=C(",
+    "CO",
+    "/C=C",
+    "a",
+    "b",
+    "c",
+    "c2c",
+    "e",
+    "C(N)=O",
+    "g",
+    ")N",
+    "i",
+    "(OC)c",
+    "C2",
+    "l",
+    "Cc1cc",
+    "n",
+    "o",
+    "p",
+    "[C@@]",
+    "r",
+    "s",
+    "t",
+    "n2",
+    "1C",
+    "=O",
+    "CCN(C",
+    "CC1",
+    ")n",
+    "C(",
+    "NC(=O)",
+    "[C@H](",
+    "c(-",
+    "[O-])",
+    "2C",
+    "[C@@H](",
+    "C(=O)O",
+    "c1n",
+    ")cc",
+    "cn",
+    "c(N",
+    "c1ccc(",
+    "c1ccc2c(c1)",
+    "c2n",
+    "CC(C)",
+    "[C@]",
+    "CCCCC",
+    "Cl",
+    "cc2",
+    "c4ccccc4",
+    ")c1",
+    "CCO",
+    "c4",
+    "(Br)c",
+    "[C@H]1",
+    "c(C",
+    "C(C)",
+    "[N+]",
+    "cc1",
+    "=N",
+    "CN(",
+    "OP(=O)(O)OC[C@H]3O[C@@H](n4cc(C)c(=O)[nH]c4=O)C[C@@H]3",
+    "(C#N)",
+    "c(=O)",
+    "[C@@H]1",
+    "C3",
+    "=O)",
+    "2)cc1",
+    "[n+]",
+    "1)",
+    "C(=O)C",
+    "c2ccc(",
+    "N1CC",
+    ")CC",
+    "cc(",
+    ".[Na+]",
+    "(C)C",
+    "c(Cl)c",
+    "C)",
+    "c2ccccc2)",
+    "[C@H](O)",
+    ")cc2)",
+    "(F)",
+    "[C@H]2",
+    "CC(=O)N",
+    "c5ccc",
+    "[C@@H]2",
+    "c3c",
+    "[C@@H](O)",
+    "C(=O)O)",
+    "CCCN",
+    "c(O",
+    "c2c1",
+    "/C=C/",
+    "O=C1",
+    "C=C",
+    "N2CC",
+    "c2ccc3c(c2)",
+    "c(C(F)(F)F)c",
+    "c1ccccc1)",
+    "3C",
+    "c3n",
+    "CC2)",
+    "4)",
+    "N=C",
+    "nn",
+    "cc3",
+    "[C@H](C",
+    "C(C)C)",
+    "c1c[nH]c2ccccc12)",
+    "O=C(N",
+    "C(=O)NC",
+    "c(F)c",
+    "C(=S)N",
+    "c1ccc(Cl)cc1",
+    "CC3)",
+    "Cc1",
+    "P(=O)(O)O",
+    "N1CCN(",
+    "(CC",
+    "(N)",
+    "c(O)c",
+    "c2ccc(Cl)cc2",
+    "nc2",
+    "COC(=O)",
+    "Cn1c",
+    "Cl)",
+    "c1ccc(F)cc1",
+    "/C(=C",
+    ")O[C@@H]3COP(=O)(S)O[C@H]3[C@@H](O)[C@H](n4c",
+    "O=C(O)",
+    "[C@@H](C)",
+    "COc1ccc(",
+    "C(N)=O)",
+    "S(C)(=O)=O",
+    "c(-c3cc",
+    "n3",
+    "OP(O)(=S)OC[C@H]",
+    "c3ccccc3)",
+    "[C@H](C)",
+    "[C@@H]3",
+    "S(=O)(=O)N",
+    "nc1",
+    "CS",
+    "c(-c2cc",
+    "CN1C(=O)",
+    "c1ccc(O)cc1)",
+    "C(=O)N1CCC",
+    "c(C)c",
+    "c3ccccc23)",
+    "c2ccc(F)cc2)",
+    "nc(N",
+    "C[C@H](",
+    "OC(C)=O)",
+    "CC(",
+    "CC[C@]",
+    "NC(=O)[C@@H](",
+    "c2ccccc12",
+    "ccc1",
+    "c(C(=O)N",
+    "F)",
+    "O=",
+]
+SmiZipTokenizer = partial(SmiZipTokenizer, ngrams=ngrams)
+setattr(SmiZipTokenizer, "__name__", "SmiZipTokenizer")
+
+
+def smiles_eq(smi1, smi2):
+    mol1 = Chem.MolFromSmiles(smi1)
+    mol2 = Chem.MolFromSmiles(smi2)
+    # Parse them
+    if not mol1:
+        return False, f"Parsing error: {smi1}"
+    if not mol2:
+        return False, f"Parsing error: {smi2}"
+    # Remove atom map
+    for mol in [mol1, mol2]:
+        for atom in mol.GetAtoms():
+            atom.SetAtomMapNum(0)
+    # Check smiles are the same
+    nsmi1 = Chem.MolToSmiles(mol1)
+    nsmi2 = Chem.MolToSmiles(mol2)
+    if nsmi1 != nsmi2:
+        return False, f"Inequivalent SMILES: {nsmi1} vs {nsmi2}"
+    # Check InChi
+    inchi1 = Chem.MolToInchi(mol1)
+    inchi2 = Chem.MolToInchi(mol2)
+    if inchi1 != inchi2:
+        return False, "Inequivalent InChi's"
+    return True, ""
 @pytest.mark.parametrize(
@@ -67,8 +356,9 @@
         (SMILESTokenizerChEMBL, True, None),
         (SMILESTokenizerEnamine, True, None),
         (SMILESTokenizerGuacaMol, True, None),
-        # (AISTokenizer, _has_AIS, AIS_ERR if not _has_AIS else None),
-        # (SAFETokenizer, _has_SAFE, SAFE_ERR if not _has_SAFE else None),
+        (AISTokenizer, _has_AIS, AIS_ERR if not _has_AIS else None),
+        (SAFETokenizer, _has_SAFE, SAFE_ERR if not _has_SAFE else None),
+        (SmiZipTokenizer, _has_smizip, SMIZIP_ERR if not _has_smizip else None),
     ],
 )
 def test_smiles_based_tokenizers(tokenizer, available, error):
@@ -83,4 +373,7 @@ def test_smiles_based_tokenizers(tokenizer, available, error):
         assert isinstance(tokens, list)
         assert isinstance(tokens[0], str)
         decoded_smiles = t.untokenize(tokens)
-        assert decoded_smiles == smiles
+        eq, err = smiles_eq(decoded_smiles, smiles)
+        assert eq, err
+        if decoded_smiles != smiles:
+            warnings.warn(f"{tokenizer.__name__} behaviour: {smiles} -> {decoded_smiles}")
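
The test diff wraps `SmiZipTokenizer` in `functools.partial` and then sets `__name__` on the result, since the new warning message reads `tokenizer.__name__`. The pattern can be sketched in isolation as below. `ToyTokenizer` and its greedy longest-match scheme are hypothetical stand-ins chosen for illustration; they are not the SmiZip implementation or part of ACEGEN's API.

```python
from functools import partial


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer that needs a constructor argument."""

    def __init__(self, ngrams):
        # Try longer n-grams first so the greedy match prefers them.
        self.ngrams = sorted(ngrams, key=len, reverse=True)

    def tokenize(self, smiles):
        tokens, i = [], 0
        while i < len(smiles):
            # Greedy longest match; fall back to a single character.
            match = next(
                (g for g in self.ngrams if smiles.startswith(g, i)), smiles[i]
            )
            tokens.append(match)
            i += len(match)
        return tokens

    def untokenize(self, tokens):
        return "".join(tokens)


# Bind the constructor argument so a parametrized test can call tokenizer()
# with no arguments, like the other tokenizer classes.
BoundTokenizer = partial(ToyTokenizer, ngrams=["c1ccccc1", "CC", "C", "c", "1"])

# partial objects do not get __name__ automatically ...
has_name_before = hasattr(BoundTokenizer, "__name__")

# ... but they accept arbitrary attributes, so a name can be attached.
setattr(BoundTokenizer, "__name__", "ToyTokenizer")

t = BoundTokenizer()
tokens = t.tokenize("Cc1ccccc1")
print(BoundTokenizer.__name__, tokens, t.untokenize(tokens))
```

Setting the attribute on the `partial` object keeps the parametrized test uniform: every entry can be called with no arguments and exposes a `__name__` for messages.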