Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: align MolOpt functions with GuacaMol implementation #47

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

AustinT
Copy link

@AustinT AustinT commented May 19, 2024

My testing showed that MolScore's version of some of the MolOpt (GuacaMol) functions differs from that of the original package. This PR applies some (partial) fixes.

  1. Fingerprint size: the original GuacaMol functions use a $2^{32}$ bit (sparse) fingerprints, whereas the implementations in MolScore use a 1024 bit (dense) fingerprint. This causes all similarity values to be slightly overestimated. I increased the fingerprint length to 16384 to reduce the overestimation.
  2. Fix Deco hop threshold: the implementation here used a threshold of 0.75, but the paper and code use 0.85.
  3. Align some MPO modifiers with GuacaMol's code in instances where paper and code differ. I didn't realize that GuacaMol's official implementation of their functions differs from the paper in several places. In these instances, I think it is better to side with their official code, because this is what prior works have used.
    • Fexofenadine_MPO: the paper specifies an STD of 2 for the TPSA and logP modifies, but the code uses 10 and 1 respectively. This causes scores to generally be a lower in GuacaMol's implementation compared to MolScore.
    • Osimertinib_MPO: sigmas are also different (original code)
  4. Fixed SMARTS string in Scaffold hop (in response to Bug in scaffold hop task from PMO benchmark #45 )
  5. Fixed bug in isomer similarity calculation: the GuacaMol paper and code specify that the element-wise difference is taken with respect to elements in the target molecule only, while MolScore's implementation uses both the target and query molecules. A simple 1-line change fixes this.
  6. Fixed fingerprint lengths in legacy QSAR: JNK3 and GSK3B both use radius 2 fingerprints, not radius 3.

Also, I removed a spurious resource called molscore.configs.MolOpt-DF which does not exist in the repo (maybe it exists for you locally??)

Currently I have this PR marked as a draft. This is because

  1. The functions still don't seem to be 100% aligned (I still found small differences in Sitagliptin MPO, Valsartan SMARTS, and scaffold hop)
  2. If these fixes are accepted, they should be made in other places in the repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant