Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching #1

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Matching #1

wants to merge 5 commits into from

Conversation

MAlberts99
Copy link
Collaborator

The PR adds the following functionality to the NMR Matching data generation script:

  • Creating randomly drawn sets of molecules from the data. The number of molecules per set is varied in a range from 2-7.
  • Create Sets with a given Tanimoto similarity, i.e. the similarity of all molecules in the set falls in an Interval of Tanimoto similarities
  • Sets where there is no match between the set and the spectra

@MAlberts99 MAlberts99 requested a review from avaucher September 4, 2023 08:11
@@ -337,7 +337,7 @@ def main(
# Load dataframe containing nmr data
logger.info("Reading data.")
nmr_df = pd.read_pickle(nmr_data)
nmr_df.drop(columns=["1H_NMR_exp", "13C_NMR_exp"], inplace=True)
#nmr_df.drop(columns=["1H_NMR_exp", "13C_NMR_exp"], inplace=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not necessary anymore: delete it. Otherwise, you can specify why it is commented out as a comment on the line before

Copy link
Member

@avaucher avaucher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

There's no real comment on things that have to change; it looks like everything is fine as it is. Most of the comments are rather as hints on things I would do differently in the future.

For instance, for type annotations I would always use List[SomeValue] instead of list, Dict[str, SomeValue] instead of dict, etc. It makes it more readable, as people know directly what format is expected / returned by a function.

Also, in general, when your functions become quite long with several levels of indentations, it can be a sign that a class can be more adequate - this leads to smaller functions and usually makes the code more readable (if the function names are descriptive).

RANDOM_SEED = 3246
DEFAULT_SEED = 3246
DEFAULT_NON_MATCHING_TOKEN = "<no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match>"
DEFAULT_ALLOWED_ELEMENTS = set(["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DEFAULT_ALLOWED_ELEMENTS = set(["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"])
DEFAULT_ALLOWED_ELEMENTS = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"}

slightly more efficient

Comment on lines +32 to +33
input_data: Any, test_size: float = 0.1, val_size: float = 0.05
) -> Tuple[Any, Any, Any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can: avoid Any. Would it be pd.DataFrame here?

Comment on lines +35 to +38
input_data, test_size=test_size, random_state=DEFAULT_SEED
)
train_data, val_data = train_test_split(
train_data, test_size=0.05, random_state=RANDOM_SEED
train_data, test_size=val_size, random_state=DEFAULT_SEED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make your function slightly more reusable: you can add an argument for the random seed to the function, , random_seed=DEFAULT_SEED) -> ...

)

return (train_data, test_data, val_data)


def evaluate_molecule(smiles: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a short docstring to clarify what it means to "evaluate" a molecule, and what True/False as a return value would mean.


formula = rdMolDescriptors.CalcMolFormula(mol)
atoms = re.findall(r"[A-Z][a-z]*\d*", formula)
atoms_clean = set([re.sub(r"\d+", "", atom) for atom in atoms])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
atoms_clean = set([re.sub(r"\d+", "", atom) for atom in atoms])
atoms_clean = {re.sub(r"\d+", "", atom) for atom in atoms}

nmr_string = " " + cnmr_string

return nmr_string
return combined_df


def make_nmr_rxn_set(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is quite long and a bit complex; if you were to refactor some of the code, this is a typical one that I would try to encapsulate in a class with small modular functions

Comment on lines +155 to +156
if nmr_input.count(" ") + 1 > DEFAULT_MAX_SEQ_LEN:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a small comment to say (if I understand correctly) that too long sequences are ignored?

Depending on how often this happens, you may want to consider adding a logger.info or so

mols_sample = tuple(mols)

src_tgt_pairs = list()
src_tgt_pairs: List[dict] = list()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the content to be a dictionary? If not, List[Tuple[str, str]] may be more appropriate.

rxn = rxn.replace(">", ".").replace("..", "")
idx = 0
pbar = tqdm(total=n_samples)
while len(src_tgt_pairs) < n_samples:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this while loop looks like something that wants to be a for loop instead. I am not sure if you actually need the idx variable.

You can do a break when the size reaches the desired one, and do the check for "Ran out of reactions to consider before reaching desired training samples" outside of the for loop.

Comment on lines +210 to +219
# Default mol_distribution to cover equal number of examples for the different set sizes, if non_matching is set to true -> divide by two as two examples will be added per pass
mol_distribution = [
[
i,
int(n_samples / ((n_max_mol_set_size - 1) * 2))
if non_matching
else int(n_samples / (n_max_mol_set_size - 1)),
]
for i in range(2, n_max_mol_set_size + 1)
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what is happening here. What does mol_distribution contain after that?

You can consider putting this in a separate function.

you can then probably put int(n_samples / ((n_max_mol_set_size - 1) * 2) in a separate variable before the for loop, with a descriptive name? It does not need to be evaluated at every iteration of the loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants