Matching #1

MAlberts99 · 2023-09-04T08:11:09Z

The PR adds the following functionality to the NMR Matching data generation script:

Create randomly drawn sets of molecules from the data. The number of molecules per set varies from 2 to 7.
Create sets with a given Tanimoto similarity, i.e. the similarity of all molecules in the set falls within a given interval of Tanimoto similarities.
Create sets in which there is no match between the set and the spectra.
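
As a rough illustration of the similarity constraint, the sketch below checks whether every pair in a candidate set has a Tanimoto similarity inside a given interval. It makes several assumptions: fingerprints are represented as plain sets of on-bit indices, the constraint is read as pairwise, and the names tanimoto and set_within_interval are hypothetical and not taken from the PR.

```python
from itertools import combinations
from typing import FrozenSet, Sequence, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints share nothing; treat their similarity as 0.
    return len(a & b) / union if union else 0.0


def set_within_interval(
    fingerprints: Sequence[FrozenSet[int]], interval: Tuple[float, float]
) -> bool:
    """Return True if all pairwise similarities fall inside [low, high]."""
    low, high = interval
    return all(
        low <= tanimoto(a, b) <= high for a, b in combinations(fingerprints, 2)
    )
```

For example, two fingerprints {1, 2, 3} and {2, 3, 4} share two of four distinct bits, so their similarity is 0.5 and the pair is accepted for the interval (0.4, 0.6).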

avaucher · 2023-09-06T10:50:08Z

src/nmr_to_structure/prepare_input/prepare_nmr_rxn_input.py

@@ -337,7 +337,7 @@ def main(
    # Load dataframe containing nmr data
    logger.info("Reading data.")
    nmr_df = pd.read_pickle(nmr_data)
-    nmr_df.drop(columns=["1H_NMR_exp", "13C_NMR_exp"], inplace=True)
+    #nmr_df.drop(columns=["1H_NMR_exp", "13C_NMR_exp"], inplace=True)


If it is no longer necessary, delete it. Otherwise, add a comment on the preceding line explaining why it is commented out.

avaucher

Thanks for the PR!

Nothing here strictly has to change; everything looks fine as it is. Most of the comments are hints on things I would do differently in the future.

For instance, for type annotations I would always use List[SomeValue] instead of list, Dict[str, SomeValue] instead of dict, etc. It makes it more readable, as people know directly what format is expected / returned by a function.
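
A small sketch of the difference, using a hypothetical helper count_tokens that is not part of the PR. With the parametrized annotations, the signature alone tells the reader that the function takes a list of strings and returns a mapping from token to count.

```python
from typing import Dict, List


def count_tokens(lines: List[str]) -> Dict[str, int]:
    """Count how often each space-separated token occurs across all lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        for token in line.split():
            counts[token] = counts.get(token, 0) + 1
    return counts
```

Annotated with bare list and dict, the same function would leave the element and key types for the reader to work out from the body.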

Also, in general, when your functions become quite long with several levels of indentation, it can be a sign that a class would be more appropriate. This leads to smaller functions and usually makes the code more readable (if the function names are descriptive).

avaucher · 2023-09-06T10:51:39Z

src/nmr_to_structure/prepare_input/nmr_utils.py

-RANDOM_SEED = 3246
+DEFAULT_SEED = 3246
+DEFAULT_NON_MATCHING_TOKEN = "<no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match>"
+DEFAULT_ALLOWED_ELEMENTS = set(["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"])


Suggested change

DEFAULT_ALLOWED_ELEMENTS = set(["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"])

DEFAULT_ALLOWED_ELEMENTS = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"}

Slightly more efficient: the set literal avoids building an intermediate list.

avaucher · 2023-09-06T10:52:31Z

src/nmr_to_structure/prepare_input/nmr_utils.py

+    input_data: Any, test_size: float = 0.1, val_size: float = 0.05
+) -> Tuple[Any, Any, Any]:


If you can: avoid Any. Would it be pd.DataFrame here?

avaucher · 2023-09-06T10:53:34Z

src/nmr_to_structure/prepare_input/nmr_utils.py

+        input_data, test_size=test_size, random_state=DEFAULT_SEED
    )
    train_data, val_data = train_test_split(
-        train_data, test_size=0.05, random_state=RANDOM_SEED
+        train_data, test_size=val_size, random_state=DEFAULT_SEED


To make your function slightly more reusable, you can add a random seed argument to its signature: (..., random_seed=DEFAULT_SEED) -> ...
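
A minimal sketch of what such a signature could look like. To stay self-contained it shuffles with the standard library instead of calling train_test_split, so it is a stand-in and not the PR's implementation; the function name split_data and the generic item type are assumptions. As in the diff, the validation fraction is taken from what remains after the test split.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

DEFAULT_SEED = 3246


def split_data(
    items: Sequence[T],
    test_size: float = 0.1,
    val_size: float = 0.05,
    random_seed: int = DEFAULT_SEED,
) -> Tuple[List[T], List[T], List[T]]:
    """Split items into (train, test, val) reproducibly for a given seed."""
    shuffled = list(items)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(random_seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_size)
    test, remaining = shuffled[:n_test], shuffled[n_test:]

    # val_size is a fraction of the remaining (non-test) data.
    n_val = int(len(remaining) * val_size)
    val, train = remaining[:n_val], remaining[n_val:]
    return train, test, val
```

Callers that do not care about the seed keep the default behaviour, while experiments that need a different split can pass random_seed explicitly.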

avaucher · 2023-09-06T10:54:12Z

src/nmr_to_structure/prepare_input/nmr_utils.py

    )

    return (train_data, test_data, val_data)


+def evaluate_molecule(smiles: str) -> bool:


I would add a short docstring to clarify what it means to "evaluate" a molecule, and what True/False as a return value would mean.

avaucher · 2023-09-06T10:54:59Z

src/nmr_to_structure/prepare_input/nmr_utils.py

+
+    formula = rdMolDescriptors.CalcMolFormula(mol)
+    atoms = re.findall(r"[A-Z][a-z]*\d*", formula)
+    atoms_clean = set([re.sub(r"\d+", "", atom) for atom in atoms])


Suggested change

atoms_clean = set([re.sub(r"\d+", "", atom) for atom in atoms])

atoms_clean = {re.sub(r"\d+", "", atom) for atom in atoms}
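
A sketch of how the set comprehension and a clarifying docstring might fit together. It works on a molecular formula string so that it runs without RDKit. That the check compares the elements against DEFAULT_ALLOWED_ELEMENTS is an assumption based on the constants in this diff, and the function names are hypothetical.

```python
import re
from typing import Set

DEFAULT_ALLOWED_ELEMENTS = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"}


def elements_in_formula(formula: str) -> Set[str]:
    """Return the element symbols in a molecular formula such as 'C2H5ClO'."""
    # Each match is an element symbol optionally followed by its count.
    atoms = re.findall(r"[A-Z][a-z]*\d*", formula)
    # Strip the counts, keeping only the symbols.
    return {re.sub(r"\d+", "", atom) for atom in atoms}


def has_only_allowed_elements(
    formula: str, allowed: Set[str] = DEFAULT_ALLOWED_ELEMENTS
) -> bool:
    """Check the elements of a formula against an allow-list.

    Returns True if every element in the formula is in `allowed`,
    False if at least one element is not.
    """
    return elements_in_formula(formula) <= allowed
```

A docstring of this kind on evaluate_molecule would state the criterion being checked and the meaning of each return value in the same way.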

avaucher · 2023-09-06T11:27:05Z

src/nmr_to_structure/prepare_input/prepare_nmr_rxn_input.py

-        nmr_string = " " + cnmr_string
-
-    return nmr_string
+    return combined_df


 def make_nmr_rxn_set(


this function is quite long and a bit complex; if you were to refactor some of the code, this is a typical one that I would try to encapsulate in a class with small modular functions

avaucher · 2023-09-06T11:28:56Z

src/nmr_to_structure/prepare_input/prepare_nmr_rxn_input.py

+            if nmr_input.count(" ") + 1 > DEFAULT_MAX_SEQ_LEN:
+                continue


Add a small comment to say (if I understand correctly) that too long sequences are ignored?

Depending on how often this happens, you may want to log it, e.g. with logger.info.
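
One possible shape for this, as a sketch: count the skipped sequences and log the total once after the loop, which avoids one log line per skipped item. The function name drop_long_sequences and the max_seq_len parameter are hypothetical; the length check itself is the one from the diff.

```python
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)


def drop_long_sequences(nmr_inputs: Iterable[str], max_seq_len: int) -> List[str]:
    """Keep only the inputs with at most max_seq_len space-separated tokens."""
    kept: List[str] = []
    n_skipped = 0
    for nmr_input in nmr_inputs:
        # Tokens are separated by single spaces, so tokens = spaces + 1.
        if nmr_input.count(" ") + 1 > max_seq_len:
            # Sequences that are too long are ignored.
            n_skipped += 1
            continue
        kept.append(nmr_input)

    logger.info(
        "Skipped %d sequences longer than %d tokens.", n_skipped, max_seq_len
    )
    return kept
```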

avaucher · 2023-09-06T11:29:58Z

src/nmr_to_structure/prepare_input/prepare_nmr_rxn_input.py

    mols_sample = tuple(mols)

-    src_tgt_pairs = list()
+    src_tgt_pairs: List[dict] = list()


Do you need the content to be a dictionary? If not, List[Tuple[str, str]] may be more appropriate.

avaucher · 2023-09-06T11:34:53Z

src/nmr_to_structure/prepare_input/prepare_nmr_rxn_input.py

-        rxn = rxn.replace(">", ".").replace("..", "")
+    idx = 0
+    pbar = tqdm(total=n_samples)
+    while len(src_tgt_pairs) < n_samples:


this while loop looks like something that wants to be a for loop instead. I am not sure if you actually need the idx variable.

You can do a break when the size reaches the desired one, and do the check for "Ran out of reactions to consider before reaching desired training samples" outside of the for loop.
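
A sketch of the suggested restructuring. The names collect_pairs, candidates and make_pair are hypothetical, and raising a ValueError when the candidates run out is an assumption about how the shortfall should be reported.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def collect_pairs(
    candidates: Iterable[str],
    make_pair: Callable[[str], Optional[Tuple[str, str]]],
    n_samples: int,
) -> List[Tuple[str, str]]:
    """Collect up to n_samples (source, target) pairs from the candidates."""
    src_tgt_pairs: List[Tuple[str, str]] = []
    for candidate in candidates:
        # Stop as soon as the desired number of samples is reached.
        if len(src_tgt_pairs) >= n_samples:
            break
        pair = make_pair(candidate)
        if pair is None:
            # Candidates that cannot be converted are skipped.
            continue
        src_tgt_pairs.append(pair)

    # The shortfall check lives outside the loop.
    if len(src_tgt_pairs) < n_samples:
        raise ValueError(
            "Ran out of reactions to consider before reaching desired training samples"
        )
    return src_tgt_pairs
```

Iterating directly over the candidates removes the manually incremented index, and the loop body no longer has to guard against running past the end of the data.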

avaucher · 2023-09-06T11:41:03Z

src/nmr_to_structure/prepare_input/prepare_nmr_rxn_input.py

+    # Default mol_distribution to cover equal number of examples for the different set sizes, if non_matching is set to true -> divide by two as two examples will be added per pass
+    mol_distribution = [
+        [
+            i,
+            int(n_samples / ((n_max_mol_set_size - 1) * 2))
+            if non_matching
+            else int(n_samples / (n_max_mol_set_size - 1)),
+        ]
+        for i in range(2, n_max_mol_set_size + 1)
+    ]


I don't really understand what is happening here. What does mol_distribution contain after that?

You can consider putting this in a separate function.

You can then probably put int(n_samples / ((n_max_mol_set_size - 1) * 2)) in a separate variable before the for loop, with a descriptive name. It does not need to be evaluated at every iteration of the loop.
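
A sketch of how this could look as a separate function, with the per-size count computed once under a descriptive name. The function name make_mol_distribution is hypothetical, and returning (set size, number of samples) tuples instead of two-element lists is an illustrative choice. It assumes n_max_mol_set_size is at least 2.

```python
from typing import List, Tuple


def make_mol_distribution(
    n_samples: int, n_max_mol_set_size: int, non_matching: bool
) -> List[Tuple[int, int]]:
    """Spread n_samples evenly over the set sizes 2..n_max_mol_set_size.

    Returns one (set_size, n_samples_for_that_size) tuple per set size.
    """
    # Set sizes run from 2 to n_max_mol_set_size inclusive.
    n_set_sizes = n_max_mol_set_size - 1
    # With non_matching enabled, each pass adds two examples instead of one.
    examples_per_pass = 2 if non_matching else 1
    samples_per_set_size = n_samples // (n_set_sizes * examples_per_pass)

    return [
        (set_size, samples_per_set_size)
        for set_size in range(2, n_max_mol_set_size + 1)
    ]
```

For n_samples=1200 and n_max_mol_set_size=7 there are six set sizes, so each size gets 200 samples, or 100 when non_matching is set.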

Marvin-Alberts added 2 commits September 2, 2023 13:42

new matching script

e01007e

small fixes

f92b590

MAlberts99 requested a review from avaucher September 4, 2023 08:11

MAlberts added 2 commits September 5, 2023 04:52

produce cleaner rxn set

1236c86

small fixes

7821378

avaucher reviewed Sep 6, 2023


fixes

29c9f06
