-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Matching #1
base: main
Are you sure you want to change the base?
Matching #1
Conversation
@@ -337,7 +337,7 @@ def main( | |||
# Load dataframe containing nmr data | |||
logger.info("Reading data.") | |||
nmr_df = pd.read_pickle(nmr_data) | |||
nmr_df.drop(columns=["1H_NMR_exp", "13C_NMR_exp"], inplace=True) | |||
#nmr_df.drop(columns=["1H_NMR_exp", "13C_NMR_exp"], inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not necessary anymore: delete it. Otherwise, you can specify why it is commented out as a comment on the line before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
There's no real comment on things that have to change; it looks like everything is fine as it is. Most of the comments are rather as hints on things I would do differently in the future.
For instance, for type annotations I would always use List[SomeValue]
instead of list
, Dict[str, SomeValue]
instead of dict
, etc. It makes it more readable, as people know directly what format is expected / returned by a function.
Also, in general, when your functions become quite long with several levels of indentations, it can be a sign that a class can be more adequate - this leads to smaller functions and usually makes the code more readable (if the function names are descriptive).
RANDOM_SEED = 3246 | ||
DEFAULT_SEED = 3246 | ||
DEFAULT_NON_MATCHING_TOKEN = "<no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match> <no_match>" | ||
DEFAULT_ALLOWED_ELEMENTS = set(["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEFAULT_ALLOWED_ELEMENTS = set(["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"]) | |
DEFAULT_ALLOWED_ELEMENTS = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"} |
slightly more efficient
input_data: Any, test_size: float = 0.1, val_size: float = 0.05 | ||
) -> Tuple[Any, Any, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can: avoid Any
. Would it be pd.DataFrame
here?
input_data, test_size=test_size, random_state=DEFAULT_SEED | ||
) | ||
train_data, val_data = train_test_split( | ||
train_data, test_size=0.05, random_state=RANDOM_SEED | ||
train_data, test_size=val_size, random_state=DEFAULT_SEED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make your function slightly more reusable: you can add an argument for the random seed to the function, , random_seed=DEFAULT_SEED) -> ...
) | ||
|
||
return (train_data, test_data, val_data) | ||
|
||
|
||
def evaluate_molecule(smiles: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a short docstring to clarify what it means to "evaluate" a molecule, and what True/False as a return value would mean.
|
||
formula = rdMolDescriptors.CalcMolFormula(mol) | ||
atoms = re.findall(r"[A-Z][a-z]*\d*", formula) | ||
atoms_clean = set([re.sub(r"\d+", "", atom) for atom in atoms]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
atoms_clean = set([re.sub(r"\d+", "", atom) for atom in atoms]) | |
atoms_clean = {re.sub(r"\d+", "", atom) for atom in atoms} |
nmr_string = " " + cnmr_string | ||
|
||
return nmr_string | ||
return combined_df | ||
|
||
|
||
def make_nmr_rxn_set( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is quite long and a bit complex; if you were to refactor some of the code, this is a typical one that I would try to encapsulate in a class with small modular functions
if nmr_input.count(" ") + 1 > DEFAULT_MAX_SEQ_LEN: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a small comment to say (if I understand correctly) that too long sequences are ignored?
Depending on how often this happens, you may want to consider adding a logger.info
or so
mols_sample = tuple(mols) | ||
|
||
src_tgt_pairs = list() | ||
src_tgt_pairs: List[dict] = list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need the content to be a dictionary? If not, List[Tuple[str, str]]
may be more appropriate.
rxn = rxn.replace(">", ".").replace("..", "") | ||
idx = 0 | ||
pbar = tqdm(total=n_samples) | ||
while len(src_tgt_pairs) < n_samples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this while
loop looks like something that wants to be a for
loop instead. I am not sure if you actually need the idx
variable.
You can do a break
when the size reaches the desired one, and do the check for "Ran out of reactions to consider before reaching desired training samples" outside of the for loop.
# Default mol_distribution to cover equal number of examples for the different set sizes, if non_matching is set to true -> divide by two as two examples will be added per pass | ||
mol_distribution = [ | ||
[ | ||
i, | ||
int(n_samples / ((n_max_mol_set_size - 1) * 2)) | ||
if non_matching | ||
else int(n_samples / (n_max_mol_set_size - 1)), | ||
] | ||
for i in range(2, n_max_mol_set_size + 1) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand what is happening here. What does mol_distribution
contain after that?
You can consider putting this in a separate function.
you can then probably put int(n_samples / ((n_max_mol_set_size - 1) * 2)
in a separate variable before the for loop, with a descriptive name? It does not need to be evaluated at every iteration of the loop.
The PR adds the following functionality to the NMR Matching data generation script: