fix #104 Add check for dot symbol and warn user #110

vandrw · 2023-11-15T19:37:00Z

When using the results of the SELFIES library for an ML model, I encountered some issues while preparing the data for training. If one uses get_alphabet_from_selfies to generate a list of tokens and then encodes each input against it with selfies_to_encoding, it is possible to get a KeyError for the dot symbol. More details in #104.

This can be a bit unintuitive, because the KeyError can occur even when the vocabulary was generated from the same dataset that is fed to the encoding function.

With this pull request, I suggest adding a more descriptive error for this case, which could help users that are not very knowledgeable in chemistry understand what can be done.

MarioKrenn6240 · 2023-11-19T00:57:16Z

One question on the PR (which might change variable types and thus introduce unexpected behaviours): Why do you recast char_list as a list? char_list = list(split_selfies(selfies))

vandrw · 2023-11-19T10:41:23Z

As it currently is, char_list is a generator, which will be consumed when we search for the "." character. If the type is important, we could perform the line below in a try-except statement. This should also improve speed for long strings, as we are not evaluating twice. I will draft another commit shortly with this new version.

selfies/selfies/utils/encoding_utils.py

Line 51 in 120b776

integer_encoded = [vocab_stoi[char] for char in char_list]
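
To illustrate the point about the generator being consumed, here is a minimal sketch. It does not import the selfies package: split_tokens is a hypothetical stand-in for split_selfies that only handles bracketed symbols and the "." separator, so treat it as an assumption rather than the library's behaviour.

```python
def split_tokens(selfies):
    """Hypothetical stand-in for split_selfies: yields '[...]' symbols and '.'."""
    i = 0
    while i < len(selfies):
        if selfies[i] == ".":
            yield "."
            i += 1
        else:
            j = selfies.index("]", i) + 1  # end of the current bracketed symbol
            yield selfies[i:j]
            i = j


# Membership test on a generator consumes it up to and including the match.
tokens = split_tokens("[C][O].[N]")
has_dot = "." in tokens
remaining = list(tokens)  # only the symbols after the "." are left

# Casting to a list first keeps every symbol available for the later encoding step.
tokens_list = list(split_tokens("[C][O].[N]"))
has_dot_list = "." in tokens_list
full = list(tokens_list)
```

With the generator, the list comprehension that follows the check would silently encode only the symbols after the dot, which is why the cast (or a single pass) is needed.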

alstonlo · 2023-11-21T20:38:30Z

selfies/utils/encoding_utils.py

+    try:
+        integer_encoded = [vocab_stoi[char] for char in char_list]
+    except KeyError as e:
+        if e.args[0] == ".":


Thanks for this PR! Sorry, I would prefer to merge the first commit since it reads more clearly!

    char_list = list(split_selfies(selfies))
    # Check if SELFIES string contains unconnected molecules
    if "." in list(char_list) and not "." in vocab_stoi:
        raise ValueError(
            "The SELFIES string contains two unconnected molecules "
            "(given by the '.' character), but vocab_stoi does not "
            "contain the '.' key. Please add it or separate the molecules."
        )
    integer_encoded = [vocab_stoi[char] for char in char_list]

I agree that the first cast to char_list is necessary, but I think there is a redundant one in the if-statement. Alternatively, maybe a one-pass approach as follows could work:

    integer_encoded = []
    for char in split_selfies(selfies):
        if (char == ".") and ("." not in vocab_stoi):
            raise ValueError("...")
        integer_encoded.append(vocab_stoi[char])

That sounds better! I've added the suggested one-pass approach in a new commit to make it easier to merge. It passes all the tests, so feel free to merge it or let me know if I can help with anything else!
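
For readers following along, the one-pass approach can be sketched as a self-contained function. This is an illustration, not the merged code: encode_one_pass and split_tokens are hypothetical names, split_tokens stands in for split_selfies, and the error message is abbreviated from the one proposed in the first commit.

```python
def split_tokens(selfies):
    """Hypothetical stand-in for split_selfies: yields '[...]' symbols and '.'."""
    i = 0
    while i < len(selfies):
        if selfies[i] == ".":
            yield "."
            i += 1
        else:
            j = selfies.index("]", i) + 1  # end of the current bracketed symbol
            yield selfies[i:j]
            i = j


def encode_one_pass(selfies, vocab_stoi):
    """Encode symbols in a single pass, raising a descriptive error for '.'."""
    integer_encoded = []
    for char in split_tokens(selfies):
        if char == "." and "." not in vocab_stoi:
            raise ValueError(
                "The SELFIES string contains unconnected molecules "
                "(given by the '.' character), but vocab_stoi does not "
                "contain the '.' key."
            )
        integer_encoded.append(vocab_stoi[char])
    return integer_encoded


vocab = {"[C]": 0, "[O]": 1, "[N]": 2}
encoded = encode_one_pass("[C][O][N]", vocab)

# A dot with no matching vocabulary entry raises ValueError instead of KeyError.
try:
    encode_one_pass("[C][O].[N]", vocab)
    message = None
except ValueError as err:
    message = str(err)

# Adding the '.' key to the vocabulary makes the same string encodable.
encoded_with_dot = encode_one_pass("[C][O].[N]", {**vocab, ".": 3})
```

Because the generator is iterated exactly once, no cast to a list is needed, and any other missing symbol still surfaces as a plain KeyError.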

MarioKrenn6240 · 2023-11-23T17:28:18Z

Thank you!

fix aspuru-guzik-group#104 Add check for dot symbol and warn user

1e60b91

Move search for dot symbol in try-except

1d22f1d

alstonlo reviewed Nov 21, 2023

Add one-pass check for unconnected molecules

00756c6

MarioKrenn6240 merged commit 832ada9 into aspuru-guzik-group:master Nov 23, 2023
1 check failed
