-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix #104 Add check for dot symbol and warn user #110
Conversation
One question on the PR (which might change variable types thus introduce unexpected behaviours): Why do you recast |
As it currently is, selfies/selfies/utils/encoding_utils.py Line 51 in 120b776
|
selfies/utils/encoding_utils.py
Outdated
try: | ||
integer_encoded = [vocab_stoi[char] for char in char_list] | ||
except KeyError as e: | ||
if e.args[0] == ".": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR! Sorry, I would prefer to merge the first commit since it reads more clearly!
char_list = list(split_selfies(selfies))
# Check if SELFIES string contains unconnected molecules
if "." in list(char_list) and not "." in vocab_stoi:
raise ValueError(
"The SELFIES string contains two unconnected molecules "
"(given by the '.' character), but vocab_stoi does not "
"contain the '.' key. Please add it or separate the molecules."
)
integer_encoded = [vocab_stoi[char] for char in char_list]
I agree that the first cast to char_list
is necessary, but I think there is a redundant one in the if-statement. Alternatively, maybe a one-pass approach as follows could work:
integer_encoded = []
for char in split_selfies(selfies):
if (char == ".") and ("." not in vocab_stoi):
raise ValueError("...")
integer_encoded.append(vocab_stoi[char])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds better! I've added the suggested one-pass approach in a new commit to make it easier to merge. It passes all the tests, so feel free to merge it or let me know if I can help with anything else!
Thank you! |
When using the results of the SELFIES library for an ML model, I encountered some issues while preparing the data for training. If one uses the
get_alphabet_from_selfies
to generate a list of tokens, and then encodes each input according to it usingselfies_to_encoding
, it is possible to get a ValueKey error for the dot symbol. More details in #104.This can be a bit unintuitive, as the key error can result from using the vocabulary generated from the same dataset that is fed to the encoding function.
With this pull request, I suggest adding a more descriptive error for this case, which could help users that are not very knowledgeable in chemistry understand what can be done.