Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #104 Add check for dot symbol and warn user #110

Merged
merged 3 commits into from
Nov 23, 2023

Conversation

vandrw
Copy link
Contributor

@vandrw vandrw commented Nov 15, 2023

When using the results of the SELFIES library for an ML model, I encountered some issues while preparing the data for training. If one uses the get_alphabet_from_selfies to generate a list of tokens, and then encodes each input according to it using selfies_to_encoding, it is possible to get a ValueKey error for the dot symbol. More details in #104.

This can be a bit unintuitive, as the key error can result from using the vocabulary generated from the same dataset that is fed to the encoding function.

With this pull request, I suggest adding a more descriptive error for this case, which could help users that are not very knowledgeable in chemistry understand what can be done.

@MarioKrenn6240
Copy link
Collaborator

One question on the PR (which might change variable types thus introduce unexpected behaviours): Why do you recast char_list as a list? char_list = list(split_selfies(selfies))

@vandrw
Copy link
Contributor Author

vandrw commented Nov 19, 2023

As it currently is, char_list is a generator, which will be consumed when we search for the "." character. If the type is important, we could perform the line below in a try-except statement. This should also improve speed for long strings, as we are not evaluating twice. I will draft another commit shortly with this new version.

integer_encoded = [vocab_stoi[char] for char in char_list]

try:
integer_encoded = [vocab_stoi[char] for char in char_list]
except KeyError as e:
if e.args[0] == ".":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! Sorry, I would prefer to merge the first commit since it reads more clearly!

char_list = list(split_selfies(selfies))

# Check if SELFIES string contains unconnected molecules
if "." in list(char_list) and not "." in vocab_stoi:
    raise ValueError(
        "The SELFIES string contains two unconnected molecules "
        "(given by the '.' character), but vocab_stoi does not "
        "contain the '.' key. Please add it or separate the molecules."
    )

integer_encoded = [vocab_stoi[char] for char in char_list]

I agree that the first cast to char_list is necessary, but I think there is a redundant one in the if-statement. Alternatively, maybe a one-pass approach as follows could work:

integer_encoded = []

for char in split_selfies(selfies):
    if (char == ".") and ("." not in vocab_stoi):
        raise ValueError("...")
    integer_encoded.append(vocab_stoi[char])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds better! I've added the suggested one-pass approach in a new commit to make it easier to merge. It passes all the tests, so feel free to merge it or let me know if I can help with anything else!

@MarioKrenn6240 MarioKrenn6240 merged commit 832ada9 into aspuru-guzik-group:master Nov 23, 2023
1 check failed
@MarioKrenn6240
Copy link
Collaborator

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants