The tokenizer does something slightly different from what is usually expected: it clusters characters together while segmenting the input. Since the output only consists of lemmata, the idea is to keep it simple and to group punctuation marks, because they're not relevant in this case.
Maybe the name could be changed (word tokenizer?); at the very least, this behavior should be documented.
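For illustration, here is a minimal sketch of the grouping behaviour described above, assuming a regex-based approach; the function name and the sample input are hypothetical and not taken from the project's actual code:

```python
import re

# Hypothetical sketch: runs of non-alphanumeric characters are kept
# together as a single token instead of being split into individual
# punctuation marks, while word characters form word tokens.
GROUPING_PATTERN = re.compile(r"\w+|[^\w\s]+", re.UNICODE)

def group_tokenize(text: str) -> list[str]:
    """Segment `text` into word tokens and clustered punctuation runs."""
    return GROUPING_PATTERN.findall(text)

print(group_tokenize('"test": "that"'))
# ['"', 'test', '":', '"', 'that', '"']
```

Note how `":` comes out as one clustered token: the punctuation is segmented away from the words, but not split into individual characters.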
Coming back to this.
Is there any reason to group symbols? Performance or something else?
Or was it just because it was simpler and does the job, since symbols are ignored anyway?
Yes, it's faster and simpler. Otherwise you would have to tokenize punctuation accurately (which is a different task) and run the lemmatizer on it (which is useless in the current context and only means additional processing time).
It seems that punctuation symbols are not correctly separated.
I would have expected:
`['"', 'test', '"', ':', '"', 'that', '"']`
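For comparison, a minimal sketch of a tokenizer that splits punctuation into individual tokens and would yield that output; the regex approach and the sample input are assumptions for illustration, not the project's implementation:

```python
import re

# Hypothetical sketch: each non-word, non-space character becomes its own
# token, so quotation marks and colons are separated instead of clustered.
SPLITTING_PATTERN = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def split_tokenize(text: str) -> list[str]:
    """Segment `text` into words and single punctuation characters."""
    return SPLITTING_PATTERN.findall(text)

print(split_tokenize('"test": "that"'))
# ['"', 'test', '"', ':', '"', 'that', '"']
```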
P.S.: I would rather review my big refactor before applying any other changes, to avoid constant conflicts in the PR 🙂