
Simple Tokenizer not separating punctuation correctly #35

Open
juanjoDiaz opened this issue Jan 19, 2023 · 4 comments
Labels: documentation, question

Comments

juanjoDiaz (Collaborator) commented Jan 19, 2023

It seems that punctuation symbols are not correctly separated.

>>> simple_tokenizer('"test": "that"')
['"', 'test', '":', '"', 'that', '"']

I would have expected:
['"', 'test', '"', ':', '"', 'that', '"']

P.S.: I would rather have my big refactor reviewed before applying any other changes, to avoid constant conflicts in the PR 🙂

adbar added the documentation and question labels on Jan 19, 2023
adbar (Owner) commented Jan 19, 2023

Thanks for the feedback!

The tokenizer does something slightly different from what is usually expected: it clusters characters together while segmenting the input. Since the output only consists of lemmata, the idea is to keep it simple and to group punctuation marks, because they're not relevant in this case.
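
For clarity, a rough sketch of the clustering idea (illustrative regex only, not the actual implementation): a "+" on the punctuation class merges adjacent marks into a single token.

>>> import re
>>> # adjacent punctuation marks are grouped into one token
>>> re.findall(r"\w+|[^\w\s]+", '"test": "that"')
['"', 'test', '":', '"', 'that', '"']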

Maybe the name could be changed (word tokenizer?); at the very least, this behavior should be documented.

adbar (Owner) commented Jan 19, 2023

Yes, your PR has priority now!

juanjoDiaz (Collaborator, Author)

Hi @adbar,

Coming back to this.
Is there any reason to group symbols? Performance or something else?
Or was it just because it was simpler and does the job, since symbols are ignored anyway?

adbar (Owner) commented May 12, 2023

Yes, it's faster and simpler. Otherwise you would have to tokenize punctuation accurately (which is a different task) and run the lemmatizer on it (which is useless in the current context and only adds processing time).
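
As a rough illustration of that trade-off, a sketch comparing a clustering pattern with a per-character split (hypothetical patterns and timing harness, not a benchmark of the actual tokenizer):

import re
import timeit

# illustrative patterns only: clustering yields fewer tokens to lemmatize
cluster = re.compile(r"\w+|[^\w\s]+")
split = re.compile(r"\w+|[^\w\s]")

text = '"test": "that" -- etc. ' * 1000

for name, pattern in (("cluster", cluster), ("split", split)):
    n_tokens = len(pattern.findall(text))
    t = timeit.timeit(lambda: pattern.findall(text), number=100)
    print(f"{name}: {n_tokens} tokens, {t:.3f}s for 100 runs")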
