
Simple Tokenizer not separating punctuation correctly #35

Open
juanjoDiaz opened this issue Jan 19, 2023 · 4 comments
Labels: documentation, question

Comments

juanjoDiaz (Collaborator) commented Jan 19, 2023

It seems that punctuation symbols are not correctly separated.

>>> simple_tokenizer('"test": "that"')
['"', 'test', '":', '"', 'that', '"']

I would have expected:
['"', 'test', '"', ':', '"', 'that', '"']

P.S.: I would rather have my big refactor reviewed before applying any other changes, to avoid constant conflicts in the PR 🙂

adbar added the documentation and question labels on Jan 19, 2023
adbar (Owner) commented Jan 19, 2023

Thanks for the feedback!

The tokenizer does something slightly different from what is usually expected: it clusters characters together while segmenting the input. Since the output only consists of lemmata, the idea is to keep it simple and to group punctuation marks, because they're not relevant in this case.
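
For clarity, a rough sketch of the clustering idea (illustrative regex only, not the actual implementation): a "+" on the punctuation class merges adjacent marks into a single token.

>>> import re
>>> # adjacent punctuation marks are grouped into one token
>>> re.findall(r"\w+|[^\w\s]+", '"test": "that"')
['"', 'test', '":', '"', 'that', '"']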

Maybe the name could be changed (word tokenizer?); at the very least, this behavior should be documented.

adbar (Owner) commented Jan 19, 2023

Yes, your PR has priority now!

juanjoDiaz (Collaborator, Author)

Hi @adbar,

Coming back to this.
Is there any reason to group symbols? Performance or something else?
Or was it just because it was simpler and does the job, since symbols are ignored anyway?

adbar (Owner) commented May 12, 2023

Yes, it's faster and simpler. Otherwise you would have to tokenize punctuation accurately (which is a different task) and run the lemmatizer on it (which is useless in the current context and only adds processing time).
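
As a rough illustration of that trade-off, a sketch comparing a clustering pattern with a per-character split (hypothetical patterns and timing harness, not a benchmark of the actual tokenizer):

import re
import timeit

# illustrative patterns only: clustering yields fewer tokens to lemmatize
cluster = re.compile(r"\w+|[^\w\s]+")
split = re.compile(r"\w+|[^\w\s]")

text = '"test": "that" -- etc. ' * 1000

for name, pattern in (("cluster", cluster), ("split", split)):
    n_tokens = len(pattern.findall(text))
    t = timeit.timeit(lambda: pattern.findall(text), number=100)
    print(f"{name}: {n_tokens} tokens, {t:.3f}s for 100 runs")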
