A question about tokenization #27

luryZhu · 2023-02-23T08:33:38Z

Dear Development Team,

I ran the code with the best model and found that the tokens it produces for a sentence seem to differ from those of other tokenizers, especially for words with attached punctuation.

For instance, for the sentence "Not only was the food outstanding, but the little 'perks' were great.", the tokens are:

["Not","only","was","the","food","outstanding",",","but","the","little","'perks","›","were","great","."]

The word 'perks' is tokenized as "'perks" and "›".

Would it be possible to replace this parser's tokenizer with another one, such as Stanza's?

Thank you
