
“Cold” never gets classified as a noun in POS tagging. #144

Open
midnight-wonderer opened this issue Nov 27, 2024 · 2 comments
@midnight-wonderer

  const doc = nlp.readDoc('I have a cold.');
  doc.printTokens();

outputs

token      p-spaces   prefix  suffix  shape   case    nerHint type     normal/pos
———————————————————————————————————————————————————————————————————————————————————————
I                 0   I       I       X       2       0       word     i / PRON
have              1   ha      ave     xxxx    1       0       word     have / AUX
a                 1   a       a       x       1       0       word     a / DET
cold              1   co      old     xxxx    1       0       word     cold / ADJ
.                 0   .       .       .       0       0       punctuat . / PUNCT

In every case I have encountered, "cold" is classified as an adjective, never as a noun.

P.S. wink-eng-lite-web-model v1.8.0

@rachnachakraborty
Member

Hi @midnight-wonderer

Thanks for sharing your observations!

POS tagging in models like wink-eng-lite-web-model employs probabilistic methods. While effective, these methods don't guarantee 100% accuracy. The quality of training data plays a crucial role in determining the probabilities.

We will explore possibilities for improvements.

Also, please note that .printTokens() is an undocumented, debugging-only API. For better reliability, please use the documented methods of the API.

Best,
Rachna

@midnight-wonderer
Author

Thank you for the snappy response.

I broadly understand the machine-learning aspect, but the gist of my concern is that, from what I have seen, the model appears to have 0% accuracy on this particular word (and potentially others): it has never gotten it right.

I don't know the internal details, but I suspect the data tagging used for model training.
