Words that match more than one lemma #94
Comments
Tough one. This is an absolute borderline case, since multiple matches are usually not present in the lists, and they may be annotated differently. As for the "noun vs. verb" issue, this is indeed one of the main limitations of simplemma: it does not operate with syntactic information.
Personally, I think the best way to handle this is to actually capture all examples present in the corpus (Schulen -> [Schule, schulen]), order them by frequency, and add a new API (let's call it simplemma.lemmatize_all) which returns a list rather than a single word. This doesn't require drastically complicating the architecture, but it would still be useful in many situations.
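A minimal sketch of what the proposed API could look like. The lemmatize_all name comes from the comment above, but the lookup table, entries, and signatures here are purely illustrative, not simplemma's actual data or code:

```python
# Hypothetical sketch of the proposed lemmatize_all API: the lookup
# table maps a token to all candidate lemmas, ordered by corpus frequency.
# The entries are illustrative, NOT taken from simplemma's dictionaries.
LEMMA_TABLE = {
    "Schulen": ["Schule", "schulen"],  # noun reading first (more frequent)
}

def lemmatize(token: str, lang: str = "de") -> str:
    """Current single-result behaviour: return the most frequent lemma."""
    return LEMMA_TABLE.get(token, [token])[0]

def lemmatize_all(token: str, lang: str = "de") -> list[str]:
    """Proposed API: return every candidate lemma, most frequent first."""
    return LEMMA_TABLE.get(token, [token])

print(lemmatize("Schulen"))      # single best candidate
print(lemmatize_all("Schulen"))  # all candidates, by frequency
```

Keeping the existing single-result function unchanged and adding the list-returning variant alongside it would preserve backward compatibility.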
The approach you suggest would probably give better results, but memory is already a concern for the available dictionaries. One way or the other, there is always a trade-off between precision, memory, and processing time.
Regarding the issue of having multiple words separated by
So, I think we should just modify the training script to correct these. As for the proposal of having a list of potential lemmas instead of a single lemma, that is exactly what I proposed in the original issue. I guess it's a matter of offering both options so the user can control memory use.
It would be great to publish the dictionary creation scripts. In particular, I feel it would be nice to augment the existing data by passing a corpus through an LLM, which should be at least decent at the job. My users (https://github.com/FreeLanguageTools/vocabsieve) often complain that the lemmatizer coverage is poor for some languages, which severely hinders usability. I know this isn't very elegant and may increase disk space use, but sometimes practicality matters more. A good first step in this direction would be to produce better eval datasets, though. I think the current eval datasets are too small to be representative, as they contain quite few unique lemmas. For this primarily dictionary-based lemmatization, I don't think there is a real need to separate training and validation sets, because you are mostly memorizing rather than generalizing.
Hi,
I noticed a problem with the behaviour for the word 'schulen' in German (as a noun vs. as a verb) when it is written in all capitals.
The all-caps version appears to result in a match with the lowercase form:
Of course, it is impossible to know which version to prefer in this case.
I noticed that some of the words in the dictionaries contain more than one lemma.
For example:
Why are there words with multiple lemmas in the dictionaries?
Should we consider adding this on the simplemma side? I mean changing the strategies so they can yield multiple matches and return them somehow.
Wdyt?
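One way the strategy change could look: each strategy yields a list of candidates, and the driver collects results from the first strategy that produces any. This is a hypothetical sketch of that idea; the function names and the dictionary entries are invented for illustration and do not reflect simplemma's internals:

```python
# Hypothetical multi-match strategy chain; names and entries are
# illustrative only, not simplemma's actual architecture.
from typing import Callable

Strategy = Callable[[str], list[str]]

def dictionary_strategy(token: str) -> list[str]:
    table = {"Schulen": ["Schule", "schulen"]}  # illustrative entries
    return table.get(token, [])

def identity_strategy(token: str) -> list[str]:
    """Last-resort fallback: the token is its own lemma."""
    return [token]

def lemmatize_candidates(token: str, strategies: list[Strategy]) -> list[str]:
    """Run strategies in order; return candidates from the first that matches."""
    for strategy in strategies:
        candidates = strategy(token)
        if candidates:
            return candidates
    return []

print(lemmatize_candidates("Schulen", [dictionary_strategy, identity_strategy]))
print(lemmatize_candidates("Haus", [dictionary_strategy, identity_strategy]))
```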