Lemmatization issues [Italian][Spanish][French] #12954
Comments
Thanks for the examples, they'll be helpful when looking at how to improve the lemmatizers in the future!
Also, in French "domicile" becomes "domicil", which is not correct. A sanity check could be added: double lemmatization should not change the result.
This is with version 3.7.2.
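For example, a rough version of that check could look like this (a sketch assuming the fr_core_news_md pipeline is installed; lemma_is_stable is just an illustrative helper, not a spaCy API):

import spacy

nlp = spacy.load("fr_core_news_md")

def lemma_is_stable(word):
    # Lemmatize once, then lemmatize the result again; the two lemmas should match.
    first = nlp(word)[0].lemma_
    second = nlp(first)[0].lemma_
    return first == second

print(lemma_is_stable("domicile"))  # expected True; a buggy lemma like "domicil" makes it False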
The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. In these pipelines, the tags come from the morphologizer. Here it looks like it's assigning the wrong tag to "domicile", so the wrong set of rules is applied.

The statistical components like the tagger and morphologizer aren't really intended for processing individual words out of context. Even just a smidgen of context improves the results:

import spacy
nlp = spacy.load("fr_core_news_md")
assert nlp("le domicile")[1].lemma_ == "domicile"

If you want to double-check that the rules are working as intended (since sometimes it may be a problem with the rules or exceptions and not the POS tag), you can test just the lemmatizer component by providing the POS tag by hand:

import spacy
nlp = spacy.load("fr_core_news_md")
doc = nlp.make_doc("domicile")  # just tokenization, no pipeline components
doc[0].pos_ = "NOUN"
assert nlp.get_pipe("lemmatizer")(doc)[0].lemma_ == "domicile"
Thanks for helping. It also explains why lemmatization does not work when disabling other modules:

NLP_FR = spacy.load("fr_core_news_md", disable=["morphologizer", "parser", "senter", "ner", "attribute_ruler"])

There is no error when calling .lemma_, but it does nothing. It would be better to throw an error if possible. In this example, the word "domicil" does not exist at all in the French dictionary. Based on the data I have to handle, there are around 1185 items where double lemmatization gives different (better) results. Version 3.2.0 kept "domicile" correct when it was submitted for lemmatization. Would it be possible to train it with this project: http://www.lexique.org/ ?
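A possible workaround sketch, assuming the default component names in fr_core_news_md: disable only the components the rule-based lemmatizer does not need, and keep the morphologizer so POS tags are still assigned.

import spacy

# Keep the morphologizer and attribute_ruler so the rule-based lemmatizer still
# gets POS tags; only drop components that aren't needed for lemmas.
NLP_FR = spacy.load("fr_core_news_md", disable=["parser", "senter", "ner"])

doc = NLP_FR("le domicile")
print([(t.text, t.pos_, t.lemma_) for t in doc])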
We wouldn't use the lexique data in our pipelines due to the non-commercial clause in the CC BY-NC license, but if the license works for your use case and you'd prefer to use it, it's pretty easy to create a lookup table that you can use with the lemmatizer in lookup mode. We have an example of a French lookup lemmatizer table here:
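As a rough sketch of building such a table (the file name and the tab-separated two-column layout are assumptions about the export, not anything spaCy requires):

import csv

# Build a {form: lemma} lookup table from a hypothetical tab-separated export,
# with the inflected form in the first column and the lemma in the second.
lemma_lookup = {}
with open("lexique_export.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) >= 2:
            lemma_lookup.setdefault(row[0], row[1])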
@adrianeboyd ok, thanks for the explanation. I guess spaCy removes the last character when it encounters an unknown word to lemmatize:

NLP_FR("xxxxx")[0].lemma_
Out[10]: 'xxxx'
NLP_FR(NLP_FR("xxxxx")[0].lemma_)[0].lemma_
Out[11]: 'xxx'
Overall it sounds like a lookup lemmatizer, which doesn't depend on context, might be a better fit for these kinds of examples. You can see how to switch from the rule-based lemmatizer to the lookup lemmatizer: https://spacy.io/models#design-modify
You can also provide your own lookup table instead of using the default one from spacy-lookups-data.
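A sketch of both steps, using the Lookups API and the lemmatizer's "lookup" mode (the tiny table here is only an illustration; a real one would come from your own data):

import spacy
from spacy.lookups import Lookups

nlp = spacy.load("fr_core_news_md")

# Replace the rule-based lemmatizer with one running in lookup mode.
nlp.remove_pipe("lemmatizer")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Provide a custom form -> lemma table under the "lemma_lookup" name.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"domicile": "domicile", "domiciles": "domicile"})
lemmatizer.initialize(lookups=lookups)

print(nlp("domicile")[0].lemma_)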
This is not what is going on. Not that there can't be problems with the lemmatizer rules and tables, but I'd be very surprised if simply removing any final character were one of the existing rules for any of the rule-based lemmatizers provided in the trained pipelines. You can take a closer look at the rules for French, along with the language-specific lemmatizer implementation. These are suffix rewrite rules, and I think one of them is being applied to the end of the word here.
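For illustration only, the general shape of a suffix rewrite rule is roughly this (a simplified stand-in, not the actual spaCy implementation or the real French rule tables):

def apply_suffix_rules(word, rules):
    # Each rule is (old_suffix, new_suffix); the first matching rule rewrites the end of the word.
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word

# Hypothetical rules for nouns; the real tables are selected by the token's POS tag.
noun_rules = [("aux", "al"), ("eux", "eu"), ("s", "")]
print(apply_suffix_rules("chevaux", noun_rules))    # -> cheval
print(apply_suffix_rules("domiciles", noun_rules))  # -> domicile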
@adrianeboyd thanks, using the rules returns the same results as before with 3.2, which are much better in our case.
Hello,
As a follow-up on #11298 and #11347, I would like to report some lemmatization problems with the spaCy 3.6 models for Italian, Spanish, and French. We did not have these issues with version 3.2.
How to reproduce the behaviour
Here are some examples:
I guess the issue with tokens at the beginning of sentences (because they are wrongly detected as PROPN) has already been mentioned many times.
Your Environment