German lemmatizer performance is bad? #1382

Brentably opened this issue Apr 13, 2024 · 11 comments

@Brentably

Hello! I'm currently trying to use Stanza's German lemmatizer for a project I'm working on. As far as I understand, this should be on par with the most accurate publicly available lemmatizers out there, if not the most accurate.

However, I'm really confused by the poor German performance. I get the following results when lemmatizing:

möchtest => möchtessen (should be mögen)
Willst => Willst (should be wollen)
sagst => sagst (should be sagen)
Sage => Sage (should be sagen)
aß => aß (should be essen)
Sprich => Sprich (should be sprechen)

These are all among the top ~50 verbs in German, and none of these inflections is particularly rare, so I'm really confused by the performance. I recently did some digging and found that the HDT package should be more accurate, and it is, but the results are still unimpressive:

möchtest => möchtes (should be mögen)
Willst => Willst (should be wollen)
sagst => sagsen (should be sagen)
Sage => sagen (correct)
aß => assen (should be essen)
Sprich => sprechen (correct)

This gets 2/6 correct instead of 0/6, but of course that's still really poor.
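
For reference, a minimal sketch of how results like these can be reproduced with Stanza's standard Pipeline API (not necessarily my exact invocation; the 'hdt' package name is assumed here for the HDT-trained German models):

import stanza

# Download the models first if needed:
# stanza.download('de')
# stanza.download('de', package='hdt')

nlp_default = stanza.Pipeline('de', processors='tokenize,pos,lemma')
nlp_hdt = stanza.Pipeline('de', processors='tokenize,pos,lemma', package='hdt')

for word in ['möchtest', 'Willst', 'sagst', 'Sage', 'aß', 'Sprich']:
    for name, nlp in [('default', nlp_default), ('hdt', nlp_hdt)]:
        lemma = nlp(word).sentences[0].words[0].lemma
        print(f'{name}: {word} => {lemma}')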

I recently found the website Cooljugator (https://cooljugator.com/de); you can search up a verb, either conjugated or in the infinitive, and it seems to have near-perfect performance for all of these.

Can anyone explain or point me in the right direction?

I'm considering getting a bunch of data and trying to supplement performance with my own lookup table right now, but would rather not spend the few days of effort that would require.

Thanks!

@AngledLuffa
Collaborator

The main issue is that the training data just doesn't have those verbs in it. If we had some kind of lexicon available with expected lemmas, we could include that, but we don't have one AFAIK. I can do some digging for that if you don't have suggestions.

One example which shows up in the training data with a different result is Sage. In each of the following sentences, the GSD training data has Sage -> Sage:

# text = Der Sage nach wurden die Nelken 1270 vom Heer des französischen Königs Ludwig IX.
# text = Die Sage, deren historischer Gehalt nicht zu sichern ist, hat insofern ätiologische Funktion.
# text = In den 1920er Jahren hatte er Kontakt mit Cornelia Bentley Sage Quinton, die als erste Frau in den USA ein größeres Kunstmuseum leitete.

One thought which occurs to me is that maybe the lemmatizer's model should have some input based on the POS tag given, whereas it currently doesn't use the POS except for the dictionary lookup. I wonder if that would help in terms of lemmatizing unknown words.
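
As a very rough illustration of the idea (a PyTorch sketch, not Stanza's actual lemmatizer code): the POS tag would get its own embedding, concatenated onto each character embedding before the seq2seq encoder, so that e.g. sagst tagged VERB and Sage tagged NOUN produce different encoder states even when the characters overlap.

import torch
import torch.nn as nn

class CharEncoderWithPOS(nn.Module):
    # Toy character-level encoder conditioned on a POS tag; hypothetical, not Stanza's model.
    def __init__(self, n_chars, n_pos, char_dim=64, pos_dim=16, hidden_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        # encoder input = character embedding + word-level POS embedding
        self.rnn = nn.LSTM(char_dim + pos_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids, pos_id):
        # char_ids: (batch, word_len) character indices; pos_id: (batch,) POS indices
        chars = self.char_emb(char_ids)
        pos = self.pos_emb(pos_id).unsqueeze(1).expand(-1, chars.size(1), -1)
        encoded, state = self.rnn(torch.cat([chars, pos], dim=-1))
        return encoded, state  # a decoder would attend over `encoded` to emit the lemma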

@Brentably
Author

> The main issue is that the training data just doesn't have those verbs in it. If we had some kind of lexicon available with expected lemmas, we could include that, but we don't have one AFAIK. I can do some digging for that if you don't have suggestions.

You mean like some better lookup data? TBH I was just going to scrape some stuff, but would be happy to send it along.

Also, pardon my naiveté, but I'm just generally confused. Isn't this supposed to be state of the art for lemmatizers? Are the best lemmatizers all closed source and made in-house, or are there just not that many non-English lemmatizer-dependent applications? Is there another popular solution to this problem that I'm ignorant of?

@AngledLuffa
Collaborator

The performance was measured on the test portions of the datasets, so to the extent those are limited and don't really cover some important concepts, the test scores will also reflect that.

I don't know what the best German lemmatizer is, but I can take some time later this week, or a chat with my PI, to figure out other sources of training data. I also think embedding the POS tags in the seq2seq model will likely help it know whether to use a verb-style or noun-style ending for unknown words in a language such as German.

@AngledLuffa
Collaborator

Options for additional training data, from @manning. I think the two main choices are:

- https://github.com/Liebeck/IWNLP.Lemmatizer (uses Wikidict, probably good for the future)
- https://github.com/WZBSocialScienceCenter/germalemma (says it's unmaintained)

I also have high hopes that using the POS as an input embedding to the seq2seq will at least help, but @manning points out that there are a lot of irregulars in German which may or may not be helped by such an approach.

I don't expect to get to this in the next couple of days, but perhaps next week or so I can start in on it.

@Brentably
Author

I scraped ~5,000 words of data from a conjugation / declension website. The data seems to be high quality.

@AngledLuffa
Collaborator

That does sound like it could be a useful resource!

@Brentably
Author

Sent you an email!

@AngledLuffa
Collaborator

I started going through the lemma sheet you sent, thinking we could add that as a new lemmatizer model in the next version (which will hopefully be soon).

One thing I came across in my investigation is a weirdness in the GSD lemmas for some words, but not all:

UniversalDependencies/UD_German-GSD#35

I also found some inconsistencies in the JSON you'd sent us. (Was that script in TypeScript?)

For example, early on, words that translate as "few" and "at least" are included under the same lemma:

{
    "word": "wenig",
    "pos": "adj",
    "versions": [
      "weniger",
      "wenigen",
      "wenigem",
      "wenige",
      "weniges",
      "wenig",
      "minder",
      "mindesten"
    ]
  },

wenig and mindesten translate differently on Google Translate, and mindesten is treated as its own lemma in GSD.

Also treated differently in GSD: welches -> welcher, not welch, and the POS is DET:

33      welches welcher DET     PRELS   Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel       37      obj     _       _

 {
    "word": "welch, -e, -er, -es",
    "pos": "pron",
    "versions": ["welch", "welche", "welcher", "welches", "welchen", "welchem"]
  },

There are some unusual POS in the data you sent us:

Expected a POS of NOUN for Mann, Mannes, Männer, Männern, but the entry has der instead:

{
    "word": "Mann",
    "pos": "der",
    "versions": ["Mann", "Mannes", "Manns", "Manne", "Männer", "Männern"]
  },

Also expected NOUN:

{
    "word": "Kind",
    "pos": "das",
    "versions": ["Kind", "Kindes", "Kinds", "Kinde", "Kinder", "Kindern"]
  },

Ambiguous POS values are hard for us to resolve in an automated fashion:

{
    "word": "kein",
    "pos": "pron/art",
    "versions": ["kein", "keines", "keine", "keinem", "keinen", "keiner"]
  },

Not sure what to do with these:

  { "word": "nichts, nix", "pos": "pron", "versions": ["nichts", "nix"] },
  { "word": "nun, nu", "pos": "adv", "versions": ["nun", "nu"] },

Another example of a POS that isn't a UPOS:

  { "word": "Frage", "pos": "die", "versions": ["Frage", "Fragen"] },
  { "word": "Hand", "pos": "die", "versions": ["Hand", "Hände", "Händen"] },

If you can resolve these or suggest how to resolve them, we can include this in the lemmatizer. Adding a long list of verb, noun, and adjective conjugations and declensions would certainly be quite useful for avoiding future German lemmatizer mistakes.
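
For what it's worth, here is one possible way to flatten entries like the ones above into a (surface form, UPOS) -> lemma table. The field names match the JSON excerpts; the article-to-NOUN mapping and the "set aside for manual review" handling are assumptions, not Stanza's actual conversion code:

import json

POS_MAP = {
    'der': 'NOUN', 'die': 'NOUN', 'das': 'NOUN',  # gender marker implies a noun
    'verb': 'VERB', 'adj': 'ADJ', 'adv': 'ADV', 'pron': 'PRON',
}

def build_lookup(path):
    with open(path, encoding='utf-8') as fin:
        entries = json.load(fin)
    lookup = {}
    needs_review = []
    for entry in entries:
        upos = POS_MAP.get(entry['pos'])
        # headwords like "welch, -e, -er, -es": keep only the first form as the lemma
        lemma = entry['word'].split(',')[0].strip()
        if upos is None:
            # e.g. "pron/art" for kein is ambiguous, so leave it for manual resolution
            needs_review.append(entry)
            continue
        for form in entry['versions']:
            lookup[(form, upos)] = lemma
    return lookup, needs_review

# lookup, needs_review = build_lookup('german_lemmas.json')
# lookup[('Männern', 'NOUN')]  ->  'Mann'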

@Brentably
Author

Yes, the script was in TypeScript.

Is it necessary to have the part of speech in the data? I have an improved list that I also validated with LLMs and cleaned up a decent amount, but I stopped including the part of speech.

Sent another email with the new list.

@Brentably
Author

Also, the "der" and "das" on the POS represents the gender, which is why it's just not marked as NOUN, btw

@Brentably
Author

Anyway, if we need to add the part of speech back, I suggest just running the data through Claude or o1 to generate the parts of speech, which I'm happy to do. Let me know how I can help!

Thanks
