German lemmatizer performance is bad? #1382

Brentably · 2024-04-13T23:41:40Z

Hello! I'm currently trying to use Stanza's German lemmatizer for a project I'm working on. As far as I'm concerned, this should be on par with the most accurate publically available lemmatizers out there, if not the most.

However, I'm really confused by the poor German performance. I get the following results when lemmatizing:

möchtest => möchtessen (should be mögen)
Willst => Willst (should be wollen)
sagst => sagst (should be sagen)
Sage => Sage (should be sagen)
aß => aß (should be essen)
Sprich => Sprich (should be sprechen)

These are all top ~50 verbs in german and none of these inflections are crazy rare, so I'm really confused by the performance. I recently did some digging and found out that HDT should be more accurate, and it is, but the results are still unimpressive:

möchtest => möchtes (should be mögen)
Willst => Willst (should be wollen)
sagst => sagsen (should be sagen)
Sage => sagen (correct)
aß => assen (should be essen)
Sprich => sprechen (correct)

This gets 2/6 correct instead of 0/6, but ofc that's still really poor.

I recently found this website cooljugator: https://cooljugator.com/de and for instance, you can just search up a verb, either conjugated or infinitive, and it seems to have near perfect performance for all of these.

Can anyone explain or point me in the right direction?

I'm considering getting a bunch of data and trying to supplement performance with my own lookup table right now, but would rather not spend the few days of effort that would require.

Thanks!

AngledLuffa · 2024-04-14T08:04:00Z

Main issue is that the training data just doesn't have those verbs in them. If we had some kind of lexicon available with expected lemmas, we could include that, but we don't have that AFAIK. I can do some digging for that if you don't have suggestions.

One example which shows up in the training data with a different result is Sage. In each of the following sentences, the GSD training data has Sage -> Sage:

# text = Der Sage nach wurden die Nelken 1270 vom Heer des französischen Königs Ludwig IX.
# text = Die Sage, deren historischer Gehalt nicht zu sichern ist, hat insofern ätiologische Funktion.
# text = In den 1920er Jahren hatte er Kontakt mit Cornelia Bentley Sage Quinton, die als erste Frau in den USA ein größeres Kunstmuseum leitete.

One thought which occurs to me is that maybe the lemmatizer's model should have some input based on the POS tag given, whereas it currently doesn't use the POS except for the dictionary lookup. I wonder if that would help in terms of lemmatizing unknown words.

Brentably · 2024-04-14T19:57:58Z

Main issue is that the training data just doesn't have those verbs in them. If we had some kind of lexicon available with expected lemmas, we could include that, but we don't have that AFAIK. I can do some digging for that if you don't have suggestions.

You mean like some better lookup data? TBH I was just going to scrape some stuff, but would be happy to send it along.

Also, pardon my naiveté but I'm just generally confused? Isn't this like state of the art for lemmatizers? Are the best lemmatizers all closed source, made in-house, or are there just not that many non-english lemmatizer-dependent applications? Is there another popular solution to this problem that I am ignorant to?

AngledLuffa · 2024-04-14T20:09:31Z

The performance was measured on the test portions of the datasets, so to the extent those are limited and don't really cover some important concepts, the test scores will also reflect that.

I don't know what the best German lemmatizer is, but I can take some time later this week or in a chat with my PI to figure out other sources of training data, and I think embedding the POS tags in the seq2seq model will likely help it know whether or not to use a verb style ending or noun style ending in a language such as German for unknown words

AngledLuffa · 2024-04-17T21:31:14Z

options for additional training data, from @manning

I think the two main choices are:
https://github.com/Liebeck/IWNLP.Lemmatizer
(uses Wikidict, probably good for future)
https://github.com/WZBSocialScienceCenter/germalemma
(says unmaintained).

I also have high hopes for using the POS as an input embedding to the seq2seq at least helping, but @manning points out that there are a lot of irregulars in German which may or may not be helped by such an approach.

I don't expect to get to this in the next couple days, but perhaps next week or so I can start in on it

Brentably · 2024-04-17T22:14:33Z

I scraped some ~5000 words of data from a conjugation / declination website. They seem to be high quality.

AngledLuffa · 2024-04-17T22:15:37Z

That does sound like it could be a useful resource!

Brentably · 2024-04-17T22:41:53Z

Sent you an email!

AngledLuffa · 2024-08-08T17:42:36Z

I started going through the lemma sheet you sent, thinking we could add that as a new lemmatizer model in the next version. (Which will hopefully be soon.)

One thing I came across in my investigation is a weirdness in the GSD lemmas for some words, but not all:

UniversalDependencies/UD_German-GSD#35

I also found some inconsistencies in the json you'd sent us. (Was that script in typescript?)

so for example, early on, words that translate as "few" and "at least" are included in the same lemma:

{
    "word": "wenig",
    "pos": "adj",
    "versions": [
      "weniger",
      "wenigen",
      "wenigem",
      "wenige",
      "weniges",
      "wenig",
      "minder",
      "mindesten"
    ]
  },

wenig and mindesten translate differently on google translate, and mindesten is treated as its own lemma in GSD:

Also treated differently in GSD: welches -> welcher, not welch, and the pos is DET

33      welches welcher DET     PRELS   Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel       37      obj     _       _

 {
    "word": "welch, -e, -er, -es",
    "pos": "pron",
    "versions": ["welch", "welche", "welcher", "welches", "welchen", "welchem"]
  },

There are some unusual POS in the data you sent us:

POS of NOUN for Mann, Mannes, Männer, Männern

{
    "word": "Mann",
    "pos": "der",
    "versions": ["Mann", "Mannes", "Manns", "Manne", "Männer", "Männern"]
  },

Also NOUN:

{
    "word": "Kind",
    "pos": "das",
    "versions": ["Kind", "Kindes", "Kinds", "Kinde", "Kinder", "Kindern"]
  },

Ambiguous is hard for us to resolve in an automated fashion:

{
    "word": "kein",
    "pos": "pron/art",
    "versions": ["kein", "keines", "keine", "keinem", "keinen", "keiner"]
  },

not sure what to do with:

  { "word": "nichts, nix", "pos": "pron", "versions": ["nichts", "nix"] },
  { "word": "nun, nu", "pos": "adv", "versions": ["nun", "nu"] },

another example of POS that isn't a UPOS:

  { "word": "Frage", "pos": "die", "versions": ["Frage", "Fragen"] },
  { "word": "Hand", "pos": "die", "versions": ["Hand", "Hände", "Händen"] },

If you can resolve these or suggest how to resolve them, we can include this in the lemmatizer. Certainly in terms of adding a long list of verb, noun, & adj conjugations & declensions, it would be quite useful to avoid future German lemmatizer mistakes.

Brentably · 2024-11-13T21:22:21Z

Yes, the script was in typescript.

Is it necessary to have the part of speech on the data? I have an improved list that I also validated with LLMs and cleaned a decent amount, but I started forgoing getting the part of speech on there.

Sent another email with the new list.

Brentably · 2024-11-13T21:22:50Z

Also, the "der" and "das" on the POS represents the gender, which is why it's just not marked as NOUN, btw

Brentably · 2024-11-13T21:25:25Z

Anyway, if we need to add part of speech back, I suggest just running the data through claude or o1 to generate the parts of speech, which I'm happy to do. LMK however I can help!

Thanks

Brentably added the question label Apr 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

German lemmatizer performance is bad? #1382

German lemmatizer performance is bad? #1382

Brentably commented Apr 13, 2024

AngledLuffa commented Apr 14, 2024

Brentably commented Apr 14, 2024

AngledLuffa commented Apr 14, 2024

AngledLuffa commented Apr 17, 2024

Brentably commented Apr 17, 2024

AngledLuffa commented Apr 17, 2024

Brentably commented Apr 17, 2024

AngledLuffa commented Aug 8, 2024

Brentably commented Nov 13, 2024

Brentably commented Nov 13, 2024

Brentably commented Nov 13, 2024

German lemmatizer performance is bad? #1382

German lemmatizer performance is bad? #1382

Comments

Brentably commented Apr 13, 2024

AngledLuffa commented Apr 14, 2024

Brentably commented Apr 14, 2024

AngledLuffa commented Apr 14, 2024

AngledLuffa commented Apr 17, 2024

Brentably commented Apr 17, 2024

AngledLuffa commented Apr 17, 2024

Brentably commented Apr 17, 2024

AngledLuffa commented Aug 8, 2024

Brentably commented Nov 13, 2024

Brentably commented Nov 13, 2024

Brentably commented Nov 13, 2024