Stanza resolves wrong text for tokens in a multi-word token #1371

khannan-livefront · 2024-03-19T22:57:51Z

Describe the bug
In response to the changes of multi-word tokens in #1361, I encountered an error with how Stanza generates tokens for words with apostrophes, particularly contractions.

To Reproduce
Steps to reproduce the behavior:

Run the sentence:

The schoolmaster's wife started a sewing class.

Check the Universal Dependencies output. In particular, the words for schoolmaster's reveal the incorrect base word schoolmaterr:

// MWT token is correct:

    {
      "end_char": 18,
      "id": [
        2,
        3
      ],
      "start_char": 4,
      "text": "schoolmaster's"
    },

// non-MWT text resolves incorrectly to "schoolmaterr":

    {
      "deprel": "nmod:poss",
      "feats": "Number=Sing",
      "head": 4,
      "id": 2,
      "lemma": "schoolmaterr",
      "text": "schoolmaterr",
      "upos": "NOUN",
      "xpos": "NN"
    },

// correct:

    {
      "deprel": "case",
      "head": 2,
      "id": 3,
      "lemma": "'s",
      "text": "'s",
      "upos": "PART",
      "xpos": "POS"
    },

Expected behavior
The non-MWT part of "schoolmaster's" resolves the tokens as schoolmaster / 's
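
The expectation above can be checked mechanically. The following is a minimal sketch, assuming the sentence is available as a list of dicts shaped like the output shown above: an MWT entry whose `id` is a list of word ids, plus word entries whose `id` is an int. The helper name `find_mwt_mismatches` is hypothetical and not part of Stanza, and the check only makes sense for languages such as English where the words of an MWT are expected to concatenate back to the original text.

```python
def find_mwt_mismatches(entries):
    """Return (token_text, joined_word_text) pairs where the words of a
    multi-word token do not concatenate back to the token's surface text."""
    # Index plain word entries by their integer id.
    words = {e["id"]: e["text"] for e in entries if isinstance(e["id"], int)}
    mismatches = []
    for e in entries:
        ids = e["id"]
        # An MWT entry carries the ids of the words it covers, e.g. [2, 3].
        if isinstance(ids, (list, tuple)) and len(ids) > 1:
            joined = "".join(words[i] for i in range(ids[0], ids[-1] + 1))
            if joined != e["text"]:
                mismatches.append((e["text"], joined))
    return mismatches


# Entries copied from the report above, trimmed to the id and text fields.
sentence = [
    {"id": [2, 3], "text": "schoolmaster's"},
    {"id": 2, "text": "schoolmaterr"},
    {"id": 3, "text": "'s"},
]
print(find_mwt_mismatches(sentence))
```

On the data from this report, the check flags schoolmaster's, because its words join to schoolmaterr's. Because the comparison is exact, a case-only change such as God's -> god / 's would be flagged as well.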

Environment (please complete the following information):

OS: Mac OS Ventura
Python version: Python 3.12.2 using Poetry 1.8.2
Stanza version: Stanza from the dev branch up to commit b62c1e7

Additional context
I believe we have found more errors like this. I will report them as I come across them.


AngledLuffa · 2024-03-19T23:12:22Z

Ultimately we need a major upgrade to the tokenizer, including the MWT splitting mechanism. There is a copy mechanism which tries to use the original text, but clearly it misses some cases. If you collect a few of these weird outliers, we'll rebuild the training data using them, and it should improve the results.


khannan-livefront · 2024-03-20T17:30:27Z

@AngledLuffa Yes, it appears the splitting mechanism is actually quite broken. We run Stanza against gigantic amounts of text, and had no problem retaining the correct text values with versions of Stanza before the introduction of MWT tokens. In our latest runs we encountered thousands of errors involving them. Here is a sample of sentences where the tokens resolve to the wrong capitalization or munge the text of the original word (sometimes inexplicably) for words with an apostrophe / contraction / possessive:

In God’s name let him be so.
Didn’t I say so?
They began to be frightened at last at Pulcheria Alexandrovna's strange silence.
Pulcheria Alexandrovna's illness was a strange nervous one.
Couldn't he stop and retract it all?
She'll be my nurse.
It was quite an accident Lebeziatnikov's turning up.
Wasn't I right in saying that we were birds of a feather?
Couldn't he come?
She'll get it at the shop, my dear.
Who married the Marquis of Saint-Méran's daughter?
But to Dantès' eye there was no darkness.
Couldn’t he have waited for the good season to increase his chances?
She'd never noticed if it hadn't been for Sid.
Wasn't that a happy thought?

Digging into the types of failures being seen, here is the token split for the text value in some of these sentences:

In God’s name let him be so. -> God's / god / 's (loss of capitalization)

Didn't I say so? -> Didn't / did / not (loss of capitalization)

They began to be frightened at last at Pulcheria Alexandrovna's strange silence. -> Alexandrovna's / Alexandronan / 's (flat out wrong)

Couldn't he stop and retract it all? -> Couldn't / could / n't (loss of capitalization)

It was quite an accident Lebeziatnikov's turning up. -> Lebeziatnikov's / Lebeziatikovv / 's (flat out wrong)

Who married the Marquis of Saint-Méran's daughter? -> Méran's / Mran / 's (clear bug, seeing the insertion of a <UNK> tag)

But to Dantès' eye there was no darkness. -> Dantès' / Dants / ' (clear bug here, and seeing the insertion of a <UNK> tag)

Sadly, I think we might have to revert to a version of Stanza without MWT-based tokens, as they don't appear to be stable enough for our purposes. :(

AngledLuffa · 2024-03-21T08:30:13Z

This is probably fixable in one of a couple different ways. The English datasets are generally built so that the MWT are exactly composed of the subwords, so there's no reason to do anything other than split the original text. Alternatively, there may be some generalization of that to other languages where the model first checks if it's supposed to use an exact split of the words, and then it uses the seq2seq model only for words that aren't an exact split.

khannan-livefront · 2024-03-22T20:01:57Z

Yeah, that would be very helpful. The apostrophe splitting used to be very stable for the English model before the addition of MWT tokens.

khannan-livefront · 2024-03-27T19:38:12Z

@AngledLuffa I should also note that this bug occurs with the constituency parse as well. We get schoolmaterr and 's as the leaf nodes for schoolmaster's.

AngledLuffa · 2024-04-03T06:52:39Z

Thought about it some. Realized I was probably overthinking, and the easiest thing to do was continue using the current model and make it replace the prediction characters with the original text if the length added up to the original text. Only for languages where this property happens, of course.
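
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in the Stanza branch: the function name `align_to_original` is hypothetical, and it assumes the seq2seq prediction is already available as a list of word strings.

```python
def align_to_original(original, predicted_words):
    """If the predicted pieces have the same total length as the original
    token, replace them with slices of the original text; otherwise return
    the prediction unchanged."""
    if sum(len(w) for w in predicted_words) != len(original):
        # Lengths disagree, so there is no safe way to map back to the input.
        return list(predicted_words)
    aligned = []
    start = 0
    for w in predicted_words:
        # Keep each predicted piece's length, but copy the original characters.
        aligned.append(original[start:start + len(w)])
        start += len(w)
    return aligned


print(align_to_original("schoolmaster's", ["schoolmaterr", "'s"]))
print(align_to_original("Lebeziatnikov's", ["Lebeziatikovv", "'s"]))
```

Both garbled predictions from this thread have the same length as the input, so the slices recover schoolmaster / 's and Lebeziatnikov / 's. A prediction whose total length differs from the input is left as-is in this sketch.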

@qipeng

AngledLuffa · 2024-04-03T07:54:17Z

The english_mwt branch should now have a fix for most of these issues, although I wouldn't be surprised if the new model still occasionally hallucinates text which isn't the right length (in which case the new splitting mechanism won't work). Please LMK if this helps, and feel free to report whichever words still aren't split correctly.

The constituency parser issue is just because the parser as part of the pipeline is using the MWT as input, so it's getting the weird splits just like the other models.

AngledLuffa · 2024-04-03T07:56:41Z

I should note that although this helps with the OOV characters, there is another issue with words at the start of a sentence being split with the dictionary lookup and then lowercased... I don't think that's correct behavior of the model, and I can probably fix that a lot faster than it took to fix this.

AngledLuffa · 2024-04-03T21:52:34Z

I think the use of lowercasing in the MWT is just a straight up logic bug:

stanza/stanza/models/mwt/trainer.py

Line 99 in c2d72bd

elif w.lower() in self.expansion_dict:

stanza/stanza/models/mwt/trainer.py

Line 112 in c2d72bd

elif c.lower() in self.expansion_dict:

My own expectation with tokenization is that it doesn't change the characters used unnecessarily, but consider:

>>> [x.text for x in pipe("JENNIFER HAS NICE ANTENNAE").sentences[0].words]
['JENNIFER', 'HAS', 'NICE', 'ANTENNAE']
>>> [x.text for x in pipe("JENNIFER'S GOT NICE ANTENNAE").sentences[0].words]
['JENNIFER', "'S", 'GOT', 'NICE', 'ANTENNAE']
>>> [x.text for x in pipe("SHE'S GOT NICE ANTENNAE").sentences[0].words]   # oops, this shows up in the dictionary
['she', "'s", 'GOT', 'NICE', 'ANTENNAE']

Maybe a reasonable fix would be to only look up all lowercase, leading uppercase, and all uppercase, and weird mixed cases just have to go through the seq2seq model
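
A minimal sketch of that lookup, under stated assumptions: the dictionary is keyed by lowercase surface forms and its values are lists of pieces (the real `expansion_dict` in Stanza may be shaped differently), and `lookup_with_casing` is a hypothetical name, not a Stanza function.

```python
def lookup_with_casing(word, expansion_dict):
    """Look up `word` in a lowercase-keyed expansion dictionary and re-apply
    the original casing pattern. Returns None when the word is not in the
    dictionary or its casing is not one of the three handled patterns, so
    the caller can fall back to the seq2seq model."""
    pieces = expansion_dict.get(word.lower())
    if pieces is None:
        return None
    if word == word.lower():
        # all lowercase: the dictionary entry can be used directly
        return list(pieces)
    if word == word.upper():
        # ALL UPPERCASE: uppercase every piece
        return [p.upper() for p in pieces]
    if word[0].isupper() and word[1:] == word[1:].lower():
        # Leading uppercase: capitalize only the first piece
        return [pieces[0][:1].upper() + pieces[0][1:]] + list(pieces[1:])
    # Mixed casing: leave it to the seq2seq model
    return None


# Toy dictionary for illustration only.
expansions = {"she's": ["she", "'s"]}
for w in ["she's", "She's", "SHE'S", "sHe's"]:
    print(w, lookup_with_casing(w, expansions))
```

With this toy dictionary, SHE'S comes back as SHE / 'S rather than she / 's, and the mixed-case sHe's returns None so that it can be handled elsewhere.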

…original word matches one of a couple expected casing formats, in which case we can recreate those formats after using the dictionary lookup. Otherwise, you get unexpected tokenizations such as She's -> she 's. #1371

AngledLuffa · 2024-04-03T22:26:23Z

With these changes, here's what I get in the linked branch (which I'll merge into dev once I start hearing back from stakeholders in this issue) for the text you mentioned above:

>>> text = """
... In God’s name let him be so.
... Didn’t I say so?
... They began to be frightened at last at Pulcheria Alexandrovna's strange silence.
... Pulcheria Alexandrovna's illness was a strange nervous one.
... Couldn't he stop and retract it all?
... She'll be my nurse.
... It was quite an accident Lebeziatnikov's turning up.
... Wasn't I right in saying that we were birds of a feather?
... Couldn't he come?
... She'll get it at the shop, my dear.
... Who married the Marquis of Saint-Méran's daughter?
... But to Dantès' eye there was no darkness.
... Couldn’t he have waited for the good season to increase his chances?
... She'd never noticed if it hadn't been for Sid.
... Wasn't that a happy thought?
... """
>>> text = text.strip().split("\n")
>>> for line in text:
...   print([x.text for x in pipe(line).sentences[0].words])


['In', 'God', '’s', 'name', 'let', 'him', 'be', 'so', '.']
['Did', 'n’t', 'I', 'say', 'so', '?']
['They', 'began', 'to', 'be', 'frightened', 'at', 'last', 'at', 'Pulcheria', 'Alexandrovna', "'s", 'strange', 'silence', '.']
['Pulcheria', 'Alexandrovna', "'s", 'illness', 'was', 'a', 'strange', 'nervous', 'one', '.']
['Could', "n't", 'he', 'stop', 'and', 'retract', 'it', 'all', '?']
['She', "'ll", 'be', 'my', 'nurse', '.']
['It', 'was', 'quite', 'an', 'accident', 'Lebeziatnikov', "'s", 'turning', 'up', '.']
['Was', "n't", 'I', 'right', 'in', 'saying', 'that', 'we', 'were', 'birds', 'of', 'a', 'feather', '?']
['Could', "n't", 'he', 'come', '?']
['She', "'ll", 'get', 'it', 'at', 'the', 'shop', ',', 'my', 'dear', '.']
['Who', 'married', 'the', 'Marquis', 'of', 'Saint', '-', 'Méran', "'s", 'daughter', '?']
['But', 'to', 'Dantès', "'", 'eye', 'there', 'was', 'no', 'darkness', '.']
['Could', 'n’t', 'he', 'have', 'waited', 'for', 'the', 'good', 'season', 'to', 'increase', 'his', 'chances', '?']
['She', "'d", 'never', 'noticed', 'if', 'it', 'had', "n't", 'been', 'for', 'Sid', '.']
['Was', "n't", 'that', 'a', 'happy', 'thought', '?']

AngledLuffa · 2024-04-03T22:28:27Z

This looks much better to me, but I will say there's an annoying 99.98% accuracy on the test set, whereas I would have expected 100% now that the casing is fixed and the model is forced to copy the input whenever possible. Hopefully it's just a random word which doesn't show up in the training data and isn't being correctly split, rather than one of the hallucinations or re-casings we're trying to fix.

AngledLuffa · 2024-04-03T22:46:30Z

Eh, well, the non-100% appears to be entirely typos which were annotated in the test set and not handled in the expected manner by the model. These are situations where the tokenizer isn't actually splitting an MWT, but if you force the MWT processor to make a decision, it doesn't really know how to process "Mens room" instead of "Men's room", etc.

11944,11945c11943,11944
< 9     Cox
< 10    '
---
> 9     Co
> 10    x'
15578,15579c15577,15578
< 16    sheep
< 17    s
---
> 16    shee
> 17    ps
26469,26470c26467,26468
< 1     Men
< 2     s
---
> 1     Me
> 2     ns

khannan-livefront · 2024-04-03T23:20:34Z

Thanks for looking into this @AngledLuffa!! I will update my stanza to the latest on the dev branch to pick up these changes.

AngledLuffa · 2024-04-03T23:42:56Z

Just to be clear, it's not merged yet - but probably soon, since it is passing the current unit tests and is a lot better on the test cases you gave us. Especially if you report back saying you like this branch more than the current release :)

khannan-livefront · 2024-04-12T14:55:10Z

Just tested out the latest changes, no problems on my end!

AngledLuffa · 2024-04-20T18:59:03Z

This is now part of the 1.8.2 release

khannan-livefront added the bug label Mar 19, 2024

AngledLuffa mentioned this issue Apr 3, 2024

English mwt #1378

Merged

AngledLuffa closed this as completed Apr 20, 2024
