Stanza resolves wrong text for tokens in a multi-word token #1371

Closed

khannan-livefront opened this issue Mar 19, 2024 · 16 comments

khannan-livefront commented Mar 19, 2024

Describe the bug
Following the changes to multi-word tokens in #1361, I encountered an error in how Stanza generates tokens for words with apostrophes, particularly contractions.

To Reproduce
Steps to reproduce the behavior:

  1. Run the pipeline on the sentence:
The schoolmaster's wife started a sewing class.
  2. Check the Universal Dependencies output; in particular, the tokens for schoolmaster's reveal the incorrect base word schoolmaterr (a minimal repro sketch follows the JSON below):
// MWT token is correct:

    {
      "end_char": 18,
      "id": [
        2,
        3
      ],
      "start_char": 4,
      "text": "schoolmaster's"
    },

// non-MWT text resolves incorrectly to "schoolmaterr":

    {
      "deprel": "nmod:poss",
      "feats": "Number=Sing",
      "head": 4,
      "id": 2,
      "lemma": "schoolmaterr",
      "text": "schoolmaterr",
      "upos": "NOUN",
      "xpos": "NN"
    },

 // correct: 
 
    {
      "deprel": "case",
      "head": 2,
      "id": 3,
      "lemma": "'s",
      "text": "'s",
      "upos": "PART",
      "xpos": "POS"
    },
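For reference, a minimal repro sketch (the processor list here is an assumption, not necessarily the reporter's exact setup):

    import stanza

    # build a pipeline with the processors involved in the bug (assumed set)
    nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma,depparse")
    doc = nlp("The schoolmaster's wife started a sewing class.")
    for word in doc.sentences[0].words:
        # with the bug, word 2 prints "schoolmaterr" instead of "schoolmaster"
        print(word.id, word.text, word.lemma)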
   

Expected behavior
The non-MWT parts of "schoolmaster's" resolve to the tokens schoolmaster / 's.

Environment (please complete the following information):

  • OS: macOS Ventura
  • Python version: Python 3.12.2 using Poetry 1.8.2
  • Stanza version: Stanza from the dev branch up to commit b62c1e7

Additional context
I believe we found more errors like this; I will report them as I come across them.


AngledLuffa commented Mar 19, 2024 via email


khannan-livefront commented Mar 20, 2024

@AngledLuffa Yes, it appears the splitting mechanism is actually quite broken. We run Stanza against gigantic amounts of text, and we had no problem retaining the correct text values with versions of Stanza before the introduction of MWT tokens. In our latest runs we encountered thousands of errors involving them. Here is a sample of sentences where the tokens resolve with the wrong capitalization, or the text of the original word is munged (sometimes inexplicably), for words with an apostrophe / contraction / possessive:

In God’s name let him be so.
Didn’t I say so?
They began to be frightened at last at Pulcheria Alexandrovna's strange silence.
Pulcheria Alexandrovna's illness was a strange nervous one.
Couldn't he stop and retract it all?
She'll be my nurse.
It was quite an accident Lebeziatnikov's turning up.
Wasn't I right in saying that we were birds of a feather?
Couldn't he come?
She'll get it at the shop, my dear.
Who married the Marquis of Saint-Méran's daughter?
But to Dantès' eye there was no darkness.
Couldn’t he have waited for the good season to increase his chances?
She'd never noticed if it hadn't been for Sid.
Wasn't that a happy thought?

E.g., digging into the types of failures being seen by showing the token split for each text value (a sketch for reproducing these splits follows the examples):

In God’s name let him be so. -> God's / god / 's (loss of capitalization)

Didn't I say so. -> Didn't / did / not (loss of capitalization)

They began to be frightened at last at Pulcheria Alexandrovna's strange silence. -> Alexandrovna's / Alexandronan / 's (flat out wrong)

Couldn't he stop and retract it all? -> Couldn't / could / n't (loss of capitalization)

It was quite an accident Lebeziatnikov's turning up. -> Lebeziatnikov's / Lebeziatikovv / 's (flat out wrong)

Who married the Marquis of Saint-Méran's daughter? -> Méran's / Mran / 's (clear bug, seeing the insertion of a <UNK> tag)

But to Dantès' eye there was no darkness. -> Dantès' / Dants / ' (clear bug here, and seeing the insertion of a <UNK> tag)
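(A minimal sketch of how splits like the above can be printed; `nlp` is assumed to be a stanza.Pipeline with at least the tokenize and mwt processors:)

    # print each multi-word token alongside the words it was split into
    for sentence in nlp("Didn't I say so?").sentences:
        for token in sentence.tokens:
            if len(token.words) > 1:  # only inspect multi-word tokens
                print(token.text, "->", " / ".join(w.text for w in token.words))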


Sadly, I think we might have to revert to a version of Stanza without MWT-based tokens, as they don't appear to be stable enough to rely on for our purposes. :(

AngledLuffa (Collaborator) commented:

This is probably fixable in one of a couple different ways. The English datasets are generally built so that the MWT are exactly composed of the subwords, so there's no reason to do anything other than split the original text. Alternatively, there may be some generalization of that to other languages where the model first checks if it's supposed to use an exact split of the words, and then it uses the seq2seq model only for words that aren't an exact split.

khannan-livefront (Author) commented:

Yeah, that would be very helpful. The apostrophe splitting used to be very stable for the English model before the addition of MWT tokens.

khannan-livefront (Author) commented:

@AngledLuffa I should also note that this bug occurs with the constituency parse as well: we get schoolmaterr and 's as the leaf nodes for schoolmaster's.

AngledLuffa (Collaborator) commented:

Thought about it some. I realized I was probably overthinking it; the easiest thing to do is continue using the current model and have it replace the predicted characters with the original text whenever the lengths of the predicted pieces add up to the length of the original text. Only for languages where this property holds, of course.
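A minimal sketch of that length-check idea (hypothetical helper, not the actual stanza code):

    def splice_mwt(original, predicted_parts):
        """If the predicted sub-words are exactly as long as the original
        token, slice the original text instead of trusting the seq2seq
        characters, so the surface text is never altered."""
        if sum(len(p) for p in predicted_parts) == len(original):
            parts, start = [], 0
            for p in predicted_parts:
                parts.append(original[start:start + len(p)])
                start += len(p)
            return parts
        # lengths don't add up: fall back to the raw seq2seq prediction
        return predicted_parts

    # splice_mwt("schoolmaster's", ["schoolmaterr", "'s"])
    # -> ["schoolmaster", "'s"]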

@qipeng

AngledLuffa (Collaborator) commented:

The english_mwt branch should now have a fix for most of these issues, although I wouldn't be surprised if the new model still occasionally hallucinates text which isn't the right length (in which case the new splitting mechanism won't work). Please LMK if this helps, and feel free to report whichever words still aren't split correctly.

The constituency parser issue is just because the parser as part of the pipeline is using the MWT as input, so it's getting the weird splits just like the other models.

AngledLuffa (Collaborator) commented:

I should note that although this helps with the OOV characters, there is another issue with words at the start of a sentence being split with the dictionary lookup and then lowercased... I don't think that's correct behavior of the model, and I can probably fix that a lot faster than it took to fix this.

AngledLuffa (Collaborator) commented:

I think the use of lowercasing in the MWT is just a straight up logic bug:

    elif w.lower() in self.expansion_dict:

    elif c.lower() in self.expansion_dict:
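For context, a toy sketch (not the actual stanza source) of why an unconditional lowercased lookup loses casing:

    expansion_dict = {"she's": ["she", "'s"]}  # toy dictionary

    def expand(w):
        # bug pattern: the key is lowercased, and the lowercased expansion
        # is returned verbatim, so "SHE'S" comes back as ["she", "'s"]
        if w.lower() in expansion_dict:
            return expansion_dict[w.lower()]
        return [w]

    print(expand("SHE'S"))  # ['she', "'s"] -- original casing lost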

My own expectation with tokenization is that it doesn't change the characters used unnecessarily, but consider:

>>> [x.text for x in pipe("JENNIFER HAS NICE ANTENNAE").sentences[0].words]
['JENNIFER', 'HAS', 'NICE', 'ANTENNAE']
>>> [x.text for x in pipe("JENNIFER'S GOT NICE ANTENNAE").sentences[0].words]
['JENNIFER', "'S", 'GOT', 'NICE', 'ANTENNAE']
>>> [x.text for x in pipe("SHE'S GOT NICE ANTENNAE").sentences[0].words]   # oops, this shows up in the dictionary
['she', "'s", 'GOT', 'NICE', 'ANTENNAE']

Maybe a reasonable fix would be to only do the dictionary lookup for all-lowercase, leading-uppercase, and all-uppercase words; weird mixed cases would just have to go through the seq2seq model. (A sketch of this idea follows.)
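A sketch of that proposal with hypothetical names (the actual fix, per the commit below, recreates the casing after the dictionary lookup):

    def cased_lookup(w, expansion_dict):
        """Only use the dictionary for casing patterns we can restore;
        return None for anything else so it goes through seq2seq."""
        expansion = expansion_dict.get(w.lower())
        if expansion is None:
            return None
        if w.islower():
            return expansion
        if w.isupper():
            return [part.upper() for part in expansion]
        if w[:1].isupper() and w[1:].islower():
            # leading uppercase: restore the capital on the first piece
            return [expansion[0].capitalize()] + expansion[1:]
        return None  # weird mixed case: defer to the seq2seq model

    # cased_lookup("She's", {"she's": ["she", "'s"]}) -> ["She", "'s"]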

AngledLuffa added a commit that referenced this issue Apr 3, 2024
…original word matches one of a couple expected casing formats, in which case we can recreate those formats after using the dictionary lookup. Otherwise, you get unexpected tokenizations such as She's -> she 's. #1371
AngledLuffa (Collaborator) commented:

With these changes, here's what I get in the linked branch (which I'll merge into dev once I start hearing back from stakeholders in this issue) for the text you mentioned above:

>>> text = """
... In God’s name let him be so.
... Didn’t I say so?
... They began to be frightened at last at Pulcheria Alexandrovna's strange silence.
... Pulcheria Alexandrovna's illness was a strange nervous one.
... Couldn't he stop and retract it all?
... She'll be my nurse.
... It was quite an accident Lebeziatnikov's turning up.
... Wasn't I right in saying that we were birds of a feather?
... Couldn't he come?
... She'll get it at the shop, my dear.
... Who married the Marquis of Saint-Méran's daughter?
... But to Dantès' eye there was no darkness.
... Couldn’t he have waited for the good season to increase his chances?
... She'd never noticed if it hadn't been for Sid.
... Wasn't that a happy thought?
... """
>>> text = text.strip().split("\n")
>>> for line in text:
...   print([x.text for x in pipe(line).sentences[0].words])


['In', 'God', '’s', 'name', 'let', 'him', 'be', 'so', '.']
['Did', 'n’t', 'I', 'say', 'so', '?']
['They', 'began', 'to', 'be', 'frightened', 'at', 'last', 'at', 'Pulcheria', 'Alexandrovna', "'s", 'strange', 'silence', '.']
['Pulcheria', 'Alexandrovna', "'s", 'illness', 'was', 'a', 'strange', 'nervous', 'one', '.']
['Could', "n't", 'he', 'stop', 'and', 'retract', 'it', 'all', '?']
['She', "'ll", 'be', 'my', 'nurse', '.']
['It', 'was', 'quite', 'an', 'accident', 'Lebeziatnikov', "'s", 'turning', 'up', '.']
['Was', "n't", 'I', 'right', 'in', 'saying', 'that', 'we', 'were', 'birds', 'of', 'a', 'feather', '?']
['Could', "n't", 'he', 'come', '?']
['She', "'ll", 'get', 'it', 'at', 'the', 'shop', ',', 'my', 'dear', '.']
['Who', 'married', 'the', 'Marquis', 'of', 'Saint', '-', 'Méran', "'s", 'daughter', '?']
['But', 'to', 'Dantès', "'", 'eye', 'there', 'was', 'no', 'darkness', '.']
['Could', 'n’t', 'he', 'have', 'waited', 'for', 'the', 'good', 'season', 'to', 'increase', 'his', 'chances', '?']
['She', "'d", 'never', 'noticed', 'if', 'it', 'had', "n't", 'been', 'for', 'Sid', '.']
['Was', "n't", 'that', 'a', 'happy', 'thought', '?']

AngledLuffa (Collaborator) commented:

This looks much better to me, but I will say there's an annoying 99.98% accuracy on the test set, whereas I would have expected 100% now that the casing is fixed and the model is forced to copy the input whenever possible. Hopefully it's just a random word which doesn't show up in the training data and isn't being correctly split, rather than one of the hallucinations or re-casings we're trying to fix.

AngledLuffa (Collaborator) commented:

Eh, well, the non-100% appears to be entirely typos which were annotated in the test set and not handled in the expected manner by the model: situations where the tokenizer isn't actually splitting an MWT, but if you force the MWT processor to make a decision, it doesn't really know how to process "Mens room" instead of "Men's room", etc.

11944,11945c11943,11944
< 9     Cox
< 10    '
---
> 9     Co
> 10    x'
15578,15579c15577,15578
< 16    sheep
< 17    s
---
> 16    shee
> 17    ps
26469,26470c26467,26468
< 1     Men
< 2     s
---
> 1     Me
> 2     ns

khannan-livefront (Author) commented:

Thanks for looking into this, @AngledLuffa!! I will update my Stanza to the latest on the dev branch to pick up these changes.

AngledLuffa (Collaborator) commented:

Just to be clear, it's not merged yet, but probably soon, since it is passing the current unit tests and is a lot better on the test cases you gave us. Especially if you report back saying you like this branch more than the current release :)

AngledLuffa added a commit that referenced this issue Apr 4, 2024
…original word matches one of a couple expected casing formats, in which case we can recreate those formats after using the dictionary lookup. Otherwise, you get unexpected tokenizations such as She's -> she 's. #1371
khannan-livefront (Author) commented:

Just tested out the latest changes, no problems on my end!

AngledLuffa (Collaborator) commented:

This is now part of the 1.8.2 release.

Jemoka pushed a commit that referenced this issue Jul 16, 2024
…original word matches one of a couple expected casing formats, in which case we can recreate those formats after using the dictionary lookup. Otherwise, you get unexpected tokenizations such as She's -> she 's. #1371