Stanza resolves wrong text for tokens in a multi-word token #1371
Comments
Ultimately we need a major upgrade to the tokenizer, including the MWT
splitting mechanism. There is a copy mechanism which tries to use the
original text, but clearly it misses some cases. If you collect a few of
these weird outliers, we'll rebuild the training data using them, and it
should improve the results.
…On Tue, Mar 19, 2024 at 3:58 PM Kelsey Hannan ***@***.***> wrote:
*Describe the bug*
In response to the changes of multi-word tokens in #1361, I encountered an
error with how Stanza generates tokens for words with apostrophes,
particularly contractions.
*To Reproduce*
Steps to reproduce the behavior:
1. Run the sentence:
The schoolmaster's wife started a sewing class.
2. Check the Universal Dependencies output. The tokens for
schoolmaster's reveal the incorrect base word schoolmaterr:
// MWT token is correct:
{"end_char"=>18, "id"=>[2, 3], "start_char"=>4, "text"=>"schoolmaster's"},
// non-MWT text resolves incorrectly to "schoolmaterr"
{"deprel"=>"nmod:poss",
"feats"=>"Number=Sing",
"head"=>4,
"id"=>2,
"lemma"=>"schoolmaterr",
"text"=>"schoolmaterr",
"upos"=>"NOUN",
"xpos"=>"NN"},
// correct
{"deprel"=>"case", "head"=>2, "id"=>3, "lemma"=>"'s", "text"=>"'s", "upos"=>"PART", "xpos"=>"POS"},
*Expected behavior*
The non-MWT part of "schoolmaster's" resolves the tokens as schoolmaster
/ 's
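The expectation above can be checked mechanically. The following is a minimal sketch, not part of Stanza: `find_mwt_mismatches` is a hypothetical helper that takes a sentence as a list of dicts in the shape shown in this report (an MWT entry has a two-element `id` range, a word entry has an integer `id`) and reports any MWT whose words do not concatenate back to the MWT text. It assumes a language such as English where the words of an MWT are expected to be an exact split of the original text.

```python
def find_mwt_mismatches(sentence):
    """Return (mwt_text, joined_words) pairs where the words of an MWT
    do not concatenate back to the MWT's own text."""
    # Word entries carry an integer id; MWT entries carry a [first, last] range.
    words_by_id = {e["id"]: e["text"] for e in sentence if isinstance(e["id"], int)}
    mismatches = []
    for entry in sentence:
        if isinstance(entry["id"], int):
            continue
        first, last = entry["id"]
        joined = "".join(words_by_id.get(i, "") for i in range(first, last + 1))
        if joined != entry["text"]:
            mismatches.append((entry["text"], joined))
    return mismatches


# The entries from this report, reduced to the fields the check needs.
sentence = [
    {"id": [2, 3], "text": "schoolmaster's", "start_char": 4, "end_char": 18},
    {"id": 2, "text": "schoolmaterr"},
    {"id": 3, "text": "'s"},
]
print(find_mwt_mismatches(sentence))
```

Running a check like this over a large corpus would be one way to collect the outliers requested above.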
*Environment (please complete the following information):*
- OS: Mac OS Ventura
- Python version: Python 3.12.2 using Poetry 1.8.2
- Stanza version: Stanza from the dev branch up to commit b62c1e7
*Additional context*
I believe we found more errors like this, I will report them when I come
across them.
@AngledLuffa Yes, it appears the splitting mechanism is actually quite broken. We run Stanza against gigantic amounts of text, and we had no problem retaining the correct text values with versions of Stanza before the introduction of MWT tokens. In our latest runs we encountered thousands of errors involving them. Here is a sample of sentences where the token resolves to the wrong capitalization or munges the text of the original word (sometimes inexplicably) for words with an apostrophe, contraction, or possessive:
E.g. digging into the types of failures being seen, by showing the token split for the
Sadly, I think we might have to revert to a version of Stanza without MWT-based tokens, as they don't appear to be stable enough for us to rely on. :(
This is probably fixable in one of a couple of different ways. The English datasets are generally built so that each MWT is composed exactly of its subwords, so there's no reason to do anything other than split the original text. Alternatively, there may be some generalization of that to other languages, where the model first checks whether it's supposed to use an exact split of the words, and then uses the seq2seq model only for words that aren't an exact split.
Yeah, that would be very helpful. The apostrophe splitting used to be very stable for the English model before the addition of MWT tokens.
@AngledLuffa I should also note that this bug is occurring with the constituency parse as well. We get
Thought about it some. Realized I was probably overthinking, and the easiest thing to do was to continue using the current model and make it replace the predicted characters with the original text if the lengths of the predicted pieces add up to the length of the original text. Only for languages where this property holds, of course.
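The length-based replacement described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the code in the Stanza branch: `align_to_original` is a hypothetical name, and the function simply slices the original text whenever the predicted pieces have the same total length, leaving the prediction untouched otherwise.

```python
def align_to_original(original, predicted_words):
    """Replace predicted pieces with slices of the original text when the
    total length matches; otherwise return the prediction unchanged."""
    if sum(len(w) for w in predicted_words) != len(original):
        # Not an exact split (e.g. a true expansion), so keep the model output.
        return list(predicted_words)
    aligned = []
    start = 0
    for word in predicted_words:
        # Keep the predicted split points, but copy the original characters.
        aligned.append(original[start:start + len(word)])
        start += len(word)
    return aligned


print(align_to_original("schoolmaster's", ["schoolmaterr", "'s"]))
print(align_to_original("du", ["de", "le"]))
```

In the first call the hallucinated characters are overwritten with the original ones, because the split points still line up. In the second call the lengths differ, so the prediction passes through unchanged, which is why the trick only applies where MWTs are exact splits of the surface text.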
The constituency parser issue is just because the parser, as part of the pipeline, is using the MWT as input, so it's getting the weird splits just like the other models.
I should note that although this helps with the OOV characters, there is another issue: words at the start of a sentence are split with the dictionary lookup and then lowercased. I don't think that's correct behavior for the model, and I can probably fix that a lot faster than it took to fix this.
I think the use of lowercasing in the MWT is just a straight up logic bug: stanza/stanza/models/mwt/trainer.py Line 99 in c2d72bd
stanza/stanza/models/mwt/trainer.py Line 112 in c2d72bd
My own expectation with tokenization is that it doesn't change the characters used unnecessarily, but consider:
Maybe a reasonable fix would be to only use the dictionary lookup for words that are all lowercase, leading uppercase, or all uppercase, and weird mixed cases just have to go through the seq2seq model.
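A rough sketch of that idea, assuming a dictionary keyed on lowercased words: the names `casing_of` and `lookup_with_casing` and the `expansions` dictionary are hypothetical and are not taken from `trainer.py`. A return value of `None` stands for "fall back to the seq2seq model".

```python
def casing_of(word):
    """Classify a word as 'lower', 'upper', 'title', or None for mixed case."""
    if word == word.lower():
        return "lower"
    if word == word.upper():
        return "upper"
    if word[0] == word[0].upper() and word[1:] == word[1:].lower():
        return "title"
    return None


def lookup_with_casing(word, expansions):
    """Look up a lowercased word and restore the original casing pattern.

    Returns None when the word is not in the dictionary or has a mixed
    casing pattern, meaning the caller should use the seq2seq model."""
    casing = casing_of(word)
    pieces = expansions.get(word.lower())
    if casing is None or pieces is None:
        return None
    if casing == "upper":
        return [p.upper() for p in pieces]
    if casing == "title":
        # Only the first character of the first piece was uppercase.
        return [pieces[0][:1].upper() + pieces[0][1:], *pieces[1:]]
    return list(pieces)


expansions = {"she's": ("she", "'s")}  # hypothetical dictionary entry
print(lookup_with_casing("She's", expansions))
print(lookup_with_casing("SHE'S", expansions))
print(lookup_with_casing("sHe's", expansions))
```

With this approach "She's" comes back as "She" and "'s" rather than "she" and "'s", and the mixed-case "sHe's" is left for the model to handle.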
…original word matches one of a couple expected casing formats, in which case we can recreate those formats after using the dictionary lookup. Otherwise, you get unexpected tokenizations such as She's -> she 's. #1371
With these changes, here's what I get in the linked branch (which I'll merge into
This looks much better to me, but I will say there's an annoying 99.98% accuracy on the test set, whereas I would have expected 100% now that the casing is fixed and the model is forced to copy the input whenever possible. Hopefully it's just a random word which doesn't show up in the training data and isn't being correctly split, rather than one of the hallucinations or re-casings we're trying to fix.
Eh, well, the non-100% appears to be entirely typos which were annotated in the test set and not handled in the expected manner by the model. Situations where the tokenizer isn't actually splitting an MWT, but if you force the MWT processor to make a decision, it doesn't really know how to process "Mens room" instead of "Men's room" etc
Thanks for looking into this @AngledLuffa!! I will update my stanza to the latest on the dev branch to pick up these changes.
Just to be clear, it's not merged yet - but probably soon, since it is passing the current unit tests and is a lot better on the test cases you gave us. Especially if you report back saying you like this branch more than the current release :)
Just tested out the latest changes, no problems on my end!
This is now part of the 1.8.2 release.