Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding `end_char` and `start_char` fields #1361

khannan-livefront · 2024-03-05T22:54:18Z

Describe the bug
I'm updating Stanza from 1.6.1 to 1.7.x / 1.8.x and noticed a number of breaking API changes in the Stanza Token result when handling possessives.

To Reproduce

Send to stanza a sentence containing a possessive apostrophe like Joe's dog..
Look at the Universal Dependencies.

Stanza now includes a new additional token that I'll call an "aggregate token" with the text field Joe's. This new aggregate token comes in addition to the tokens for Joe and 's. The new aggregate token returns an id with a list to the other two tokens:

// the new aggregate token appearing for each possessive apostrophe
    {
      "end_char": 5,
      "id": [
        1,
        2
      ],
      "start_char": 0,
      "text": "Joe's"
    },
 // child tokens missing start_char and end_char fields
    {
      "deprel": "nmod:poss",
      "feats": "Number=Sing",
      "head": 3,
      "id": 1,
      "lemma": "Joe",
      "text": "Joe",
      "upos": "PROPN",
      "xpos": "NNP"
    },
    {
      "deprel": "case",
      "head": 1,
      "id": 2,
      "lemma": "'s",
      "text": "'s",
      "upos": "PART",
      "xpos": "POS"
    },
    // normal tokens
    {
      "deprel": "root",
      "end_char": 9,
      "feats": "Number=Sing",
      "head": 0,
      "id": 3,
      "lemma": "dog",
      "start_char": 6,
      "text": "dog",
      "upos": "NOUN",
      "xpos": "NN"
    },
    {
      "deprel": "punct",
      "end_char": 10,
      "head": 3,
      "id": 4,
      "lemma": ".",
      "start_char": 9,
      "text": ".",
      "upos": "PUNCT",
      "xpos": "."
    }

This breaks the one-to-one mapping that used to exist between tokens and word elements within the s-expression returned by the constituency tree:

(ROOT (NP (NP (NNP Joe) (POS 's)) (NN dog) (. .)))

But more problematically, this new aggregate token is now the only token containing the end_char and start_char data about the word.

In addition to being a breaking change, this new approach is quite hard for application developers to work with. To parse it they need to chase down the ID links of the aggregate token when it intermittently appears to map its linguistic data. Moreover, important character information about where the character delineation between a word and its apostrophe is lost.

Expected behavior
For a possessive like Joe's dog., Stanza returns four dependency tokens as before in Stanza 1.6.1:

  {
    "deprel": "nmod:poss",
    "end_char": 3,
    "feats": "Number=Sing",
    "head": 3,
    "id": 1,
    "lemma": "Joe",
    "start_char": 0,
    "text": "Joe",
    "upos": "PROPN",
    "xpos": "NNP"
  },
  {
    "deprel": "case",
    "end_char": 5,
    "head": 1,
    "id": 2,
    "lemma": "'s",
    "start_char": 3,
    "text": "'s",
    "upos": "PART",
    "xpos": "POS"
  },
  {
    "deprel": "root",
    "end_char": 9,
    "feats": "Number=Sing",
    "head": 0,
    "id": 3,
    "lemma": "dog",
    "start_char": 6,
    "text": "dog",
    "upos": "NOUN",
    "xpos": "NN"
  },
  {
    "deprel": "punct",
    "end_char": 10,
    "head": 3,
    "id": 4,
    "lemma": ".",
    "start_char": 9,
    "text": ".",
    "upos": "PUNCT",
    "xpos": "."
  }

Or if a fifth aggregate token with an array of id's continues to be returned, the non-aggregate child tokens at least retain their own end_char and start_char information as before. This would at least allow developers to ignore these aggregate tokens, and preserve information about the character delineation between each token.

Environment (please complete the following information):

OS: MacOS Ventura 13.4
Python version: Python 3.12.2 using Poetry 1.8.2
Stanza version: 1.6.1 moving to 1.7.x / 1.8.x

The text was updated successfully, but these errors were encountered:

AngledLuffa · 2024-03-06T02:47:46Z

As you have surmised, this was an intentional breaking change. English was actually handled differently from almost every other language where multiple syntactic "words" are written as a single "token". In general they are labeled as MWT (multi-word tokens), such as in Spanish, where the direct and indirect object pronouns can be attached to certain forms of verbs. In the case of English, there are a few classes of words which fit that category:

possessives
contractions: can't, won't, ...
contractions which don't even have ': gonna, wanna, cannot, ...

So the first thing you can do is do your processing on the words instead of the tokens on a sentence, such as

pipe = stanza.Pipeline("en", processors="tokenize")
doc = pipe("This change is gonna annoy people")
doc.sentences[0].words[4]

{
  "id": 4,
  "text": "gon"
}

If you're using the json output format, the MWT are always marked as having an id of more than one value

>>> doc.sentences[0].tokens[3]
[
  {
    "id": [
      4,
      5
    ],
    "text": "gonna",
    "start_char": 15,
    "end_char": 20
  },
  {
    "id": 4,
    "text": "gon"
  },
  {
    "id": 5,
    "text": "na"
  }
]

As you point out, this is missing the character positions on the words. This is because, in some languages, the tokenization standard is to rewrite the word pieces to match the actual word, so we'd have going to instead of gon na as the text of those words. Still, I can see how it would be useful to put start_char and end_char on the words if the word pieces happen to add up to the MWT, so I can make that a TODO. In the English datasets, the standard is to split the original text into pieces which correspond to the actual text rather than rewriting.

There's another really annoying issue, which is that the NER training can be either MWT or words (generally words), whereas for some reason the NER processor uses the MWT instead of the words. As a result, it doesn't always correctly label possessives. I should mark that as another TODO.

If there are other items which would make this more compatible with your previous workflow, please let us know

AngledLuffa · 2024-03-06T02:49:23Z

NER might need to use Words instead of Tokens
keep the start and end chars if the Words add up to the original Tokens

khannan-livefront · 2024-03-06T20:43:00Z

Thank you for your prompt response @AngledLuffa! Adding start_char and end_char back to the original tokens would be very helpful for us. This would allow us to skip the processing of MWT tokens.

Since you asked, our biggest architectural dependency with Stanza is that we rely on there being a one-to-one mapping of Universal Dependencies to leaf nodes of the Constituency Parse. The two inputs are mapped as linked objects in our system. E.g. so currently the word can't maps as two tokens to:

can
not

and in the constituency tree, to two leaf nodes:

(MD ca)
(RB n't)

and this one-to-one relationship is linked in our system as objects: a given token can fetch its constituent node, and vice versa. So if "can't" becomes one MWT token, our system would break unless the constituency tree also maps "can't" as a single leaf node.

Thankfully it looks I can still maintain this relationship by skipping over MWT tokens as the original tokens still have this one-to-one mapping in Stanza 1.8.1. Returning the start_char and end_char fields back to the original tokens would be all that's needed to give us a smooth upgrade path.

Thanks again for the prompt response!

…start_char and end_char on it. Note that there will still be no start_char and end_char annotations on words if the words don't add up to the token's text, so even in a language like English where the standard is to annotate the datasets so that they correspond to the pieces of the real text instead of the word being represented, there may be unusual separations in the MWT processor. #1361

…start_char and end_char on it. Note that there will still be no start_char and end_char annotations on words if the words don't add up to the token's text, so even in a language like English where the standard is to annotate the datasets so that they correspond to the pieces of the real text instead of the word being represented, there may be unusual separations in the MWT processor that result in no start/end char Fix a unit test error #1361

AngledLuffa · 2024-03-11T04:11:36Z

Alright, I separated the Word start & end chars for situations where the pieces add up to the surrounding Token (the Token, again, being the MWT representation). I should emphasize that may be cases in English where the pieces it adds up are not actually the full Token, in which case there won't be a start & end char. If'n you come across those and it isn't properly tokenizing them, we can take a look. The change is currently in the dev branch.

…start_char and end_char on it. Note that there will still be no start_char and end_char annotations on words if the words don't add up to the token's text, so even in a language like English where the standard is to annotate the datasets so that they correspond to the pieces of the real text instead of the word being represented, there may be unusual separations in the MWT processor that result in no start/end char Fix a unit test error #1361

khannan-livefront · 2024-03-14T02:27:23Z

@AngledLuffa Just built a version of Stanza from the dev branch with the latest changes, and I can see that the start_char and end_char are back. Our API integration is working again when I add a small check to exclude MWT tokens. Thank you so much @AngledLuffa!! ✨ 🤩

khannan-livefront · 2024-03-14T02:28:42Z

@AngledLuffa Do you know when the next release of Stanza will be so I can leap frog to that one?

AngledLuffa · 2024-03-14T02:59:30Z

Depends on if any show-stopping bugs show up, I suppose. Probably a couple months if nothing critical comes up

AngledLuffa · 2024-03-14T03:00:06Z

I'd actually prefer to leave this as open until I figure out what to do with NER tags, btw

AngledLuffa · 2024-04-20T18:59:15Z

This is now part of the 1.8.2 release

khannan-livefront · 2024-05-01T19:23:16Z

Thanks @AngledLuffa we have now migrated to the Stanza 1.8.2!

…start_char and end_char on it. Note that there will still be no start_char and end_char annotations on words if the words don't add up to the token's text, so even in a language like English where the standard is to annotate the datasets so that they correspond to the pieces of the real text instead of the word being represented, there may be unusual separations in the MWT processor that result in no start/end char Fix a unit test error #1361

khannan-livefront added the bug label Mar 5, 2024

AngledLuffa mentioned this issue Mar 13, 2024

WARNING: Can not find mwt: default from official model list. Ignoring it. #297

Closed

khannan-livefront closed this as completed Mar 14, 2024

AngledLuffa reopened this Mar 14, 2024

khannan-livefront mentioned this issue Mar 19, 2024

Stanza resolves wrong text for tokens in a multi-word token #1371

Closed

AngledLuffa closed this as completed Apr 20, 2024

AngledLuffa reopened this Apr 20, 2024

AngledLuffa closed this as completed Jun 22, 2024

AngledLuffa reopened this Jun 22, 2024

sujoung mentioned this issue Sep 4, 2024

Apostrophe bug for Stanza English tokenizer model #1417

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding `end_char` and `start_char` fields #1361

Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding `end_char` and `start_char` fields #1361

khannan-livefront commented Mar 5, 2024 •

edited

Loading

AngledLuffa commented Mar 6, 2024

AngledLuffa commented Mar 6, 2024 •

edited

Loading

khannan-livefront commented Mar 6, 2024

AngledLuffa commented Mar 11, 2024

khannan-livefront commented Mar 14, 2024

khannan-livefront commented Mar 14, 2024

AngledLuffa commented Mar 14, 2024

AngledLuffa commented Mar 14, 2024

AngledLuffa commented Apr 20, 2024

khannan-livefront commented May 1, 2024

Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding end_char and start_char fields #1361

Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding end_char and start_char fields #1361

Comments

khannan-livefront commented Mar 5, 2024 • edited Loading

AngledLuffa commented Mar 6, 2024

AngledLuffa commented Mar 6, 2024 • edited Loading

khannan-livefront commented Mar 6, 2024

AngledLuffa commented Mar 11, 2024

khannan-livefront commented Mar 14, 2024

khannan-livefront commented Mar 14, 2024

AngledLuffa commented Mar 14, 2024

AngledLuffa commented Mar 14, 2024

AngledLuffa commented Apr 20, 2024

khannan-livefront commented May 1, 2024

Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding `end_char` and `start_char` fields #1361

Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding `end_char` and `start_char` fields #1361

khannan-livefront commented Mar 5, 2024 •

edited

Loading

AngledLuffa commented Mar 6, 2024 •

edited

Loading