
Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding end_char and start_char fields #1361

Open
khannan-livefront opened this issue Mar 5, 2024 · 10 comments
khannan-livefront commented Mar 5, 2024

Describe the bug
I'm updating Stanza from 1.6.1 to 1.7.x / 1.8.x and noticed a number of breaking API changes in the Stanza Token result when handling possessives.

To Reproduce

  1. Send Stanza a sentence containing a possessive apostrophe, such as Joe's dog.
  2. Look at the Universal Dependencies.

Stanza now includes a new additional token that I'll call an "aggregate token", with the text field Joe's. This new aggregate token comes in addition to the tokens for Joe and 's, and its id is a list referencing those two child tokens:

// the new aggregate token appearing for each possessive apostrophe
    {
      "end_char": 5,
      "id": [
        1,
        2
      ],
      "start_char": 0,
      "text": "Joe's"
    },
 // child tokens missing start_char and end_char fields
    {
      "deprel": "nmod:poss",
      "feats": "Number=Sing",
      "head": 3,
      "id": 1,
      "lemma": "Joe",
      "text": "Joe",
      "upos": "PROPN",
      "xpos": "NNP"
    },
    {
      "deprel": "case",
      "head": 1,
      "id": 2,
      "lemma": "'s",
      "text": "'s",
      "upos": "PART",
      "xpos": "POS"
    },
    // normal tokens
    {
      "deprel": "root",
      "end_char": 9,
      "feats": "Number=Sing",
      "head": 0,
      "id": 3,
      "lemma": "dog",
      "start_char": 6,
      "text": "dog",
      "upos": "NOUN",
      "xpos": "NN"
    },
    {
      "deprel": "punct",
      "end_char": 10,
      "head": 3,
      "id": 4,
      "lemma": ".",
      "start_char": 9,
      "text": ".",
      "upos": "PUNCT",
      "xpos": "."
    }

This breaks the one-to-one mapping that used to exist between tokens and word elements within the s-expression returned by the constituency tree:

(ROOT (NP (NP (NNP Joe) (POS 's)) (NN dog) (. .)))

But more problematically, this new aggregate token is now the only token containing the end_char and start_char data about the word.

In addition to being a breaking change, this new approach is quite hard for application developers to work with. To parse the output, they must chase down the id links of the aggregate token whenever it appears in order to map its linguistic data. Moreover, important character information is lost: there is no longer any record of where the boundary between a word and its apostrophe falls.
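To make the burden concrete, here is a minimal sketch of the bookkeeping this forces on consumers, written against the JSON shape shown above (the helper name and structure are hypothetical, not any Stanza API):

# Hypothetical helper, not part of the Stanza API: given the list of
# token dicts for one sentence, pair each aggregate token with the
# child tokens its id list points at.
def resolve_aggregate_tokens(tokens):
    by_id = {tok["id"]: tok for tok in tokens if not isinstance(tok["id"], list)}
    for tok in tokens:
        if isinstance(tok["id"], list):
            # Chase the id links to find the children carrying the
            # linguistic data (deprel, lemma, upos, ...).
            children = [by_id[i] for i in tok["id"]]
            # Even after resolving, the children carry no start_char or
            # end_char, so the word/apostrophe boundary stays unknown.
            yield tok, children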

Expected behavior
For a possessive like Joe's dog., Stanza returns four dependency tokens, as it did in 1.6.1:

  {
    "deprel": "nmod:poss",
    "end_char": 3,
    "feats": "Number=Sing",
    "head": 3,
    "id": 1,
    "lemma": "Joe",
    "start_char": 0,
    "text": "Joe",
    "upos": "PROPN",
    "xpos": "NNP"
  },
  {
    "deprel": "case",
    "end_char": 5,
    "head": 1,
    "id": 2,
    "lemma": "'s",
    "start_char": 3,
    "text": "'s",
    "upos": "PART",
    "xpos": "POS"
  },
  {
    "deprel": "root",
    "end_char": 9,
    "feats": "Number=Sing",
    "head": 0,
    "id": 3,
    "lemma": "dog",
    "start_char": 6,
    "text": "dog",
    "upos": "NOUN",
    "xpos": "NN"
  },
  {
    "deprel": "punct",
    "end_char": 10,
    "head": 3,
    "id": 4,
    "lemma": ".",
    "start_char": 9,
    "text": ".",
    "upos": "PUNCT",
    "xpos": "."
  }

Alternatively, if a fifth aggregate token with an array of ids continues to be returned, the non-aggregate child tokens should at least retain their own start_char and end_char information as before. That would allow developers to ignore the aggregate tokens entirely while still preserving the character boundaries between tokens.

Environment (please complete the following information):

  • OS: macOS Ventura 13.4
  • Python version: Python 3.12.2 using Poetry 1.8.2
  • Stanza version: 1.6.1 moving to 1.7.x / 1.8.x
@AngledLuffa
Collaborator

As you have surmised, this was an intentional breaking change. English was actually handled differently from almost every other language in which multiple syntactic "words" are written as a single "token". In general these are labeled as MWTs (multi-word tokens), as in Spanish, where the direct and indirect object pronouns can be attached to certain forms of verbs. In English, there are a few classes of words which fit that category:

possessives
contractions: can't, won't, ...
contractions which don't even have ': gonna, wanna, cannot, ...

So the first thing you can do is run your processing on the words of a sentence instead of its tokens, such as

import stanza

pipe = stanza.Pipeline("en", processors="tokenize,mwt")
doc = pipe("This change is gonna annoy people")
doc.sentences[0].words[3]

{
  "id": 4,
  "text": "gon"
}

If you're using the JSON output format, the MWTs are always marked by an id containing more than one value:

>>> doc.sentences[0].tokens[3]
[
  {
    "id": [
      4,
      5
    ],
    "text": "gonna",
    "start_char": 15,
    "end_char": 20
  },
  {
    "id": 4,
    "text": "gon"
  },
  {
    "id": 5,
    "text": "na"
  }
]

As you point out, this is missing the character positions on the words. This is because, in some languages, the tokenization standard is to rewrite the word pieces to match the actual word, so we'd have going to instead of gon na as the text of those words. Still, I can see how it would be useful to put start_char and end_char on the words if the word pieces happen to add up to the MWT, so I can make that a TODO. In the English datasets, the standard is to split the original text into pieces which correspond to the actual text rather than rewriting.
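In the meantime, here's a sketch of the reconstruction that becomes possible when the pieces do happen to add up (a hypothetical helper, not a current Stanza API):

# Hypothetical helper: derive per-word offsets from a token when the
# concatenated word texts exactly equal the token's text.
def word_offsets(token):
    if "".join(word.text for word in token.words) != token.text:
        return  # pieces were rewritten (e.g. "going to"), offsets undefined
    pos = token.start_char
    for word in token.words:
        yield word, pos, pos + len(word.text)
        pos += len(word.text)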

There's another really annoying issue, which is that the NER training data can be either MWTs or words (generally words), whereas for some reason the NER processor uses the MWTs instead of the words. As a result, it doesn't always correctly label possessives. I should mark that as another TODO.

If there are other items which would make this more compatible with your previous workflow, please let us know

@AngledLuffa
Collaborator

AngledLuffa commented Mar 6, 2024

  • NER might need to use Words instead of Tokens
  • keep the start and end chars if the Words add up to the original Tokens

@khannan-livefront
Author

Thank you for your prompt response @AngledLuffa! Adding start_char and end_char back to the original tokens would be very helpful for us. This would allow us to skip the processing of MWT tokens.

Since you asked, our biggest architectural dependency on Stanza is that we rely on a one-to-one mapping between Universal Dependencies tokens and leaf nodes of the constituency parse. The two inputs are mapped as linked objects in our system. For example, currently the word can't maps to two tokens:

can
not

and in the constituency tree, to two leaf nodes:

(MD ca)
(RB n't)

and this one-to-one relationship is linked in our system as objects: a given token can fetch its constituent node, and vice versa. So if "can't" becomes one MWT token, our system would break unless the constituency tree also maps "can't" as a single leaf node.

Thankfully it looks like I can still maintain this relationship by skipping over the MWT tokens, as the original child tokens still have this one-to-one mapping in Stanza 1.8.1. Returning the start_char and end_char fields to those tokens would be all that's needed to give us a smooth upgrade path.
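Here's a rough sketch of how we keep that alignment by pairing words with constituency leaves (the leaf traversal assumes the tree objects expose children and label attributes, so treat it as illustrative rather than an official API):

import stanza

pipe = stanza.Pipeline("en", processors="tokenize,mwt,pos,constituency")
doc = pipe("Joe can't swim.")
sentence = doc.sentences[0]

def leaves(tree):
    # Illustrative traversal, assuming `children` and `label` attributes.
    if not tree.children:
        yield tree
    else:
        for child in tree.children:
            yield from leaves(child)

# Words (not the aggregate MWT tokens) line up one-to-one with the
# constituency leaves, so skipping the MWT tokens preserves the mapping.
for word, leaf in zip(sentence.words, leaves(sentence.constituency)):
    assert word.text == leaf.label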

Thanks again for the prompt response!

AngledLuffa added a commit that referenced this issue Mar 11, 2024
…start_char and end_char on it.

Note that there will still be no start_char and end_char annotations
on words if the words don't add up to the token's text, so even in a
language like English where the standard is to annotate the datasets
so that they correspond to the pieces of the real text instead of the
word being represented, there may be unusual separations in the MWT
processor that result in no start/end char

Fix a unit test error

#1361
@AngledLuffa
Collaborator

Alright, I separated the Word start & end chars for situations where the pieces add up to the surrounding Token (the Token, again, being the MWT representation). I should emphasize that there may be cases in English where the pieces don't actually add up to the full Token, in which case there won't be a start & end char. If'n you come across those and it isn't properly tokenizing them, we can take a look. The change is currently in the dev branch.
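Something like the following against a dev-branch build should show the restored offsets; exact output will of course depend on the build:

import stanza

pipe = stanza.Pipeline("en", processors="tokenize,mwt")
doc = pipe("Joe's dog.")
for word in doc.sentences[0].words:
    # start_char and end_char should now be populated whenever the word
    # pieces add back up to the surrounding token's text.
    print(word.id, word.text, word.start_char, word.end_char)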

AngledLuffa added a commit that referenced this issue Mar 11, 2024
…start_char and end_char on it. #1361
@khannan-livefront
Author

@AngledLuffa Just built a version of Stanza from the dev branch with the latest changes, and I can see that the start_char and end_char are back. Our API integration is working again when I add a small check to exclude MWT tokens. Thank you so much @AngledLuffa!! ✨ 🤩

@khannan-livefront
Author

@AngledLuffa Do you know when the next release of Stanza will be, so I can leapfrog to that one?

@AngledLuffa
Collaborator

Depends on if any show-stopping bugs show up, I suppose. Probably a couple months if nothing critical comes up

AngledLuffa reopened this Mar 14, 2024
@AngledLuffa
Collaborator

I'd actually prefer to leave this open until I figure out what to do with the NER tags, btw

@AngledLuffa
Collaborator

This is now part of the 1.8.2 release

@khannan-livefront
Author

Thanks @AngledLuffa, we have now migrated to Stanza 1.8.2!

AngledLuffa reopened this Jun 22, 2024
Jemoka pushed a commit that referenced this issue Jul 16, 2024
…start_char and end_char on it. #1361