Lemmatisation of "did" is no longer "do" in the context of n't #13098

Closed
chrisjbryant opened this issue Nov 2, 2023 · 14 comments
Labels
bug Bugs and behaviour differing from documentation feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / en English language data and models

Comments

@chrisjbryant

chrisjbryant commented Nov 2, 2023

I just updated from spacy 3.6 to 3.7 and "did" is no longer getting lemmatised as "do" when followed by n't.

import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I didn't go to the bank at the weekend."
for tok in nlp(sent):
    print(tok.text, tok.lemma_)

Output:

I I
did did      # lemma should be "do"
n't n't
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

Your Environment

  • spaCy version: 3.7.2
  • Platform: Linux-6.2.0-1017-gcp-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Pipelines: en_core_web_sm (3.7.0)
chrisjbryant changed the title (removing a stray trailing apostrophe) Nov 2, 2023
rmitsch added the feat / lemmatizer label Nov 2, 2023
@Kepsondiaz

Hello @chrisjbryant,
I think the error comes from the code you provided.
You should process the text with the nlp function before the for loop, like this:

doc = nlp(sent)

@Kepsondiaz

Your code should look like this:

import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I didn't go to the bank at the weekend."
doc = nlp(sent)

for tok in doc:
    print(tok.text, tok.lemma_)

@Kepsondiaz

Output

@Kepsondiaz

I -PRON-
did do
n't not
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

@chrisjbryant
Author

My bad, that was a typo in my example code. I have updated it now. The error is still present in spacy.

@rmitsch
Contributor

rmitsch commented Nov 3, 2023

Hi @chrisjbryant, can you set up a fresh virtual environment and run this script there? I can't reproduce this. If I run

import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
sent = "I didn't go to the bank at the weekend."
for tok in nlp(sent):
    print(tok.text, tok.lemma_)

my output is:

3.7.2
I I
did do
n't not
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

@chrisjbryant
Author

chrisjbryant commented Nov 3, 2023

I was previously using a conda env, but just tried with venv and same result. :/

Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> print(spacy.__version__)
3.7.2
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "I didn't go to the bank at the weekend."
>>> for tok in nlp(sent):
...     print(tok.text, tok.lemma_)
... 
I I
did did
n't n't
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .
>>> 

And just for sanity checking, here is what happens when I remove the negation (i.e. it's correct).

>>> sent = "I did go to the bank at the weekend."
>>> for tok in nlp(sent):
...     print(tok.text, tok.lemma_)
... 
I I
did do
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

Update Part 3: It seems the same thing happens with hadn't, so not sure what's going on.

>>> sent = "I hadn't been to the bank for ages."
>>> for tok in nlp(sent):
...     print(tok.text, tok.lemma_)
... 
I I
had had
n't n't
been be
to to
the the
bank bank
for for
ages age
. .

@rmitsch
Contributor

rmitsch commented Nov 3, 2023

Huh, after noticing I ran this with Python 3.9 I set up a clean venv with Python 3.10 - there I can reproduce this. We'll look into it.
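
Since the result here depended on the environment, it can help to compare installed versions side by side before comparing lemmatizer output. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the distribution name `en_core_web_sm` is an assumption about how the pipeline package is registered in your environment.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(dist_names=("spacy", "en_core_web_sm")):
    """Collect the Python version plus the versions of the named distributions."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in dist_names:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value or 'not installed'}")
```

Running `python -m spacy info` or inspecting `nlp.meta["version"]` on a loaded pipeline gives similar information from spaCy itself.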

rmitsch added the bug and lang / en labels Nov 3, 2023
@adrianeboyd
Contributor

I can confirm that this is a bug. A few of the attribute ruler patterns were switched from LOWER to NORM in v3.7.x (to generalize a bit better overall), but "did" is already normalized to "do" by the tokenizer, and then the "did n't" pattern no longer matches as intended.

You can add the old rule back to the current en_core_* models as a workaround:

nlp.get_pipe("attribute_ruler").add(
    patterns=[[{"TAG": "VBD", "LOWER": "did"}]],
    attrs={"POS": "AUX", "LEMMA": "do"},
    index=0,
)
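
The LOWER-versus-NORM mismatch described above can be illustrated with a toy model. This is a sketch only: it is not spaCy's matcher, the `TAG` values and the NORM-based rule are assumptions made for illustration (the actual v3.7 pattern is not shown in this thread), and only the `did` to `do` normalization comes from the explanation above.

```python
# Toy model of single-token pattern matching; this is NOT spaCy's Matcher.
# LOWER keeps the surface form, while NORM for "did" is already "do"
# after tokenization, as described in the comment above.
TOKENS = [
    {"ORTH": "I", "LOWER": "i", "NORM": "i", "TAG": "PRP"},
    {"ORTH": "did", "LOWER": "did", "NORM": "do", "TAG": "VBD"},
    {"ORTH": "n't", "LOWER": "n't", "NORM": "not", "TAG": "RB"},
]


def matches(token, pattern):
    """A token matches when every attribute in the pattern has the same value."""
    return all(token.get(attr) == value for attr, value in pattern.items())


def find_matches(tokens, pattern):
    """Return the surface forms of all tokens that match the pattern."""
    return [tok["ORTH"] for tok in tokens if matches(tok, pattern)]


lower_rule = {"TAG": "VBD", "LOWER": "did"}  # same attributes as the workaround
norm_rule = {"TAG": "VBD", "NORM": "did"}    # hypothetical NORM-based variant

print(find_matches(TOKENS, lower_rule))  # ['did']
print(find_matches(TOKENS, norm_rule))   # []
```

The NORM-based rule never fires because no token has the norm "did", which is why restoring a LOWER-based rule brings the "do" lemma back.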

@chrisjbryant
Author

Just to make sure, don't forget to check other cases too, since I now also recreated it with hadn't!

@adrianeboyd
Contributor

Yeah, we'll review all the changes. (The underlying idea was to use NORM to match more variants of contractions with apostrophes.)

@adrianeboyd
Contributor

Thanks again for the report, we've updated this in the internal attribute ruler data and added a test to spacy-models, so let's hope this is fixed in the next round of pipeline releases (probably pipeline version v3.8.0).
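
Until a fixed pipeline is released, a small local check can flag this kind of regression. The sketch below is not the test added to spacy-models; `EXPECTED_LEMMAS` and `lemma_mismatches` are hypothetical names, and the sample pairs are copied from the outputs reported in this thread.

```python
# Expected lemmas for the contracted forms discussed in this thread.
EXPECTED_LEMMAS = {"did": "do", "had": "have", "n't": "not"}


def lemma_mismatches(pairs, expected):
    """Return (text, got, wanted) for each token whose lemma differs from expected."""
    problems = []
    for text, lemma in pairs:
        wanted = expected.get(text.lower())
        if wanted is not None and lemma != wanted:
            problems.append((text, lemma, wanted))
    return problems


# (text, lemma) pairs as reported above for "I didn't go ...".
broken = [("I", "I"), ("did", "did"), ("n't", "n't"), ("go", "go")]
working = [("I", "I"), ("did", "do"), ("n't", "not"), ("go", "go")]

print(lemma_mismatches(broken, EXPECTED_LEMMAS))
print(lemma_mismatches(working, EXPECTED_LEMMAS))
```

With a loaded pipeline, the pairs could be built as `[(tok.text, tok.lemma_) for tok in nlp(sent)]`.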

adrianeboyd added the resolved (The issue was addressed / answered) label Nov 6, 2023
Contributor

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions bot removed the resolved (The issue was addressed / answered) label Nov 14, 2023
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators Dec 15, 2023