Lemmatisation of "did" is no longer "do" in the context of n't #13098

chrisjbryant · 2023-11-02T06:34:59Z

I just updated from spaCy 3.6 to 3.7 and "did" is no longer lemmatised as "do" when it is followed by "n't".

import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I didn't go to the bank at the weekend."
for tok in nlp(sent):
    print(tok.text, tok.lemma_)

Output:

I I
did did      # lemma should be "do"
n't n't
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

Your Environment

spaCy version: 3.7.2
Platform: Linux-6.2.0-1017-gcp-x86_64-with-glibc2.35
Python version: 3.10.12
Pipelines: en_core_web_sm (3.7.0)

Kepsondiaz · 2023-11-02T21:21:43Z

Hello @chrisjbryant,
I think the error comes from the code you provided.
You should process the text with the nlp function before the for loop, like this:

doc = nlp(sent)

Kepsondiaz · 2023-11-02T21:22:47Z

Your code should look like this:

import spacy

nlp = spacy.load("en_core_web_sm")
sent = "I didn't go to the bank at the weekend."
doc = nlp(sent)

for tok in doc:
    print(tok.text, tok.lemma_)

Kepsondiaz · 2023-11-02T21:23:00Z

Output

Kepsondiaz · 2023-11-02T21:24:28Z

I -PRON-
did do
n't not
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

chrisjbryant · 2023-11-03T03:42:31Z

My bad, that was a typo in my example code. I have updated it now. The error is still present in spacy.

rmitsch · 2023-11-03T08:30:18Z

Hi @chrisjbryant, can you set up a fresh virtual environment and run this script there? I can't reproduce this. If I run

import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
sent = "I didn't go to the bank at the weekend."
for tok in nlp(sent):
    print(tok.text, tok.lemma_)

my output is:

3.7.2
I I
did do
n't not
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

chrisjbryant · 2023-11-03T08:38:27Z

I was previously using a conda env, but just tried with venv and same result. :/

Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> print(spacy.__version__)
3.7.2
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "I didn't go to the bank at the weekend."
>>> for tok in nlp(sent):
...     print(tok.text, tok.lemma_)
... 
I I
did did
n't n't
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .
>>>

And just for sanity checking, here is what happens when I remove the negation (i.e. it's correct).

>>> sent = "I did go to the bank at the weekend."
>>> for tok in nlp(sent):
...     print(tok.text, tok.lemma_)
... 
I I
did do
go go
to to
the the
bank bank
at at
the the
weekend weekend
. .

Update Part 3: It seems the same thing happens with hadn't, so not sure what's going on.

>>> sent = "I hadn't been to the bank for ages."
>>> for tok in nlp(sent):
...     print(tok.text, tok.lemma_)
... 
I I
had had
n't n't
been be
to to
the the
bank bank
for for
ages age
. .

rmitsch · 2023-11-03T08:51:57Z

Huh, after noticing I ran this with Python 3.9 I set up a clean venv with Python 3.10 - there I can reproduce this. We'll look into it.

adrianeboyd · 2023-11-03T09:03:41Z

I can confirm that this is a bug. A few of the attribute ruler patterns were switched from LOWER to NORM in v3.7.x (to generalize a bit better overall), but "did" is already normalized to "do" by the tokenizer, and then the "did n't" pattern no longer matches as intended.

You can add the old rule back to the current en_core_* models as a workaround:

nlp.get_pipe("attribute_ruler").add(
    patterns=[[{"TAG": "VBD", "LOWER": "did"}]],
    attrs={"POS": "AUX", "LEMMA": "do"},
    index=0,
)
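
To make the LOWER vs. NORM mismatch described above concrete, here is a minimal toy sketch in plain Python. It does not use spaCy and is not how the attribute ruler is implemented. `ToyToken`, `token_matches` and `apply_rule` are hypothetical names, and the NORM-based rule is an assumed analogue of the changed pattern, not the actual v3.7 pattern. The only facts taken from the thread are that the tokenizer sets the norm of "did" to "do" in "didn't", and the shape of the LOWER-based workaround rule.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    """Toy stand-in for a token; not spaCy's Token class."""
    text: str
    tag: str
    norm: str   # normalized form, as a tokenizer exception might set it
    lemma: str

    @property
    def lower(self) -> str:
        return self.text.lower()


def token_matches(token: ToyToken, spec: dict) -> bool:
    """Return True if every attribute in the spec equals the token's value."""
    values = {"TAG": token.tag, "LOWER": token.lower, "NORM": token.norm}
    return all(values[key] == wanted for key, wanted in spec.items())


def apply_rule(tokens, pattern, attrs, index=0):
    """Set the lemma on the token at `index` of every window matching `pattern`."""
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(token_matches(t, s) for t, s in zip(window, pattern)):
            if "LEMMA" in attrs:
                window[index].lemma = attrs["LEMMA"]


def make_didnt():
    # "didn't" split into two tokens; the norm of "did" is already "do".
    return [
        ToyToken("did", "VBD", norm="do", lemma="did"),
        ToyToken("n't", "RB", norm="not", lemma="n't"),
    ]


lower_rule = [{"TAG": "VBD", "LOWER": "did"}]  # shape of the workaround rule
norm_rule = [{"TAG": "VBD", "NORM": "did"}]    # assumed NORM-based analogue

tokens_lower = make_didnt()
apply_rule(tokens_lower, lower_rule, {"LEMMA": "do"})

tokens_norm = make_didnt()
apply_rule(tokens_norm, norm_rule, {"LEMMA": "do"})

print(tokens_lower[0].lemma)  # do
print(tokens_norm[0].lemma)   # did
```

In this toy model the NORM-based rule never fires, because the norm it is compared against is "do" rather than "did", so the lemma is left unchanged. The LOWER-based rule compares against the surface form and still matches.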

chrisjbryant · 2023-11-03T10:21:08Z

Just to make sure, don't forget to check other cases too, since I now also recreated it with hadn't!

adrianeboyd · 2023-11-03T10:55:07Z

Yeah, we'll review all the changes. (The underlying idea was to use NORM to match more variants of contractions with apostrophes.)
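
As a rough illustration of why matching on NORM can cover more apostrophe variants than matching on LOWER, here is a small plain-Python sketch. The `TOY_NORMS` table and the helper names are assumptions made up for this example. They are not spaCy's actual normalization data or API.

```python
# Toy normalization table: an assumption for illustration, not spaCy's data.
TOY_NORMS = {
    "n't": "not",       # straight apostrophe
    "n\u2019t": "not",  # typographic (curly) apostrophe
}


def toy_norm(text: str) -> str:
    """Look up a normalized form, falling back to the lowercased text."""
    lowered = text.lower()
    return TOY_NORMS.get(lowered, lowered)


def matches(text: str, spec: dict) -> bool:
    """Check a single-token spec against the LOWER and NORM of `text`."""
    values = {"LOWER": text.lower(), "NORM": toy_norm(text)}
    return all(values[key] == wanted for key, wanted in spec.items())


variants = ["n't", "n\u2019t", "N'T"]

# A LOWER pattern only sees the surface form, so the curly variant is missed.
lower_hits = [v for v in variants if matches(v, {"LOWER": "n't"})]

# A NORM pattern matches every variant that normalizes to "not".
norm_hits = [v for v in variants if matches(v, {"NORM": "not"})]

print(len(lower_hits), len(norm_hits))  # 2 3
```

The trade-off is the one reported in this issue: a NORM pattern only helps if it is written against the normalized value, not the surface form.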

adrianeboyd · 2023-11-06T14:32:49Z

Thanks again for the report, we've updated this in the internal attribute ruler data and added a test to spacy-models, so let's hope this is fixed in the next round of pipeline releases (probably pipeline version v3.8.0).

github-actions · 2023-11-14T00:05:45Z

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions · 2023-12-15T00:02:19Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

chrisjbryant changed the title ~~Lemmatisation of "did" is no longer "do" in the context of n't'~~ Lemmatisation of "did" is no longer "do" in the context of n't Nov 2, 2023

rmitsch added the feat / lemmatizer Feature: Rule-based and lookup lemmatization label Nov 2, 2023

chrisjbryant closed this as completed Nov 3, 2023

chrisjbryant reopened this Nov 3, 2023

rmitsch added bug Bugs and behaviour differing from documentation lang / en English language data and models labels Nov 3, 2023

adrianeboyd added the resolved The issue was addressed / answered label Nov 6, 2023

DavidRamosSal mentioned this issue Nov 13, 2023

Some tokenizer exceptions not being applied #13126

Closed

github-actions bot closed this as completed Nov 14, 2023

github-actions bot removed the resolved The issue was addressed / answered label Nov 14, 2023

github-actions bot locked as resolved and limited conversation to collaborators Dec 15, 2023
