Integrating spacy-huggingface-pipelines and refactoring NlpEngine logic #1159
Conversation
Hello @omri374, while running some tests I got a warning. In the end, it detected the right entities to anonymize, but I wondered whether it could lead to a "false negative" (i.e. skipping an entity that should be detected). I saw you also had the same issue here. I did not understand the technical points of the answer, so I wanted to check whether you found a solution. Many thanks in advance! PS: please tell me if I should post here or on the Discussions tab.
@LSD-98 thanks for raising this. This is caused by a mismatch between wordpiece tokenization (in transformers) and the spaCy tokenizer. The alignment mode is set to "expand" because wordpiece tokens are usually a subset of a spaCy token, but there could be cases where this results in a false negative or false positive (most likely a false positive, as the wordpiece token would be expanded to cover more than the PII itself). I don't see an immediate workaround for this, so I guess we'd have to live with it. Users who verify there are no alignment errors could still add a transformers recognizer as an independent recognizer rather than as part of the NlpEngine mechanism.
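To make the alignment issue above concrete, here is a minimal, self-contained sketch (plain Python, simplified character spans; `expand_to_token` is a hypothetical helper, not presidio or spaCy API) of how an "expand" alignment widens a wordpiece-level prediction to the boundaries of the containing spaCy-style token, which is exactly how a false positive can appear:

```python
# Sketch: "expand" alignment widens a sub-token prediction to whole tokens.

def expand_to_token(pred_start, pred_end, token_spans):
    """Expand a character span to the smallest covering token boundaries."""
    start, end = pred_start, pred_end
    for tok_start, tok_end in token_spans:
        # If the prediction overlaps this token, widen to the token's edges.
        if tok_start < pred_end and tok_end > pred_start:
            start = min(start, tok_start)
            end = max(end, tok_end)
    return start, end

# spaCy-style token spans for the text "Call Dr.Smith today"
text = "Call Dr.Smith today"
tokens = [(0, 4), (5, 13), (14, 19)]   # "Call", "Dr.Smith", "today"

# Suppose the transformers model flagged only the wordpiece "Smith" (chars 8-13):
print(expand_to_token(8, 13, tokens))  # -> (5, 13), i.e. all of "Dr.Smith"
```

The expanded span covers more text than the model's prediction, which is harmless when the extra characters are part of the same PII, but yields a false positive when they are not.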
Overall looks great! Thank you for providing the swimlane flow diagram.
Just a few points to address before approval.
Change Description
This PR improves the handling of transformers models using the `NlpEngine` flow. Note that there are two ways to introduce NER models into presidio: the `NlpEngine` and a standalone recognizer. See more info on #1083.

Main changes:
- `NlpEngine` config (either `SpacyNlpEngine` or `TransformersNlpEngine`). In the future, we could also add `FlairNlpEngine` or other NER packages.
- A new `NerModelConfiguration` dataclass which holds the user configuration. Configuration can be set in a conf file, with most of the configuration options coming from here: https://github.com/explosion/spacy-huggingface-pipelines#token-classification
- `SpacyRecognizer` now only gets the entities out of `NlpArtifacts` and returns them.
- `spacy-huggingface-pipelines` is used to improve the handling of transformer models inside spaCy pipelines.

Flow before:
Flow after:
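For illustration, a transformers `NlpEngine` conf file could look roughly like the sketch below. The model names and option values here are assumptions for the example (drawing on the spacy-huggingface-pipelines options linked above), not guaranteed defaults:

```yaml
nlp_engine_name: transformers
models:
  - lang_code: en
    model_name:
      spacy: en_core_web_sm                # spaCy pipeline for tokenization
      transformers: obi/deid_roberta_i2b2  # example NER model, an assumption
ner_model_configuration:
  aggregation_strategy: simple   # option from spacy-huggingface-pipelines
  alignment_mode: expand         # wordpiece-to-token alignment, see above
  labels_to_ignore:
    - O
```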
Note to reviewer: A PR with updates to docs can be found here #1177
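The simplified `SpacyRecognizer` role described in the changes above can be sketched in plain Python. The classes below are hypothetical stand-ins (not presidio's actual `NlpArtifacts` or `SpacyRecognizer` APIs), meant only to show the division of labor: the NLP engine detects entities once, and the recognizer merely filters and returns them:

```python
from dataclasses import dataclass, field

@dataclass
class NlpArtifacts:
    """Stand-in for presidio's NlpArtifacts: holds pre-computed NER results."""
    entities: list = field(default_factory=list)  # (entity_type, start, end)

@dataclass
class SimpleSpacyRecognizer:
    """Stand-in recognizer: no model of its own, just reads the artifacts."""
    supported_entities: tuple = ("PERSON", "LOCATION")

    def analyze(self, text, nlp_artifacts):
        # Filter the entities the NLP engine already detected.
        return [e for e in nlp_artifacts.entities
                if e[0] in self.supported_entities]

artifacts = NlpArtifacts(entities=[("PERSON", 0, 5), ("ORG", 10, 15)])
rec = SimpleSpacyRecognizer()
print(rec.analyze("dummy text", artifacts))  # -> [("PERSON", 0, 5)]
```

Keeping model inference in the `NlpEngine` and out of the recognizer is what makes it possible to swap `SpacyNlpEngine` for `TransformersNlpEngine` without touching recognizer code.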
Issue reference
This PR fixes issue #1083
Checklist