Integrating spacy-huggingface-pipelines and refactoring NlpEngine logic #1159
Conversation
Hello @omri374, while running some tests I got a warning. In the end, it detected the right entities to anonymize, but I wondered whether it could lead to a "false negative" (i.e. skipping an entity that should be detected). I saw you also had the same issue here. I did not understand the technical points of the answer, so I wanted to check whether you found a solution. Many thanks in advance! PS: please tell me if I should post here or on the Discussions tab.
@LSD-98 thanks for raising this. This is caused by a mismatch between wordpiece tokenization (in transformers) and the spaCy tokenizer. The alignment mode is set to "expand" because wordpiece tokens are usually a subset of a spaCy token, but there could be cases where this results in a false negative or false positive (most likely a false positive, as the wordpiece token would be expanded to cover more than the PII itself). I don't see an immediate workaround for this, so I guess we'd have to live with it. Users who verify there are no alignment errors could still add a transformers recognizer as an independent recognizer rather than as part of the NlpEngine mechanism.
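To make the alignment issue above concrete, here is a minimal, self-contained sketch (plain Python, simplified character spans; `expand_to_token` is a hypothetical helper, not presidio or spaCy API) of how an "expand" alignment widens a wordpiece-level prediction to the boundaries of the containing spaCy-style token, which is exactly how a false positive can appear:

```python
# Sketch: "expand" alignment widens a sub-token prediction to whole tokens.

def expand_to_token(pred_start, pred_end, token_spans):
    """Expand a character span to the smallest covering token boundaries."""
    start, end = pred_start, pred_end
    for tok_start, tok_end in token_spans:
        # If the prediction overlaps this token, widen to the token's edges.
        if tok_start < pred_end and tok_end > pred_start:
            start = min(start, tok_start)
            end = max(end, tok_end)
    return start, end

# spaCy-style token spans for the text "Call Dr.Smith today"
text = "Call Dr.Smith today"
tokens = [(0, 4), (5, 13), (14, 19)]   # "Call", "Dr.Smith", "today"

# Suppose the transformers model flagged only the wordpiece "Smith" (chars 8-13):
print(expand_to_token(8, 13, tokens))  # -> (5, 13), i.e. all of "Dr.Smith"
```

The expanded span covers more text than the model's prediction, which is harmless when the extra characters are part of the same PII, but yields a false positive when they are not.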
Overall looks great! Thank you for providing the swimlane flow diagram.
Just a few points to address before approval.
Change Description
This PR improves the handling of transformers models using the `NlpEngine` flow. Note that there are two ways to introduce NER models into presidio: the `NlpEngine` and a standalone recognizer. See more info on #1083.

Main changes:
- `NlpEngine` config (either `SpacyNlpEngine` or `TransformersNlpEngine`). In the future, we could also add `FlairNlpEngine` or other NER packages.
- A new `NerModelConfiguration` dataclass which holds the user configuration. Configuration can be set in a conf file, with most of the configuration options coming from here: https://github.com/explosion/spacy-huggingface-pipelines#token-classification
- `SpacyRecognizer` now only gets the entities out of `NlpArtifacts` and returns them.
- `spacy-huggingface-pipelines` is used to improve the handling of transformer models inside spaCy pipelines.

Flow before:
Flow after:
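For illustration, a transformers `NlpEngine` conf file could look roughly like the sketch below. The model names and option values here are assumptions for the example (drawing on the spacy-huggingface-pipelines options linked above), not guaranteed defaults:

```yaml
nlp_engine_name: transformers
models:
  - lang_code: en
    model_name:
      spacy: en_core_web_sm                # spaCy pipeline for tokenization
      transformers: obi/deid_roberta_i2b2  # example NER model, an assumption
ner_model_configuration:
  aggregation_strategy: simple   # option from spacy-huggingface-pipelines
  alignment_mode: expand         # wordpiece-to-token alignment, see above
  labels_to_ignore:
    - O
```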
Note to reviewer: A PR with updates to docs can be found here #1177
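The simplified `SpacyRecognizer` role described in the changes above can be sketched in plain Python. The classes below are hypothetical stand-ins (not presidio's actual `NlpArtifacts` or `SpacyRecognizer` APIs), meant only to show the division of labor: the NLP engine detects entities once, and the recognizer merely filters and returns them:

```python
from dataclasses import dataclass, field

@dataclass
class NlpArtifacts:
    """Stand-in for presidio's NlpArtifacts: holds pre-computed NER results."""
    entities: list = field(default_factory=list)  # (entity_type, start, end)

@dataclass
class SimpleSpacyRecognizer:
    """Stand-in recognizer: no model of its own, just reads the artifacts."""
    supported_entities: tuple = ("PERSON", "LOCATION")

    def analyze(self, text, nlp_artifacts):
        # Filter the entities the NLP engine already detected.
        return [e for e in nlp_artifacts.entities
                if e[0] in self.supported_entities]

artifacts = NlpArtifacts(entities=[("PERSON", 0, 5), ("ORG", 10, 15)])
rec = SimpleSpacyRecognizer()
print(rec.analyze("dummy text", artifacts))  # -> [("PERSON", 0, 5)]
```

Keeping model inference in the `NlpEngine` and out of the recognizer is what makes it possible to swap `SpacyNlpEngine` for `TransformersNlpEngine` without touching recognizer code.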
Issue reference
This PR fixes issue #1083
Checklist