Transformer model may be ignoring general entity types #1463
Hi @michhar, when running the transformer model directly, I also never get locations. It does output So to get this working, I updated the
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

# Transformer model config
tf_model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",
            "transformers": "StanfordAIMI/stanford-deidentifier-base",
        },
    }
]
# Entity mappings
mapping = dict(
    PER="PERSON",
    LOC="LOCATION",
    ORG="ORGANIZATION",
    AGE="AGE",
    ID="ID",
    EMAIL="EMAIL",
    DATE="DATE_TIME",
    PHONE="PHONE_NUMBER",
    PERSON="PERSON",
    LOCATION="LOCATION",
    GPE="LOCATION",
    ORGANIZATION="ORGANIZATION",
    NORP="NRP",
    PATIENT="PERSON",
    STAFF="PERSON",
    HOSP="LOCATION",
    PATORG="ORGANIZATION",
    TIME="DATE_TIME",
    HCW="PERSON",
    HOSPITAL="LOCATION",
    FACILITY="LOCATION",
    VENDOR="ORGANIZATION",
)
tf_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    alignment_mode="expand",  # "strict", "contract", "expand"
    aggregation_strategy="max",  # "simple", "first", "average", "max"
    labels_to_ignore=["O"],
)
tf_engine = TransformersNlpEngine(
    models=tf_model_config,
    ner_model_configuration=tf_model_configuration,
)
# Transformer-based analyzer
analyzer_tf = AnalyzerEngine(
    nlp_engine=tf_engine,
    supported_languages=["en"],
)
text = "Jasmine Rivers, with social security number 987-65-4321, can be reached at 555-123-4567. Her driver's license number is 1234567890. She lives at 1234 Maple Street, Anytown, USA. Jasmine works for Stellar Solutions, located at 5678 Oak Avenue, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year."
res = analyzer_tf.analyze(text, language="en")
from presidio_anonymizer import AnonymizerEngine
anon = AnonymizerEngine()
anon.anonymize(text, res)
This returns:
text: <PERSON>, with social security number <ID>, can be reached at <PHONE_NUMBER>. Her driver's license number is <ID>. She lives at <LOCATION>. <PERSON> works for <ORGANIZATION>, located at <LOCATION>, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year.
items:
[
{'start': 186, 'end': 196, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
{'start': 159, 'end': 173, 'entity_type': 'ORGANIZATION', 'text': '<ORGANIZATION>', 'operator': 'replace'},
{'start': 140, 'end': 148, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'},
{'start': 128, 'end': 138, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
{'start': 109, 'end': 113, 'entity_type': 'ID', 'text': '<ID>', 'operator': 'replace'},
{'start': 62, 'end': 76, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'},
{'start': 38, 'end': 42, 'entity_type': 'ID', 'text': '<ID>', 'operator': 'replace'},
{'start': 0, 'end': 8, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]
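A note on reading the `items` list: the `start`/`end` offsets index into the anonymized text, not the original input. The check below is only a sketch that copies the first sentence of the output above and three of its items (all three fall within that sentence), then slices the string at each span.

```python
# First sentence of the anonymized text shown above, copied verbatim.
anonymized = (
    "<PERSON>, with social security number <ID>, "
    "can be reached at <PHONE_NUMBER>."
)

# The three items from the output whose spans fall inside that sentence.
items = [
    {"start": 62, "end": 76, "entity_type": "PHONE_NUMBER", "text": "<PHONE_NUMBER>"},
    {"start": 38, "end": 42, "entity_type": "ID", "text": "<ID>"},
    {"start": 0, "end": 8, "entity_type": "PERSON", "text": "<PERSON>"},
]

# Slice the anonymized string at each reported span.
spans = [anonymized[item["start"]:item["end"]] for item in items]

# Each slice is exactly the placeholder recorded in the item.
for item, span in zip(items, spans):
    print(item["entity_type"], repr(span), span == item["text"])
```

So to recover where an entity sat in the original text, use the analyzer results (`res`) rather than these items.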
It could also be somewhat related to this change, which hasn't been released yet: #1454
@omri374 Thank you for your quick response! How do I go about getting the info I need to make the mapping better? Is there a good way to list all the entities supported by a model so I can make sure I map them correctly? To improve the approach (avoiding the FPs from the spaCy model pipeline), could I exclude the spaCy model altogether in my transformer-based analyzer?
The spaCy part of the pipeline is not used for NER, but for all the other NLP tasks (tokenization, lemmatization, etc.). We use Since presidio comes with spaCy's
More on this can be found here: https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/
See also #1472
We could further consider adding a scan of the model's config to extract the id2label and show a warning in case there's a class that's neither mapped nor ignored, or allow the user to just use the classes as is (create an automated mapping e.g.
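The scan proposed here could look roughly like the sketch below. It is not Presidio's implementation: `entity_classes`, `find_unmapped`, and the example label table are hypothetical names and data made up for illustration. It assumes the model's labels may carry single-letter BIO-style prefixes such as `B-` and `I-`. For a real model, the `id2label` dict can be read from the Hugging Face config, as noted in the comment.

```python
import logging

logger = logging.getLogger(__name__)


def entity_classes(id2label):
    """Return the set of entity classes, stripping one-letter prefixes like B-/I-."""
    classes = set()
    for label in id2label.values():
        prefix, sep, rest = label.partition("-")
        # "B-HOSPITAL" -> "HOSPITAL"; labels without such a prefix are kept as is.
        classes.add(rest if sep and len(prefix) == 1 else label)
    return classes


def find_unmapped(id2label, mapping, labels_to_ignore=("O",)):
    """List model classes that are neither mapped nor ignored, logging a warning."""
    unmapped = sorted(entity_classes(id2label) - set(mapping) - set(labels_to_ignore))
    if unmapped:
        logger.warning("Model classes neither mapped nor ignored: %s", unmapped)
    return unmapped


# Hypothetical label table, for illustration only. For a real model it could be
# obtained with: transformers.AutoConfig.from_pretrained(model_name).id2label
example_id2label = {
    0: "O",
    1: "B-PATIENT",
    2: "I-PATIENT",
    3: "B-HOSPITAL",
    4: "I-HOSPITAL",
    5: "B-VENDOR",
}
example_mapping = {"PATIENT": "PERSON", "HOSP": "LOCATION"}

missing = find_unmapped(example_id2label, example_mapping)
print(missing)
```

With this example data, `HOSPITAL` and `VENDOR` are reported because the mapping only knows `HOSP`. That is the kind of mismatch that would silently drop locations.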
Amazing project, thank you so much for presidio and all of the work here.
I am noticing with a few different Hugging Face transformer models that some of the listed entities associated with the engine are not being picked up, even with mapping.
I am using the following on macOS:
For example, here is my code to set up the engine:
When I list the entities for the transformer engine with this code:
I get
Entities for transformer engine: ['DATE_TIME', 'NRP', 'EMAIL', 'PHONE_NUMBER', 'ORGANIZATION', 'AGE', 'LOCATION', 'PERSON', 'ID']
Then, I do the following to call the engine/analyzer+anonymizer:
And even though I have many instances of a typical US address, I never get LOCATION as one of my entities. For example (btw, this is completely made up / synthetic text and does not reflect any real people or businesses!):
One example output from above (including the default spaCy as well since I did that, but did not show setup above, just adding for completeness):
Some synthetic/made-up data:
Any thoughts on how to debug this? Thanks again!!