Using Presidio with Huggingface support #1083

Closed
Matei9721 opened this issue Jun 1, 2023 · 9 comments

@Matei9721

Hi, I am currently using Presidio with spaCy and Stanza by creating an nlp_engine with NlpEngineProvider and passing it the right model in the config. I was planning to add support for Hugging Face transformer models, but I am a bit confused because there seem to be two ways of doing this:

  1. Using a TransformersRecognizer
  2. Using the TransformersNlpEngine

As far as I understand, if you use the recognizer, you apply it on top of the usual NER pipeline (e.g. spaCy), so you get results from both spaCy and the Hugging Face model. Using the TransformersNlpEngine, on the other hand, replaces the spaCy NER module in the pipeline.

The example at https://microsoft.github.io/presidio/samples/python/transformers_recognizer/ shows how to use the TransformersRecognizer with a specific configuration, given in configuration.py, where you can define the MODEL_TO_PRESIDIO_MAPPING. If you use the TransformersNlpEngine instead, how are you supposed to map model entity types to Presidio entity types, as is done for the TransformersRecognizer?

Is my understanding correct? If so, is there a way to create an AnalyzerEngine with a TransformersNlpEngine that uses the same configuration as a TransformersRecognizer?

Thanks for the help!

@Matei9721
Author

Actually, after checking the source code further, it is not clear to me how one is supposed to use the TransformersNlpEngine. What is the TransformersComponent class used for in this case?

Using the TransformersRecognizer seems easier because there are more code examples, but is it recommended over the TransformersNlpEngine?

@omri374
Contributor

omri374 commented Jun 1, 2023

Hi @Matei9721, thanks for your feedback! I can understand why this causes confusion. We initially wanted to support Huggingface the same way we support Stanza, but bumped into some issues. In the future, the plan is to integrate the new spacy-huggingface-pipelines package for a more seamless integration.

The easiest path forward, IMHO, is to use the TransformersRecognizer in parallel to the default SpacyNlpEngine. In our demo website's code, you'll find a method which does this. It uses the small spaCy model to reduce the overhead (while keeping capabilities like lemmas), and removes the SpacyRecognizer to avoid getting results from both spaCy and the transformers model. I'll paste it here too:

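from typing import Tuple

import spacy
from presidio_analyzer import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, NlpEngineProvider


def create_nlp_engine_with_transformers(
    model_path: str,
) -> Tuple[NlpEngine, RecognizerRegistry]: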
    """
    Instantiate an NlpEngine with a TransformersRecognizer and a small spaCy model.
    The TransformersRecognizer would return results from Transformers models, the spaCy model
    would return NlpArtifacts such as POS and lemmas.
    :param model_path: HuggingFace model path.
    """

    from transformers_rec import (
        STANFORD_COFIGURATION,
        BERT_DEID_CONFIGURATION,
        TransformersRecognizer,
    )

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()

    if not spacy.util.is_package("en_core_web_sm"):
        spacy.cli.download("en_core_web_sm")
    # Using a small spaCy model + a HF NER model
    transformers_recognizer = TransformersRecognizer(model_path=model_path)

    if model_path == "StanfordAIMI/stanford-deidentifier-base":
        transformers_recognizer.load_transformer(**STANFORD_COFIGURATION)
    elif model_path == "obi/deid_roberta_i2b2":
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)
    else:
        print("Warning: Model has no configuration, loading default.")
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)

    # Use small spaCy model, no need for both spacy and HF models
    # The transformers model is used here as a recognizer, not as an NlpEngine
    nlp_configuration = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }

    registry.add_recognizer(transformers_recognizer)
    registry.remove_recognizer("SpacyRecognizer")

    nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

    return nlp_engine, registry

Hope this helps. We'll work on making this easier going forward.

@Matei9721
Author

Thank you for your swift reply @omri374, that's exactly what I ended up doing! I just wanted to make sure that I am doing it in the "best" way possible and not reinventing the wheel. :) Looking forward to the spacy-huggingface-pipelines integration, as it does seem to streamline the process.

I will close the issue as my questions were answered and it's clear how to approach the task now!

@LSD-98

LSD-98 commented Aug 28, 2023

Dear @omri374 & @Matei9721 ,

Sorry to re-open this issue. The answers are really helpful.
After reviewing the demo website's code, I have the feeling that the TransformersRecognizer used there (from docs/samples/python/streamlit/transformers_rec/transformers_recognizer.py) is different from the one included in the package (the predefined recognizer at presidio-analyzer/presidio_analyzer/predefined_recognizers/transformers_recognizer.py).

Am I wrong, or can I use the TransformersRecognizer from the package's predefined recognizers in a workflow very similar to the one in the demo website's code?

Thanks in advance!

@omri374
Contributor

omri374 commented Aug 28, 2023

Hi @LSD-98, you are correct. There are essentially two flows here, and we're also about to improve the experience in the upcoming weeks, but in essence, the flows are:

  1. Use a NER model as part of the NlpEngine. This is how spaCy models are used by default. Entities are extracted during the NlpEngine phase and passed to the recognizers. The SpacyRecognizer collects those and returns a list of RecognizerResult. We extended this capability to support Huggingface/transformers models as well, which are used as part of a spaCy pipeline (see Transformers based NLP engine #887). This is where the TransformersRecognizer in the package comes into the picture. All it does is collect the entities already extracted by the model during the NlpEngine phase.
  2. In parallel, it is always possible to create new recognizers calling any model. The TransformersModel sample on the demo site and on the docs/samples follows this approach. During the call to the .analyze method, it calls the model to get the predictions. This allows the flexibility of calling 5 different models, or having models serving different languages.

In essence:
Flow 1:

sequenceDiagram
    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
    SpacyNlpEngine->>NamedEntityRecognitionModel: call spaCy NER model
    NamedEntityRecognitionModel->>SpacyNlpEngine: return PII entities
    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens etc.)
    Note over AnalyzerEngine: Call all recognizers
    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]<BR>based on entities

Flow 2:

sequenceDiagram
    Note over AnalyzerEngine: Call all recognizers, <br>including <br>MyNerModelRecognizer
    AnalyzerEngine->>MyNerModelRecognizer: call .analyze
    MyNerModelRecognizer->>transformers_model: Call transformers model
    transformers_model->>MyNerModelRecognizer: get NER/PII entities
    MyNerModelRecognizer->>AnalyzerEngine: Return List[RecognizerResult] <br>of PII entities

Where MyNerModelRecognizer is a wrapper over an NLP library, similar to the transformers example and flair example.
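
To make the difference between the two flows concrete, here is a minimal toy sketch that uses only the Python standard library. Every class and function in it (ToyNlpEngine, ArtifactCollectingRecognizer, ModelCallingRecognizer, toy_ner_model) is a hypothetical stand-in invented for illustration and is not part of Presidio. The only point it tries to show is where the model call happens: inside the NLP engine in Flow 1, inside the recognizer's analyze method in Flow 2.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# All names in this sketch are hypothetical stand-ins, not Presidio classes.
Span = Tuple[str, int, int]  # (entity_type, start, end)


def toy_ner_model(text: str) -> List[Span]:
    """Stand-in for any NER model: tags the word 'Alice' as a PERSON."""
    start = text.find("Alice")
    return [("PERSON", start, start + len("Alice"))] if start >= 0 else []


@dataclass
class ToyNlpArtifacts:
    tokens: List[str]
    entities: List[Span] = field(default_factory=list)


class ToyNlpEngine:
    """Flow 1: the NER model runs once, inside the NLP engine."""

    def __init__(self, ner_model: Callable[[str], List[Span]]):
        self.ner_model = ner_model

    def process_text(self, text: str) -> ToyNlpArtifacts:
        return ToyNlpArtifacts(tokens=text.split(), entities=self.ner_model(text))


class ArtifactCollectingRecognizer:
    """Flow 1: only reads the entities the engine already extracted."""

    def analyze(self, text: str, artifacts: ToyNlpArtifacts) -> List[Span]:
        return list(artifacts.entities)


class ModelCallingRecognizer:
    """Flow 2: owns a model and calls it during analyze()."""

    def __init__(self, ner_model: Callable[[str], List[Span]]):
        self.ner_model = ner_model

    def analyze(self, text: str, artifacts: ToyNlpArtifacts) -> List[Span]:
        return self.ner_model(text)


text = "My name is Alice"

# Flow 1: the engine calls the model, the recognizer collects the result.
flow1_artifacts = ToyNlpEngine(toy_ner_model).process_text(text)
flow1_results = ArtifactCollectingRecognizer().analyze(text, flow1_artifacts)

# Flow 2: the artifacts carry tokens only, the recognizer calls the model.
flow2_artifacts = ToyNlpArtifacts(tokens=text.split())
flow2_results = ModelCallingRecognizer(toy_ner_model).analyze(text, flow2_artifacts)

print(flow1_results)
print(flow2_results)
```

Both flows end up with the same spans in this toy case. The practical difference is that Flow 2 lets several recognizers each wrap their own model, for example one per language.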

@omri374 omri374 reopened this Aug 29, 2023
@omri374
Contributor

omri374 commented Aug 29, 2023

Reopening to improve logic and docs. Will be fixed in #1159

@farnazgh

farnazgh commented Sep 7, 2023

nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}

@omri374 This seems to work for the current model and for English. However, when I try to use French HF models, I get the following error:

ValueError: No matching recognizers were found to serve the request.

These are the changes I made:

import spacy
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
from transformers_recognizer import TransformersRecognizer

FR_MODEL_CONF = {
    "PRESIDIO_SUPPORTED_ENTITIES": ["LOCATION", "PERSON", "ORGANIZATION", "DATE_TIME", "NRP"],
    "DEFAULT_MODEL_PATH": "Jean-Baptiste/camembert-ner-with-dates",
    "DATASET_TO_PRESIDIO_MAPPING": {"DATE": "DATE_TIME", "MISC": "NRP", "PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION"},
    "MODEL_TO_PRESIDIO_MAPPING": {"DATE": "DATE_TIME", "MISC": "NRP", "PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION"},
    "CHUNK_OVERLAP_SIZE": 40,
    "CHUNK_SIZE": 600,
    "ID_SCORE_MULTIPLIER": 0.4,
    "ID_ENTITY_NAME": "ID",
}

registry = RecognizerRegistry()
registry.load_predefined_recognizers()

# Download the small French spaCy model if it is not installed yet
if not spacy.util.is_package("fr_core_news_sm"):
    spacy.cli.download("fr_core_news_sm")

supported_entities = FR_MODEL_CONF.get("PRESIDIO_SUPPORTED_ENTITIES")

model = "Jean-Baptiste/camembert-ner-with-dates"
transformers_recognizer = TransformersRecognizer(
    model_path=model, supported_entities=supported_entities
)
transformers_recognizer.load_transformer(**FR_MODEL_CONF)

registry.add_recognizer(transformers_recognizer)
registry.remove_recognizer("SpacyRecognizer")

nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "fr", "model_name": "fr_core_news_sm"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine)

results = analyzer.analyze(
    text="Je m'appelle jean-baptiste et j'habite à montréal depuis fevr 2012",
    language="fr",
    entities=["LOCATION", "PERSON", "ORGANIZATION", "DATE_TIME", "NRP"],
    return_decision_process=True,
)
for result in results:
    print(result)
    print(result.analysis_explanation)
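
For readers wondering what the MODEL_TO_PRESIDIO_MAPPING entry in the configuration above is for: it translates the model's own labels into Presidio entity names. Below is a minimal sketch of that translation, not the sample recognizer's actual code. The map_predictions helper and the sample predictions are made up for illustration. The `entity_group` and `score` keys follow the output format of a Hugging Face token-classification pipeline with an aggregation strategy, and dropping unmapped labels is an assumption of this sketch.

```python
from typing import Dict, List

# Mapping copied from the configuration above.
MODEL_TO_PRESIDIO_MAPPING: Dict[str, str] = {
    "DATE": "DATE_TIME",
    "MISC": "NRP",
    "PER": "PERSON",
    "ORG": "ORGANIZATION",
    "LOC": "LOCATION",
}


def map_predictions(predictions: List[dict], mapping: Dict[str, str]) -> List[dict]:
    """Hypothetical helper: attach a Presidio entity type to each prediction."""
    mapped = []
    for prediction in predictions:
        presidio_entity = mapping.get(prediction["entity_group"])
        if presidio_entity is None:
            # Assumption of this sketch: labels without a mapping are dropped.
            continue
        mapped.append({**prediction, "entity_type": presidio_entity})
    return mapped


# Hand-written illustrative predictions, not real model output.
sample_predictions = [
    {"entity_group": "PER", "word": "jean-baptiste", "score": 0.9},
    {"entity_group": "LOC", "word": "montréal", "score": 0.9},
    {"entity_group": "XYZ", "word": "depuis", "score": 0.5},
]

mapped = map_predictions(sample_predictions, MODEL_TO_PRESIDIO_MAPPING)
for item in mapped:
    print(item["word"], "->", item["entity_type"])
```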

@LSD-98

LSD-98 commented Sep 7, 2023

Many thanks @omri374 for the reply, very clear.

I tried the same thing last week and had the exact same issue. I did not manage to solve it and moved to another project. I assume there will be an easier way to use HF models when #1159 is pushed!

@omri374
Contributor

omri374 commented Sep 7, 2023

Make sure you pass the language argument to the TransformersRecognizer
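
A toy sketch of why the language matters, using only the standard library. ToyRecognizer and ToyRegistry are hypothetical stand-ins, not Presidio's classes. The sketch assumes that recognizers are selected by the language of the request, and it mirrors the fact that Presidio's EntityRecognizer base class takes a `supported_language` argument defaulting to "en". Whether the sample TransformersRecognizer accepts that argument under the same name depends on the version of the sample, so check its constructor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRecognizer:
    """Hypothetical stand-in for a recognizer; the language defaults to "en"."""

    name: str
    supported_entities: List[str]
    supported_language: str = "en"


class ToyRegistry:
    """Hypothetical registry that selects recognizers by language and entity."""

    def __init__(self) -> None:
        self.recognizers: List[ToyRecognizer] = []

    def add_recognizer(self, recognizer: ToyRecognizer) -> None:
        self.recognizers.append(recognizer)

    def get_recognizers(self, language: str, entities: List[str]) -> List[ToyRecognizer]:
        matches = [
            r
            for r in self.recognizers
            if r.supported_language == language
            and set(r.supported_entities) & set(entities)
        ]
        if not matches:
            raise ValueError("No matching recognizers were found to serve the request.")
        return matches


entities = ["PERSON", "LOCATION"]

# Recognizer created without a language: it silently serves "en" only.
registry_default = ToyRegistry()
registry_default.add_recognizer(ToyRecognizer("transformers", entities))
try:
    registry_default.get_recognizers(language="fr", entities=entities)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Same recognizer, this time declared as serving French.
registry_fr = ToyRegistry()
registry_fr.add_recognizer(ToyRecognizer("transformers", entities, supported_language="fr"))
matches = registry_fr.get_recognizers(language="fr", entities=entities)

print(error_message)
print([r.name for r in matches])
```

In the French example earlier in the thread, the recognizer is created without a language while the request uses language="fr", which would fit this pattern.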

@omri374 omri374 closed this as completed Oct 18, 2024