diff --git a/docs/analyzer/customizing_nlp_models.md b/docs/analyzer/customizing_nlp_models.md index 3e67934c4..5969e86b1 100644 --- a/docs/analyzer/customizing_nlp_models.md +++ b/docs/analyzer/customizing_nlp_models.md @@ -1,11 +1,11 @@ -# Customizing the NLP models in Presidio Analyzer - -Presidio uses NLP engines for two main tasks: NER based PII identification, -and feature extraction for custom rule based logic (such as leveraging context words for improved detection). -While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy), -it can be customized by leveraging other NLP models, either public or proprietary. -These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models), -[Stanza](https://github.com/stanfordnlp/stanza) and +# Customizing the NLP engine in Presidio Analyzer + +Presidio uses NLP engines for two main tasks: NER based PII identification, +and feature extraction for downstream rule based logic (such as leveraging context words for improved detection). +While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy), +additional NLP models and frameworks could be plugged in, either public or proprietary. +These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models), +[Stanza](https://github.com/stanfordnlp/stanza) and [transformers](https://github.com/huggingface/transformers). In addition, other types of NLP frameworks [can be integrated into Presidio](developing_recognizers.md#machine-learning-ml-based-or-rule-based). @@ -63,9 +63,30 @@ Configuration can be done in two ways: - lang_code: es model_name: es_core_news_md + ner_model_configuration: + labels_to_ignore: + - O + model_to_presidio_entity_mapping: + PER: PERSON + LOC: LOCATION + ORG: ORGANIZATION + AGE: AGE + ID: ID + DATE: DATE_TIME + low_confidence_score_multiplier: 0.4 + low_score_entity_names: + - ID + - ORG ``` - The default conf file is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`: + The `ner_model_configuration` section contains the following parameters: + + - `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning. + - `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types. + - `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence. + - `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to. + + The [default conf file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/default.yaml) is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`: ```python from presidio_analyzer import AnalyzerEngine, RecognizerRegistry @@ -97,12 +118,14 @@ Configuration can be done in two ways: c. pass requests in each of these languages. !!! note "Note" - Presidio can currently use one NLP model per language. + Presidio can currently use one NER model per language via the `NlpEngine`. If multiple are required, + consider wrapping NER models as additional recognizers ([see sample here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py)). 
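As a complement to the YAML-based setup described above, the same NER settings can also be assembled in code. The following is a minimal sketch, assuming the `NerModelConfiguration` dataclass exposed by `presidio_analyzer.nlp_engine` accepts the same field names as the YAML keys and that `SpacyNlpEngine` takes a `ner_model_configuration` argument (verify against your installed presidio-analyzer version):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, SpacyNlpEngine

# Mirror the YAML ner_model_configuration section programmatically (assumed field names)
ner_config = NerModelConfiguration(
    labels_to_ignore=["O"],
    model_to_presidio_entity_mapping={
        "PER": "PERSON",
        "LOC": "LOCATION",
        "ORG": "ORGANIZATION",
    },
    low_confidence_score_multiplier=0.4,
    low_score_entity_names=["ORG"],
)

# Wrap a spaCy model with the custom NER configuration
nlp_engine = SpacyNlpEngine(
    models=[{"lang_code": "en", "model_name": "en_core_web_lg"}],
    ner_model_configuration=ner_config,
)

analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])
print(analyzer.analyze(text="My name is Dan and I live in Seattle", language="en"))
```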
## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection In addition to the built-in spaCy/Stanza/transformers capabilities, it is possible to create new recognizers which serve as interfaces to other models. For more information: + - [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb). - [Flair recognizer example](../samples/python/flair_recognizer.py) diff --git a/docs/analyzer/developing_recognizers.md b/docs/analyzer/developing_recognizers.md index 546c0ce35..3772867ce 100644 --- a/docs/analyzer/developing_recognizers.md +++ b/docs/analyzer/developing_recognizers.md @@ -7,7 +7,8 @@ Recognizers define the logic for detection, as well as the confidence a predicti ### Accuracy -Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is be best to note how the recognizer's accuracy was tested, and on which datasets. +Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. +A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets. For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research Github repository](https://github.com/microsoft/presidio-research). !!! note "Note" @@ -22,7 +23,8 @@ Make sure your recognizer doesn't take too long to process text. Anything above ### Environment -When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint. In addition, make sure the license on the 3rd party dependency allows you to use it for any purpose. +When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. +In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint. ## Recognizer Types @@ -32,7 +34,7 @@ Generally speaking, there are three types of recognizers: A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]` to detect a "Title" entity.) -See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer.
The [`PatternRecognizer`](/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input. +See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input. ### Pattern Based @@ -47,36 +49,26 @@ See some examples here: ### Machine Learning (ML) Based or Rule-Based Many PII entities are undetectable using naive approaches like deny-lists or regular expressions. -In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. There are four options for adding ML and rule based recognizers: +In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. -#### Utilize SpaCy or Stanza +#### ML: Utilize SpaCy, Stanza or Transformers -Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy` or `stanza` over other tools if possible. +Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) and [huggingface transformers](https://huggingface.co/docs/transformers/index) as alternatives. To avoid introducing new tools, it is recommended to first try to use `spaCy`, `stanza` or `transformers` over other tools if possible. `spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance. -`spaCy` and `stanza` models could be trained from scratch, used in combination with pre-trained embeddings, or retrained to detect new entities. -When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. +`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned. -#### Utilize Scikit-learn or Similar - -`Scikit-learn` models tend to be fast, but usually have lower accuracy than deep learning methods. However, for well defined problems with well defined features, they can provide very good results. -When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. +In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created. +See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py). #### Apply Custom Logic -In some cases, rule-based logic provides the best way of detecting entities. -The Presidio `EntityRecognizer` API allows you to use `spaCy`/`stanza` extracted features like lemmas, part of speech, dependencies and more to create your logic.
When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. - -#### Deep Learning Based Methods - -Deep learning methods offer excellent detection rates for NER. -They are however more complex to train, deploy and tend to be slower than traditional approaches. -When creating a DL based method for PII detection, there are two main alternatives for integrating it with Presidio: - -1. Create an external endpoint (either local or remote) which is isolated from the `presidio-analyzer` process. On the `presidio-analyzer` side, one would extend the [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) class and implement the network interface between `presidio-analyzer` and the endpoint of the model's container. -2. Integrate the model as an additional [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) within the `presidio-analyzer` flow. +In some cases, rule-based logic provides reasonable ways for detecting entities. +The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic. +When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created. !!! attention "Considerations for selecting one option over another" + - Accuracy. - Ease of integration. - Runtime considerations (For example if the new model requires a GPU). - 3rd party dependencies of the new model vs. the existing `presidio-analyzer` package. diff --git a/docs/analyzer/index.md b/docs/analyzer/index.md index 3a98f8cb2..6412834ad 100644 --- a/docs/analyzer/index.md +++ b/docs/analyzer/index.md @@ -14,42 +14,7 @@ Named Entity Recognition and other types of logic to detect PII in unstructured ## Installation -=== "Using pip" - - !!! note "Note" - Consider installing the Presidio python packages on a virtual environment like venv or conda. - - To get started with Presidio-analyzer, - download the package and the `en_core_web_lg` spaCy model: - - ```sh - pip install presidio-analyzer - python -m spacy download en_core_web_lg - ``` - -=== "Using Docker" - - !!! note "Note" - This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/). - - ```sh - # Download image from Dockerhub - docker pull mcr.microsoft.com/presidio-analyzer - - # Run the container with the default port - docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest - ``` - -=== "From source" - - First, clone the Presidio repo. [See here for instructions](../installation.md#install-from-source). - - Then, build the presidio-analyzer container: - - ```sh - cd presidio-analyzer - docker build . -t presidio/presidio-analyzer - ``` +see [Installing Presidio](../installation.md). 
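To make the "Apply Custom Logic" option from developing_recognizers.md above more concrete, here is a minimal sketch of a rule-based recognizer built on the documented `EntityRecognizer` interface (`load` plus `analyze`). The `TitleRecognizer` name, the deny list and the fixed score are hypothetical choices for illustration only:

```python
from typing import List

from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts


class TitleRecognizer(EntityRecognizer):
    """Hypothetical rule-based recognizer that flags titles using spaCy lemmas."""

    DENY_LIST = {"mr", "mrs", "ms", "dr", "prof"}  # illustrative values only

    def load(self) -> None:
        # Nothing to load for a purely rule-based recognizer
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        results = []
        # nlp_artifacts.tokens carries the spaCy tokens produced by the NLP engine
        for token in nlp_artifacts.tokens:
            if token.lemma_.lower().strip(".") in self.DENY_LIST:
                results.append(
                    RecognizerResult(
                        entity_type="TITLE",
                        start=token.idx,
                        end=token.idx + len(token.text),
                        score=0.7,  # arbitrary confidence for demonstration
                    )
                )
        return results


# Registration sketch:
# analyzer.registry.add_recognizer(TitleRecognizer(supported_entities=["TITLE"]))
```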
## Getting started diff --git a/docs/analyzer/languages.md b/docs/analyzer/languages.md index aee03bcec..7d51dcd17 100644 --- a/docs/analyzer/languages.md +++ b/docs/analyzer/languages.md @@ -64,6 +64,7 @@ analyzer = AnalyzerEngine( analyzer.analyze(text="My name is David", language="en") ``` +Link to LANGUAGES_CONFIG_FILE=[languages-config.yml](https://github.com/microsoft/presidio/blob/main/docs/analyzer/languages-config.yml) ### Automatically install NLP models into the Docker container @@ -73,4 +74,4 @@ update the [conf/default.yaml](https://github.com/microsoft/presidio/blob/main/p the `docker build` phase and the models defined in it are installed automatically. For `transformers` based models, the configuration [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/transformers.yaml). -In addition, make sure the Docker file contains the relevant packages for `transformers`, which are not loaded automatically with Presidio. +A docker file supporting transformers models [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/Dockerfile.transformers). diff --git a/docs/analyzer/nlp_engines/spacy_stanza.md b/docs/analyzer/nlp_engines/spacy_stanza.md index c7e6e9fc8..d0372570f 100644 --- a/docs/analyzer/nlp_engines/spacy_stanza.md +++ b/docs/analyzer/nlp_engines/spacy_stanza.md @@ -30,11 +30,26 @@ For the available models, follow these links: [spaCy](https://spacy.io/usage/mod !!! tip "Tip" For Person, Location and Organization detection, it could be useful to try out the transformers based models (e.g. `en_core_web_trf`) which uses a more modern deep-learning architecture, but is generally slower than the default `en_core_web_lg` model. - ### Configure Presidio to use the pre-trained model Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information. +## How NER results flow within Presidio +This diagram describes the flow of NER results within Presidio, and the relationship between the `SpacyNlpEngine` component and the `SpacyRecognizer` component: +```mermaid +sequenceDiagram + AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text)
to get model results + SpacyNlpEngine->>spaCy: Call spaCy pipeline + spaCy->>SpacyNlpEngine: return entities and other attributes + Note over SpacyNlpEngine: Map entity names to Presidio's,<br>update scores,<br>remove unwanted entities<br>based on NerModelConfiguration + SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts
(Entities, lemmas, tokens, scores etc.) + Note over AnalyzerEngine: Call all recognizers + AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts + Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts + SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult] + +``` + ## Training your own model !!! note "Note" diff --git a/docs/analyzer/nlp_engines/transformers.md b/docs/analyzer/nlp_engines/transformers.md index 89a8a9f37..bee44ea89 100644 --- a/docs/analyzer/nlp_engines/transformers.md +++ b/docs/analyzer/nlp_engines/transformers.md @@ -4,11 +4,26 @@ Presidio's `TransformersNlpEngine` consists of a spaCy pipeline which encapsulat ![image](../../assets/spacy-transformers-ner.png) -Presidio leverages other types of information from spaCy such as tokens, lemmas and part-of-speech. +Presidio leverages other types of information from spaCy such as tokens, lemmas and part-of-speech. Therefore the pipeline returns both the NER model results as well as results from other pipeline components. -!!! warning "Warning" - spaCy and transformers use a different tokenization approach. Therefore, it could be that there is no alignment between the spans identified by a transformers model and the spans created by spaCy. In this cases, there could be cases where the output of the transformers model is different from the output of Presidio's `TransformersNlpEngine` +## How NER results flow within Presidio +This diagram describes the flow of NER results within Presidio, and the relationship between the `TransformersNlpEngine` component and the `TransformersRecognizer` component: +```mermaid +sequenceDiagram + AnalyzerEngine->>TransformersNlpEngine: Call engine.process_text(text)
to get model results + TransformersNlpEngine->>spaCy: Call spaCy pipeline + spaCy->>transformers: call NER model + transformers->>spaCy: get entities + spaCy->>TransformersNlpEngine: return transformers entities<br>+ spaCy attributes + Note over TransformersNlpEngine: Map entity names to Presidio's,<br>update scores,<br>remove unwanted entities<br>based on NerModelConfiguration + TransformersNlpEngine->>AnalyzerEngine: Pass NlpArtifacts
(Entities, lemmas, tokens, scores etc.) + Note over AnalyzerEngine: Call all recognizers + AnalyzerEngine->>TransformersRecognizer: Pass NlpArtifacts + Note over TransformersRecognizer: Extract PII entities out of NlpArtifacts + TransformersRecognizer->>AnalyzerEngine: Return List[RecognizerResult] + +``` ## Adding a new model @@ -17,6 +32,7 @@ As the underlying transformers model, you can choose from either a public pretra ### Using a public pre-trained transformers model #### Downloading a pre-trained model + To download the desired NER model from HuggingFace: ```python @@ -34,30 +50,99 @@ AutoModelForTokenClassification.from_pretrained(transformers_model) ``` Then, also download a spaCy pipeline/model: + ```sh python -m spacy download en_core_web_sm ``` #### Creating a configuration file -Once the models are downloaded, the easiest option would be to create a YAML configuration file. -Note that this file needs to contain both a `spaCy` pipeline name and a transformers model name: + +Once the models are downloaded, one option to configure them is to create a YAML configuration file. +Note that the configuration needs to contain both a `spaCy` pipeline name and a transformers model name. +In addition, different configurations for parsing the results of the transformers model can be added. + +Example configuration (in YAML): ```yaml nlp_engine_name: transformers models: -- -lang_code: en -model_name: - spacy: - transformers: + - + lang_code: en + model_name: + spacy: en_core_web_sm + transformers: StanfordAIMI/stanford-deidentifier-base + +ner_model_configuration: + labels_to_ignore: + - O + aggregation_strategy: simple # "simple", "first", "average", "max" + stride: 16 + alignment_mode: strict # "strict", "contract", "expand" + model_to_presidio_entity_mapping: + PER: PERSON + LOC: LOCATION + ORG: ORGANIZATION + AGE: AGE + ID: ID + EMAIL: EMAIL + PATIENT: PERSON + STAFF: PERSON + HOSP: ORGANIZATION + PATORG: ORGANIZATION + DATE: DATE_TIME + PHONE: PHONE_NUMBER + HCW: PERSON + HOSPITAL: ORGANIZATION + + low_confidence_score_multiplier: 0.4 + low_score_entity_names: + - ID ``` - + Where: -- `` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`. -- The `` is the full path for a huggingface model. Models can be found on [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2` +- `model_name.spacy` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`. +- The `model_name.transformers` is the full path for a huggingface model. Models can be found on [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2` + +The `ner_model_configuration` section contains the following parameters: + +- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning. +- `aggregation_strategy`: The strategy to use when aggregating the results of the transformers model. +- `stride`: The value is the length of the window overlap in transformer tokenizer tokens. +- `alignment_mode`: The strategy to use when aligning the results of the transformers model to the original text. +- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types. +- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence. 
+- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to. + +See more information on parameters on the [spacy-huggingface-pipelines Github repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification). + Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information. +#### Calling the new model + +Once the configuration file is created, it can be used to create a new `TransformersNlpEngine`: + +```python + from presidio_analyzer import AnalyzerEngine, RecognizerRegistry + from presidio_analyzer.nlp_engine import NlpEngineProvider + + # Create configuration containing engine name and models + conf_file = PATH_TO_CONF_FILE + + # Create NLP engine based on configuration + provider = NlpEngineProvider(conf_file=conf_file) + nlp_engine = provider.create_engine() + + # Pass the created NLP engine and supported_languages to the AnalyzerEngine + analyzer = AnalyzerEngine( + nlp_engine=nlp_engine, + supported_languages=["en"] + ) + + results_english = analyzer.analyze(text="My name is Morris", language="en") + print(results_english) +``` + ### Training your own model !!! note "Note" @@ -66,3 +151,8 @@ Once created, see [the NLP configuration documentation](../customizing_nlp_model For more information on model training and evaluation for Presidio, see the [Presidio-Research Github repository](https://github.com/microsoft/presidio-research). To train your own model, see this tutorial: [Train your own transformers model](https://huggingface.co/docs/transformers/training). + +### Using a transformers model as an `EntityRecognizer` + +In addition to the approach described in this document, one can decide to integrate a transformers model as a recognizer. +We allow these two options, as a user might want to have multiple NER models running in parallel. In this case, one can create multiple `EntityRecognizer` instances, each serving a different model, instead of one model used in an `NlpEngine`. [See this sample](../../samples/python/transformers_recognizer/index.md) for more info on integrating a transformers model as a Presidio recognizer and not as a Presidio `NLPEngine`. diff --git a/docs/anonymizer/index.md b/docs/anonymizer/index.md index 78a014508..b0c272a34 100644 --- a/docs/anonymizer/index.md +++ b/docs/anonymizer/index.md @@ -17,40 +17,7 @@ with some other value by applying a certain operator (e.g. replace, mask, redact ## Installation -=== "Using pip" - - !!! note "Note" - Consider installing the Presidio python packages on a virtual environment like venv or conda. - - To install Presidio Anonymizer, run: - - ```sh - pip install presidio-anonymizer - ``` - -=== "Using Docker" - - !!! note "Note" - This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/). - - ```sh - # Download image from Dockerhub - docker pull mcr.microsoft.com/presidio-anonymizer - - # Run the container with the default port - docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest - ``` - -=== "From source" - - First, clone the Presidio repo. [See here for instructions](../installation.md#install-from-source). - - Then, build the presidio-anonymizer container: - - ```sh - cd presidio-anonymizer - docker build . -t presidio/presidio-anonymizer - ``` +see [Installing Presidio](../installation.md). 
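To illustrate the operator-based flow referenced above (replace, mask, redact, etc.), here is a small sketch that feeds pre-computed analyzer results into the `AnonymizerEngine` and applies a `replace` operator; the example text and offsets are arbitrary:

```python
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig, RecognizerResult

engine = AnonymizerEngine()

# RecognizerResult objects would normally come from the analyzer;
# here they are hard-coded to keep the sketch self-contained.
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"})},
)

print(result.text)  # "My name is <PERSON>, <PERSON>"
```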
## Getting started diff --git a/docs/api/analyzer_python.md b/docs/api/analyzer_python.md index 4267add11..9e0665a22 100644 --- a/docs/api/analyzer_python.md +++ b/docs/api/analyzer_python.md @@ -2,5 +2,5 @@ ::: presidio_analyzer handler: python - selection: - docstring_style: sphinx \ No newline at end of file + options: + docstring_style: sphinx diff --git a/docs/api/anonymizer_python.md b/docs/api/anonymizer_python.md index bf0b42832..f59ee1255 100644 --- a/docs/api/anonymizer_python.md +++ b/docs/api/anonymizer_python.md @@ -2,5 +2,5 @@ ::: presidio_anonymizer handler: python - selection: + options: docstring_style: sphinx diff --git a/docs/api/image_redactor_python.md b/docs/api/image_redactor_python.md index 2eb5290b6..33aa583ad 100644 --- a/docs/api/image_redactor_python.md +++ b/docs/api/image_redactor_python.md @@ -1,15 +1,6 @@ # Presidio Image Redactor API Reference -## ImageRedactorEngine class - -::: presidio_image_redactor.ImageRedactorEngine - handler: python - selection: - docstring_style: sphinx - -## ImageAnalyzerEngine class - -::: presidio_image_redactor.ImageAnalyzerEngine +::: presidio_image_redactor handler: python - selection: + options: docstring_style: sphinx diff --git a/docs/faq.md b/docs/faq.md index 37e3afafe..113230ad4 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -1,26 +1,27 @@ # Frequently Asked Questions (FAQ) - [General](#general) - - [What is Presidio?](#what-is-presidio) - - [Why did Microsoft create Presidio?](#why-did-microsoft-create-presidio) - - [Is Microsoft Presidio an official Microsoft product?](#is-microsoft-presidio-an-official-microsoft-product) - - [What is the difference between Presidio and different PII detection services like Azure Text Analytics and Amazon Comprehend?](#what-is-the-difference-between-presidio-and-different-pii-detection-services-like-azure-text-analytics-and-amazon-comprehend) + - [What is Presidio?](#what-is-presidio) + - [Why did Microsoft create Presidio?](#why-did-microsoft-create-presidio) + - [Is Microsoft Presidio an official Microsoft product?](#is-microsoft-presidio-an-official-microsoft-product) + - [What is the difference between Presidio and different PII detection services like Azure Text Analytics and Amazon Comprehend?](#what-is-the-difference-between-presidio-and-different-pii-detection-services-like-azure-text-analytics-and-amazon-comprehend) - [Using Presidio](#using-presidio) - - [How can I start using Presidio?](#how-can-i-start-using-presidio) - - [What are the main building blocks in Presidio?](#what-are-the-main-building-blocks-in-presidio) + - [How can I start using Presidio?](#how-can-i-start-using-presidio) + - [What are the main building blocks in Presidio?](#what-are-the-main-building-blocks-in-presidio) - [Customizing Presidio](#customizing-presidio) - - [How can Presidio be customized to my needs?](#how-can-presidio-be-customized-to-my-needs) - - [What NLP frameworks does Presidio support?](#what-nlp-frameworks-does-presidio-support) - - [Can Presidio be used for Pseudonymization?](#can-presidio-be-used-for-pseudonymization) - - [Does Presidio work on structured/tabular data?](#does-presidio-work-on-structuredtabular-data) + - [How can Presidio be customized to my needs?](#how-can-presidio-be-customized-to-my-needs) + - [What NLP frameworks does Presidio support?](#what-nlp-frameworks-does-presidio-support) + - [Can Presidio be used for Pseudonymization?](#can-presidio-be-used-for-pseudonymization) + - [Does Presidio work on structured/tabular 
data?](#does-presidio-work-on-structuredtabular-data) - [Improving detection accuracy](#improving-detection-accuracy) - - [What can I do if Presidio does not detect some of the PII entities in my data (False Negatives)?](#what-can-i-do-if-presidio-does-not-detect-some-of-the-pii-entities-in-my-data-false-negatives) - - [What can I do if Presidio falsely detects text as PII entities (False Positives)?](#what-can-i-do-if-presidio-falsely-detects-text-as-pii-entities-false-positives) - - [How can I evaluate the performance of my Presidio instance?](#how-can-i-evaluate-the-performance-of-my-presidio-instance) + - [What can I do if Presidio does not detect some of the PII entities in my data (False Negatives)?](#what-can-i-do-if-presidio-does-not-detect-some-of-the-pii-entities-in-my-data-false-negatives) + - [What can I do if Presidio falsely detects text as PII entities (False Positives)?](#what-can-i-do-if-presidio-falsely-detects-text-as-pii-entities-false-positives) + - [How can I evaluate the performance of my Presidio instance?](#how-can-i-evaluate-the-performance-of-my-presidio-instance) - [Deployment](#deployment) - - [How can I deploy Presidio into my environment?](#how-can-i-deploy-presidio-into-my-environment) + - [How can I deploy Presidio into my environment?](#how-can-i-deploy-presidio-into-my-environment) - [Contributing](#contributing) - - [How can I contribute to Presidio?](#how-can-i-contribute-to-presidio) + - [How can I contribute to Presidio?](#how-can-i-contribute-to-presidio) + - [How can I report security vulnerabilities?](#how-can-i-report-security-vulnerabilities) ## General @@ -44,7 +45,7 @@ By developing Presidio, our goals are: ### Is Microsoft Presidio an official Microsoft product? -The authors and maintainers of Presidio come from the [Commercial Software Engineering]([https://microsoft/github.io/code-with-engineering-playbook/cse](https://microsoft.github.io/code-with-engineering-playbook/CSE/)) team. We work with customers on various engineering problems, and have found the proper handling of private and sensitive data a recurring challenge across many customers and industries. +The authors and maintainers of Presidio come from the [Industry Solutions Engineering](https://microsoft.github.io/code-with-engineering-playbook) team. We work with customers on various engineering problems, and have found the proper handling of private and sensitive data a recurring challenge across many customers and industries. !!! note "Note" Microsoft Presidio is not an official Microsoft product. Usage terms are defined in the [repository's license](https://github.com/microsoft/presidio/blob/main/LICENSE). @@ -94,11 +95,11 @@ For more information, see the [docs](https://microsoft.github.io/presidio/analyz ### Can Presidio be used for Pseudonymization? -Pseudonymization is a de-identification technique in which the real data is replaced with fake data. Since there are various ways and approaches for this, we provide a simple [sample](https://microsoft.github.io/presidio/samples/python/example_custom_lambda_anonymizer/) which can be extended for more sophisticated usage. If you have a question or a request on this topic, please open an issue on the repo. +Pseudonymization is a de-identification technique in which the real data is replaced with fake data in a reversible way. 
Since there are various ways and approaches for this, we provide a simple [sample](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py) which can be extended for more sophisticated usage. If you have a question or a request on this topic, please open an issue on the repo. ### Does Presidio work on structured/tabular data? -This is an area we are actively looking into. We have an [example implementation](https://microsoft.github.io/presidio/samples/python/batch_processing/) of using Presidio on structured/semi-structured data. Also see the different discussions on this topic on the [Discussions](https://github.com/microsoft/presidio/discussions) section. If you have a question, suggestion, or a contribution in this area, please reach out by opening an issue, starting a discussion or reaching us directly at presidio@microsoft.com +This is an area we are actively looking into. We have an [example implementation](https://microsoft.github.io/presidio/samples/python/batch_processing/) of using Presidio on structured/semi-structured data. Also see the different discussions on this topic on the [Discussions](https://github.com/microsoft/presidio/discussions) section. If you have a question, suggestion, or a contribution in this area, please reach out by opening an issue, starting a discussion or reaching us directly at ## Improving detection accuracy @@ -133,7 +134,8 @@ The main Presidio modules (analyzer, anonymizer, image-redactor) can be used bot ### How can I contribute to Presidio? -First, review the [contribution guidelines](https://github.com/microsoft/presidio/blob/main/CONTRIBUTING.md), and feel free to reach out by opening an issue, posting a discussion or emailing us at presidio@microsoft.com +First, review the [contribution guidelines](https://github.com/microsoft/presidio/blob/main/CONTRIBUTING.md), and feel free to reach out by opening an issue, posting a discussion or emailing us at + +### How can I report security vulnerabilities? -### How can I report security vulnerabilities? Please see the [security information](https://github.com/microsoft/presidio/blob/main/SECURITY.md). diff --git a/docs/getting_started.md b/docs/getting_started.md index 49def7a26..2339bc79c 100644 --- a/docs/getting_started.md +++ b/docs/getting_started.md @@ -2,9 +2,10 @@ ## Simple flow -Using Presidio's modules as Python packages to get started +Using Presidio's modules as Python packages to get started: + +===+ "Anonymize PII in text (Default spaCy model)" -=== "Anonymize PII in text" 1. Install Presidio @@ -41,6 +42,56 @@ Using Presidio's modules as Python packages to get started print(anonymized_text) ``` +=== "Anonymize PII in text (transformers)" + + 1. Install Presidio + + ```sh + pip install "presidio-analyzer[transformers]" + pip install presidio-anonymizer + python -m spacy download en_core_web_sm + ``` + + 2. Analyze + Anonymize + + ```py + from presidio_analyzer import AnalyzerEngine + from presidio_analyzer.nlp_engine import TransformersNlpEngine + from presidio_anonymizer import AnonymizerEngine + + text = "My name is Don and my phone number is 212-555-5555" + + # Define which transformers model to use + model_config = [{"lang_code": "en", "model_name": { + "spacy": "en_core_web_sm", # use a small spaCy model for lemmas, tokens etc. 
+ "transformers": "dslim/bert-base-NER" + } + }] + + nlp_engine = TransformersNlpEngine(models=model_config) + + # Set up the engine, loads the NLP module (spaCy model by default) + # and other PII recognizers + analyzer = AnalyzerEngine(nlp_engine=nlp_engine) + + # Call analyzer to get results + results = analyzer.analyze(text=text, language='en') + print(results) + + # Analyzer results are passed to the AnonymizerEngine for anonymization + + anonymizer = AnonymizerEngine() + + anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results) + + print(anonymized_text) + + ``` + !!! tip "Tip: Downloading models" + If not available, the transformers model and the spacy model would be downloaded on the first call to the `AnalyzerEngine`. To pre-download, see [this doc](./analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model). + +## Simple flow: Images + === "Anonymize PII in images" 1. Install presidio-image-redactor diff --git a/docs/index.md b/docs/index.md index 50a098a4a..3c7c1ae1c 100644 --- a/docs/index.md +++ b/docs/index.md @@ -46,12 +46,12 @@ bitcoin wallets, US phone numbers, financial data and more. ## Running Presidio -1. [Running Presidio via code](samples/python/index.md) +1. [Samples for running Presidio via code](samples/index.md) 2. [Running Presidio as an HTTP service](samples/docker/index.md) 3. [Setting up a development environment](development.md) 4. [Perform PII identification using presidio-analyzer](analyzer/index.md) -5. [Perform PII anonymization using presidio-anonymizer](anonymizer/index.md) -6. [Perform PII identification and anonymization in images using presidio-image-redactor](image-redactor/index.md) +5. [Perform PII de-identification using presidio-anonymizer](anonymizer/index.md) +6. [Perform PII identification and redaction in images using presidio-image-redactor](image-redactor/index.md) 7. [Example deployments](samples/deployments/index.md) --- diff --git a/docs/installation.md b/docs/installation.md index dcaf66b83..8a0b6fb01 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -2,17 +2,16 @@ ## Description -This document describes how to download and install the Presidio services locally. -As Presidio is comprised of several packages/services, -this document describes the installation of the entire +This document describes the installation of the entire Presidio suite using `pip` (as Python packages) or using `Docker` (As containerized services). ## Using pip !!! note "Note" - Consider installing the Presidio python packages - on a virtual environment like [venv](https://docs.python.org/3/tutorial/venv.html) - or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html). + + Consider installing the Presidio python packages + in a virtual environment like [venv](https://docs.python.org/3/tutorial/venv.html) + or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html). ### Supported Python Versions @@ -26,20 +25,42 @@ Presidio is supported for the following python versions: ### PII anonymization on text -For PII anonymization on text, install the `presidio-analyzer` and `presidio-anonymizer` packages: +For PII anonymization on text, install the `presidio-analyzer` and `presidio-anonymizer` packages +with at least one NLP engine (`spaCy`, `transformers` or `stanza`): -```sh -pip install presidio_analyzer -pip install presidio_anonymizer +===+ "spaCy (default)" -# Presidio analyzer requires a spaCy language model. 
-python -m spacy download en_core_web_lg -``` + ``` + pip install presidio_analyzer + pip install presidio_anonymizer + python -m spacy download en_core_web_lg + ``` + +=== "Transformers" -For a more detailed installation of each package, refer to the specific documentation: + ``` + pip install "presidio_analyzer[transformers]" + pip install presidio_anonymizer + python -m spacy download en_core_web_sm + ``` -* [presidio-analyzer](analyzer/index.md). -* [presidio-anonymizer](anonymizer/index.md). + !!! note "Note" + + When using a transformers NLP engine, Presidio would still use spaCy for other capabilities, + therefore a small spaCy model (such as en_core_web_sm) is required. + Transformers models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model) + +=== "Stanza" + + ``` + pip install "presidio_analyzer[stanza]" + pip install presidio_anonymizer + ``` + + + !!! note "Note" + + Stanza models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/spacy_stanza.md#download-the-pre-trained-model). ### PII redaction in images @@ -53,8 +74,6 @@ pip install presidio_image_redactor python -m spacy download en_core_web_lg ``` -[Click here](image-redactor/index.md) for more information on the presidio-image-redactor package. - ## Using Docker Presidio can expose REST endpoints for each service using Flask and Docker. diff --git a/docs/samples/index.md b/docs/samples/index.md index c46e5fdfa..3d7f462e1 100644 --- a/docs/samples/index.md +++ b/docs/samples/index.md @@ -1,29 +1,30 @@ # Samples -| Topic | Type | Sample | -| :---------- |:--------------------------------------| :---------------------------------------------------------------------------------------------------------------------------------------------- | -| Usage | Python Notebook | [Presidio Basic Usage Notebook](python/presidio_notebook.ipynb) | -| Usage | Python Notebook | [Customizing Presidio Analyzer](python/customizing_presidio_analyzer.ipynb) | -| Usage | Python Notebook | [Analyzing structured / semi-structured data in batch](python/batch_processing.ipynb)| -| Usage | Python Notebook | [Encrypting and Decrypting identified entities](python/encrypt_decrypt.ipynb)| -| Usage | Python Notebook | [Getting the identified entity value using a custom Operator](python/getting_entity_values.ipynb)| -| Usage | Python Notebook | [Anonymizing known values](https://github.com/microsoft/presidio/blob/main/docs/samples/python/Anonymizing%20known%20values.ipynb) -| Usage | Python Notebook | [Redacting text PII from DICOM images](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_dicom_image_redactor.ipynb) -| Usage | Python Notebook | [Using an allow list with image redaction](https://github.com/microsoft/presidio/blob/main/docs/samples/python/image_redaction_allow_list_approach.ipynb) -| Usage | Python Notebook | [Annotating PII in a PDF](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_pdf_annotation.ipynb) -| Usage | Python Notebook | [Integrating with external services](https://github.com/microsoft/presidio/blob/main/docs/samples/python/integrating_with_external_services.ipynb) | -| Usage | Python | [Remote Recognizer](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py) | -| Usage | Python | [Text Analytics as a Remote 
Recognizer](https://github.com/microsoft/presidio/blob/main/docs/samples/python/text_analytics/index.md) | -| Usage | Python | [Analyze and Anonymize CSV file](https://github.com/microsoft/presidio/blob/main/docs/samples/python/process_csv_file.py) | -| Usage | Python | [Using Flair as an external PII model](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py)| -| Usage | Python | [Using Transformers as an external PII model](python/transformers_recognizer/index.md)| -| Usage | Python | [Passing a lambda as a Presidio anonymizer using Faker](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py)| -| Usage | REST API (postman) | [Presidio as a REST endpoint](docker/index.md)| -| Deployment | App Service | [Presidio with App Service](deployments/app-service/index.md)| -| Deployment | Kubernetes | [Presidio with Kubernetes](deployments/k8s/index.md)| -| Deployment | Spark/Azure Databricks | [Presidio with Spark](deployments/spark/index.md)| -| Deployment | Azure Data Factory with App Service | [ETL for small dataset](deployments/data-factory/presidio-data-factory.md#option-1-presidio-as-an-http-rest-endpoint) | -| Deployment | Azure Data Factory with Databricks | [ETL for large datasets](deployments/data-factory/presidio-data-factory.md#option-2-presidio-on-azure-databricks) | -| ADF Pipeline | Azure Data Factory | [Add Presidio as an HTTP service to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-http.md) | -| ADF Pipeline | Azure Data Factory | [Add Presidio on Databricks to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-databricks.md) | -| Demo | Streamlit | [Create a simple demo app using Streamlit](python/streamlit/index.md) +| Topic | Data Type |Resource | Sample | +| :---------- |:--------------------------------------| :---------------------------------| :---------------------------------------------------------------------------------------------------------------------------------------------- | +| Usage | Text | Python Notebook | [Presidio Basic Usage Notebook](https://github.com/microsoft/presidio/blob/main/docs/samples//python/presidio_notebook.ipynb) | +| Usage | Text | Python Notebook | [Customizing Presidio Analyzer](https://github.com/microsoft/presidio/blob/main/docs/samples//python/customizing_presidio_analyzer.ipynb) | +| Usage | Semi-structured | Python Notebook | [Analyzing structured / semi-structured data in batch](https://github.com/microsoft/presidio/blob/main/docs/samples//python/batch_processing.ipynb)| +| Usage | Text | Python Notebook | [Encrypting and Decrypting identified entities](https://github.com/microsoft/presidio/blob/main/docs/samples//python/encrypt_decrypt.ipynb)| +| Usage | Text | Python Notebook | [Getting the identified entity value using a custom Operator](https://github.com/microsoft/presidio/blob/main/docs/samples/python/getting_entity_values.ipynb)| +| Usage | text | Python Notebook | [Anonymizing known values](https://github.com/microsoft/presidio/blob/main/docs/samples/python/Anonymizing%20known%20values.ipynb) +| Usage | Images | Python Notebook | [Redacting Text PII from DICOM images](python/example_dicom_image_redactor.ipynb) +| Usage | Images | Python Notebook | [Using an allow list with image redaction](https://github.com/microsoft/presidio/blob/main/docs/samples/python/image_redaction_allow_list_approach.ipynb) +| Usage | PDF | Python Notebook | [Annotating PII in a 
PDF](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_pdf_annotation.ipynb) +| Usage | Images | Python Notebook | [Plot custom bounding boxes](https://github.com/microsoft/presidio/blob/main/docs/samples/python/plot_custom_bboxes.ipynb) +| Usage | Text | Python Notebook | [Integrating with external services](https://github.com/microsoft/presidio/blob/main/docs/samples/python/integrating_with_external_services.ipynb) | +| Usage | Text | Python file | [Remote Recognizer](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py) | +| Usage | Text | Python file | [Azure AI Language as a Remote Recognizer](python/text_analytics/index.md) | +| Usage | CSV | Python file | [Analyze and Anonymize CSV file](https://github.com/microsoft/presidio/blob/main/docs/samples/python/process_csv_file.py) | +| Usage | Text | Python | [Using Flair as an external PII model](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py)| +| Usage | Text | Python file | [Using Transformers as an external PII model](python/transformers_recognizer/index.md)| +| Usage | Text | Python file | [Passing a lambda as a Presidio anonymizer using Faker](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py)| +| Usage | | REST API (postman) | [Presidio as a REST endpoint](docker/index.md)| +| Deployment | | App Service | [Presidio with App Service](deployments/app-service/index.md)| +| Deployment | | Kubernetes | [Presidio with Kubernetes](deployments/k8s/index.md)| +| Deployment | | Spark/Azure Databricks | [Presidio with Spark](deployments/spark/index.md)| +| Deployment | | Azure Data Factory with App Service | [ETL for small dataset](deployments/data-factory/presidio-data-factory.md#option-1-presidio-as-an-http-rest-endpoint) | +| Deployment | | Azure Data Factory with Databricks | [ETL for large datasets](deployments/data-factory/presidio-data-factory.md#option-2-presidio-on-azure-databricks) | +| ADF Pipeline | | Azure Data Factory | [Add Presidio as an HTTP service to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-http.md) | +| ADF Pipeline | | Azure Data Factory | [Add Presidio on Databricks to your Azure Data Factory](deployments/data-factory/presidio-data-factory-template-gallery-databricks.md) | +| Demo | | Streamlit app | [Create a simple demo app using Streamlit](python/streamlit/index.md) diff --git a/docs/samples/python/Anonymizing known values.ipynb b/docs/samples/python/Anonymizing known values.ipynb index 353efae3d..6f6d09200 100644 --- a/docs/samples/python/Anonymizing known values.ipynb +++ b/docs/samples/python/Anonymizing known values.ipynb @@ -10,7 +10,9 @@ "outputs": [], "source": [ "# download presidio\n", - "!pip install presidio_analyzer presidio_anonymizer" + "!pip install presidio_analyzer presidio_anonymizer\n", + "\n", + "!python -m spacy download en_core_web_lg" ] }, { diff --git a/docs/samples/python/batch_processing.ipynb b/docs/samples/python/batch_processing.ipynb index ad96a1a02..ab836f031 100644 --- a/docs/samples/python/batch_processing.ipynb +++ b/docs/samples/python/batch_processing.ipynb @@ -10,7 +10,7 @@ "outputs": [], "source": [ "# download presidio\n", - "!pip install presidio_analyzer presidio_anonymizer", + "!pip install presidio_analyzer presidio_anonymizer\n", "!python -m spacy download en_core_web_lg" ] }, diff --git a/docs/samples/python/customizing_presidio_analyzer.ipynb 
b/docs/samples/python/customizing_presidio_analyzer.ipynb index 86c841d67..09173d7b1 100644 --- a/docs/samples/python/customizing_presidio_analyzer.ipynb +++ b/docs/samples/python/customizing_presidio_analyzer.ipynb @@ -14,6 +14,7 @@ "# Customizing the PII analysis process in Microsoft Presidio\n", "\n", "This notebooks covers different customization use cases to:\n", + "\n", "1. Adapt Presidio to detect new types of PII entities\n", "2. Adapt Presidio to detect PII entities in a new language\n", "3. Embed new types of detection modules into Presidio, to improve the coverage of the service." @@ -37,7 +38,7 @@ "outputs": [], "source": [ "# download presidio\n", - "!pip install presidio_analyzer presidio_anonymizer", + "!pip install presidio_analyzer presidio_anonymizer\n", "!python -m spacy download en_core_web_lg" ] diff --git a/docs/samples/python/encrypt_decrypt.ipynb b/docs/samples/python/encrypt_decrypt.ipynb index 8e9deff74..a6a09b655 100644 --- a/docs/samples/python/encrypt_decrypt.ipynb +++ b/docs/samples/python/encrypt_decrypt.ipynb @@ -10,7 +10,7 @@ "outputs": [], "source": [ "# download presidio\n", - "!pip install presidio_analyzer presidio_anonymizer", + "!pip install presidio_analyzer presidio_anonymizer\n", "!python -m spacy download en_core_web_lg" ] }, @@ -252,4 +252,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/docs/samples/python/image_redaction_allow_list_approach.ipynb b/docs/samples/python/image_redaction_allow_list_approach.ipynb index fc7b38166..91ed0f2f6 100644 --- a/docs/samples/python/image_redaction_allow_list_approach.ipynb +++ b/docs/samples/python/image_redaction_allow_list_approach.ipynb @@ -146,7 +146,7 @@ "metadata": {}, "source": [ "### 1.2 DICOM medical image\n", - "For more information on DICOM image redaction, please see [example_dicom_image_redactor.ipynb](./example_dicom_image_redactor.ipynb) and the [Image redactor module documentation](../../../image-redactor/index.md)." + "For more information on DICOM image redaction, please see [example_dicom_image_redactor.ipynb](./example_dicom_image_redactor.ipynb) and the [Image redactor module documentation](../../image-redactor/index.md)." ] }, { diff --git a/docs/samples/python/index.md b/docs/samples/python/index.md deleted file mode 100644 index 7b55550f4..000000000 --- a/docs/samples/python/index.md +++ /dev/null @@ -1,21 +0,0 @@ -# Using Presidio in a Python script - -## Description - -Presidio service can be used as python packages inside python scripts - -## Table of contents - -1. [Simple analysis and anonymization](presidio_notebook.ipynb) -2. [Developing new PII recognizers](customizing_presidio_analyzer.ipynb) -3. [Remote Recognizer](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py) -4. [Azure Text Analytics Integration](text_analytics/index.md) -5. [Anonymizing known values](Anonymizing%20known%20values.ipynb) -6. [Redacting text PII from DICOM images](example_dicom_image_redactor.ipynb) -7. [Annotating PII in a PDF](example_pdf_annotation.ipynb) -8. [Custom Anonymizer with lambda expression](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/example_custom_lambda_anonymizer.py) -9. [Running Presidio on structured / semi-structured data in batch](batch_processing.ipynb) -10. [Getting the detected text value using a custom operator](getting_entity_values.ipynb) -11. [Creating a simple demo website](streamlit/index.md) -12. 
[Using Flair as an external PII model](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py) -13. [Using Transformers as an external PII model](transformers_recognizer/index.md) diff --git a/docs/samples/python/presidio_notebook.ipynb b/docs/samples/python/presidio_notebook.ipynb index 5e7cae687..23747e786 100644 --- a/docs/samples/python/presidio_notebook.ipynb +++ b/docs/samples/python/presidio_notebook.ipynb @@ -9,7 +9,7 @@ "outputs": [], "source": [ "# download presidio\n", - "!pip install presidio_analyzer presidio_anonymizer", + "!pip install presidio_analyzer presidio_anonymizer\n", "!python -m spacy download en_core_web_lg" ] }, diff --git a/docs/samples/python/transformers_recognizer/index.md b/docs/samples/python/transformers_recognizer/index.md index bd9e46679..7c31b446d 100644 --- a/docs/samples/python/transformers_recognizer/index.md +++ b/docs/samples/python/transformers_recognizer/index.md @@ -1,24 +1,30 @@ -# Run Presidio With Transformers Models +# Add a Transformers model based EntityRecognizer + +!!! note "Note" + + This example demonstrates how to create a **Presidio Recognizer**. + To integrate a transformers model as a **Presidio NLP Engine**, see [this documentation](../../../analyzer/nlp_engines/transformers.md). + + We allow these two options, as a user might want to have multiple NER models running in parallel. In this case, one can create multiple `EntityRecognizer` instances, each serving a different model. If you only plan to use one NER model, consider creating a [`TransformersNlpEngine`](../../../analyzer/nlp_engines/transformers.md) instead of the [`TransformersRecognizer`](https://github.com/microsoft/presidio/blob/main/docs/samples/python/transformers_recognizer/transformer_recognizer.py) described in this document. -This example demonstrates how to extract PII entities using transformers models. When initializing the `TransformersRecognizer`, choose from the following options: -1. A string referencing an uploaded model to HuggingFace. Use this url to access all TokenClassification models - https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads + +1. A string referencing an uploaded model to HuggingFace. See the different available options for models [here](https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads). 2. Initialize your own `TokenClassificationPipeline` instance using your custom transformers model and use it for inference. 3. Provide the path to your own local custom trained model. !!! note "Note" -For each combination of model & dataset, it is recommended to create a configuration object which includes setting necessary parameters for getting the correct results. Please reference this [configuraion.py](configuration.py) file for examples. + For each combination of model & dataset, it is recommended to create a configuration object which includes setting necessary parameters for getting the correct results. Please reference this [configuration.py](https://github.com/microsoft/presidio/blob/main/docs/samples/python/transformers_recognizer/configuration.py) file for examples. - - - -### Example Code +## Example Code This example code uses a `TransformersRecognizer` for NER, and removes the default `SpacyRecognizer`. In order to be able to use spaCy features such as lemmas, we introduce the small (and faster) `en_core_web_sm` model.
+[link to full TransformersRecognizer code](https://github.com/microsoft/presidio/blob/main/docs/samples/python/transformers_recognizer/transformer_recognizer.py) + ```python from presidio_analyzer import AnalyzerEngine, RecognizerRegistry from presidio_analyzer.nlp_engine import NlpEngineProvider diff --git a/docs/text_anonymization.md b/docs/text_anonymization.md index a73a48cf3..1f989b2b0 100644 --- a/docs/text_anonymization.md +++ b/docs/text_anonymization.md @@ -2,8 +2,8 @@ Presidio's features two main modules for anonymization PII in text: -- [Presidio analyzer](analyzer/index.md): Identification PII in text -- [Presidio anonymizer](anonymizer/index.md): Anonymize detected PII entities using different operators +- [Presidio analyzer](analyzer/index.md): Identification of PII in text +- [Presidio anonymizer](anonymizer/index.md): De-identify detected PII entities using different operators In most cases, we would run the Presidio analyzer to detect where PII entities exist, and then the Presidio anonymizer to remove those using specific operators (such as redact, replace, hash or encrypt) @@ -14,4 +14,3 @@ This figure presents the overall flow in high level: - The [Presidio Analyzer](analyzer/index.md) holds multiple recognizers, each one capable of detecting specific PII entities. These recognizers leverage regular expressions, deny lists, checksum, rule based logic, Named Entity Recognition ML models and context from surrounding words. - The [Presidio Anonymizer](anonymizer/index.md) holds multiple operators, each one can be used to anonymize the PII entity in a different way. Additionally, it can be used to de-anonymize an already anonymized entity (For example, decrypt an encrypted entity) - diff --git a/docs/tutorial/04_external_services.md b/docs/tutorial/04_external_services.md index 92990e0c4..47d231c08 100644 --- a/docs/tutorial/04_external_services.md +++ b/docs/tutorial/04_external_services.md @@ -13,5 +13,5 @@ In a similar way to example 3, we can write logic to call external services for ## Calling a model in a different framework -- [This example](../samples/python/flair_recognizer.py) shows a Presidio wrapper for a Flair model. +- [This example](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py) shows a Presidio wrapper for a Flair model. - Using a similar approach, we could create wrappers for HuggingFace models, Conditional Random Fields or any other framework. diff --git a/docs/tutorial/05_languages.md b/docs/tutorial/05_languages.md index d5d22be60..c3a019af1 100644 --- a/docs/tutorial/05_languages.md +++ b/docs/tutorial/05_languages.md @@ -47,10 +47,10 @@ print("Results from English request:") print(results_english) ``` -[See this documentation](https://microsoft.github.io/presidio/analyzer/languages/) for more details on how to configure Presidio support additional NLP models and languages. +[See this documentation](https://microsoft.github.io/presidio/analyzer/languages/) for more details on setting up additional NLP models and languages. ## Using external models/frameworks -Some languages are not supported by spaCy/Stanza, or have very limited support in those. In this case, other frameworks could be leveraged. (see [example 4](04_external_services.md) for more information). +Some languages are not supported by spaCy/Stanza/huggingface, or have very limited support in those. In this case, other frameworks could be leveraged. (see [example 4](04_external_services.md) for more information). 
Since Presidio requires a spaCy model to be passed, we propose to use a simple spaCy pipeline such as `en_core_web_sm` as the NLP engine's model, and a recognizer calling an external framework/service as the Named Entity Recognition (NER) model. diff --git a/docs/tutorial/index.md b/docs/tutorial/index.md index 6d2ea9045..63af3e4d0 100644 --- a/docs/tutorial/index.md +++ b/docs/tutorial/index.md @@ -16,7 +16,7 @@ This tutorials covers different customization use cases to: - [Supporting new models and languages](05_languages.md) - [Calling an external service for PII detection](04_external_services.md) - [Using context words](06_context.md) -- [Tracing the decision process](07_decision_process) +- [Tracing the decision process](07_decision_process.md) - [Loading recognizers from file](08_no_code.md) - [Ad-Hoc recognizers](09_ad_hoc.md) - [Simple anonymization](10_simple_anonymization.md) diff --git a/mkdocs.yml b/mkdocs.yml index f5774ccfb..3fe12f9d0 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -10,25 +10,25 @@ nav: - Home: index.md - Installation: installation.md - Quickstart: getting_started.md - - Step by step tutorial: - - Home: tutorial/index.md - - Getting started: tutorial/00_getting_started.md - - Deny-list recognizers: tutorial/01_deny_list.md - - Regex recognizers: tutorial/02_regex.md - - Rule-based recognizers: tutorial/03_rule_based.md - - Additional models/languages: tutorial/05_languages.md - - External services: tutorial/04_external_services.md - - Context enhancement: tutorial/06_context.md - - Decision process: tutorial/07_decision_process.md - - No-code recognizers: tutorial/08_no_code.md - - Ad-hoc recognizers: tutorial/09_ad_hoc.md - - Simple anonymization: tutorial/10_simple_anonymization.md - - Custom anonymization: tutorial/11_custom_anonymization.md - - Encryption/Decryption: tutorial/12_encryption.md - - Allow-lists: tutorial/13_allow_list.md - - Handling text: - Home: text_anonymization.md + - Step by step tutorial: + - Home: tutorial/index.md + - Getting started: tutorial/00_getting_started.md + - Deny-list recognizers: tutorial/01_deny_list.md + - Regex recognizers: tutorial/02_regex.md + - Rule-based recognizers: tutorial/03_rule_based.md + - Additional models/languages: tutorial/05_languages.md + - External services: tutorial/04_external_services.md + - Context enhancement: tutorial/06_context.md + - Decision process: tutorial/07_decision_process.md + - No-code recognizers: tutorial/08_no_code.md + - Ad-hoc recognizers: tutorial/09_ad_hoc.md + - Simple anonymization: tutorial/10_simple_anonymization.md + - Custom anonymization: tutorial/11_custom_anonymization.md + - Encryption/Decryption: tutorial/12_encryption.md + - Allow-lists: tutorial/13_allow_list.md + - Presidio Analyzer: - Home: analyzer/index.md - Developing PII recognizers: @@ -46,22 +46,24 @@ nav: - Handling images: - Home: image-redactor/index.md - Evaluating DICOM redaction: image-redactor/evaluating_dicom_redaction.md - - Supported entities: supported_entities.md - - Development and design: - - Design: design.md - - Setting up a development environment: development.md - - Build and release process: build_release.md - - Changes from V1 to V2: presidio_V2.md - - Python API reference: - - Home: api.md - - Presidio Analyzer Python API: api/analyzer_python.md - - Presidio Anonymizer Python API: api/anonymizer_python.md - - Presidio Image Redactor Python API: api/image_redactor_python.md - - REST API reference: https://microsoft.github.io/presidio/api-docs/api-docs.html" target="_blank - - Samples: 
samples/index.md - - Community: community.md - - FAQ: faq.md - - Demo: https://huggingface.co/spaces/presidio/presidio_demo" target="_blank + - Samples: samples/index.md + - General: + - Supported entities: supported_entities.md + - Development and design: + - Design: design.md + - Setting up a development environment: development.md + - Build and release process: build_release.md + - Changes from V1 to V2: presidio_V2.md + - Python API reference: + - Home: api.md + - Presidio Analyzer Python API: api/analyzer_python.md + - Presidio Anonymizer Python API: api/anonymizer_python.md + - Presidio Image Redactor Python API: api/image_redactor_python.md + - REST API reference: https://microsoft.github.io/presidio/api-docs/api-docs.html" target="_blank + + - Community: community.md + - FAQ: faq.md + - Demo: https://huggingface.co/spaces/presidio/presidio_demo" target="_blank theme: name: material custom_dir: overrides @@ -79,6 +81,7 @@ theme: features: - navigation.instant - content.tabs.link + # - navigation.sections # - navigation.tabs # - navigation.tabs.sticky plugins: @@ -111,3 +114,8 @@ markdown_extensions: - pymdownx.pathconverter - pymdownx.tabbed: alternate_style: true + - pymdownx.superfences: + custom_fences: + - name: mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format diff --git a/presidio-analyzer/tests/test_stanza_recognizer.py b/presidio-analyzer/tests/test_stanza_recognizer.py index 9f41968f1..1ed14323b 100644 --- a/presidio-analyzer/tests/test_stanza_recognizer.py +++ b/presidio-analyzer/tests/test_stanza_recognizer.py @@ -76,10 +76,10 @@ def test_when_using_stanza_then_all_stanza_result_correct( @pytest.mark.skip_engine("stanza_en") def test_when_person_in_text_then_person_full_name_complex_found( - spacy_nlp_engine, nlp_recognizer, entities + stanza_nlp_engine, nlp_recognizer, entities ): text = "Richard (Rick) C. Henderson" - results = prepare_and_analyze(spacy_nlp_engine, nlp_recognizer, text, entities) + results = prepare_and_analyze(stanza_nlp_engine, nlp_recognizer, text, entities) assert len(results) > 0