New NlpEngine - docs #1177

Merged
merged 78 commits into from
Oct 25, 2023
a59d67b
integrating spacy-huggingface-pipeliens and refactoring NlpEngine logic
omri374 Aug 28, 2023
cf44222
Update languages-config.yml
omri374 Aug 28, 2023
7db2320
Update customizing_nlp_models.md
omri374 Aug 28, 2023
b8b7be7
Merge remote-tracking branch 'origin/main' into omri/new_transformers…
omri374 Aug 28, 2023
ede9f9a
Update customizing_nlp_models.md
omri374 Aug 28, 2023
d3edce8
Update default.yaml
omri374 Aug 28, 2023
8508c00
Update languages-config.yml
omri374 Aug 28, 2023
800843d
Update default.yaml
omri374 Aug 28, 2023
f066c31
Update spacy_multilingual.yaml
omri374 Aug 28, 2023
03a7ed8
Update stanza.yaml
omri374 Aug 28, 2023
971c7ec
Update stanza_multilingual.yaml
omri374 Aug 28, 2023
1263f21
default_score from config + more logging
omri374 Aug 28, 2023
5d6de3c
Merge remote-tracking branch 'origin/omri/new_transformers_engine' in…
omri374 Aug 28, 2023
b353c5c
flake8 updates
omri374 Aug 28, 2023
a740c2d
added en_core_web_sm for transformers pipelines
omri374 Aug 28, 2023
b6170b6
removed type_checking option
omri374 Aug 28, 2023
b924242
Merge remote-tracking branch 'origin/omri/new_transformers_engine' in…
omri374 Aug 29, 2023
f0a9924
add transformers_recognizer test
omri374 Aug 29, 2023
28472ab
formatting
omri374 Aug 29, 2023
39103e2
updated docstring
omri374 Aug 29, 2023
1aca153
revert formatting
omri374 Aug 29, 2023
db7c45f
revert formatting
omri374 Aug 29, 2023
e8a814f
ignore type checking errors (TC001 TC002 TC003)
omri374 Aug 29, 2023
f2cd479
small updates to docs
omri374 Aug 29, 2023
5dc5930
update to mkdocs to support tabs in v8
omri374 Aug 29, 2023
5ce545f
added trasnformers extra
omri374 Aug 29, 2023
fcd8ef6
fixed extras
omri374 Aug 29, 2023
b5d56f6
updated extras
omri374 Aug 29, 2023
7364052
Update installation.md
omri374 Aug 29, 2023
7aa5df0
Update getting_started.md
omri374 Aug 29, 2023
1610685
added comment on lazy downloading
omri374 Aug 29, 2023
f2d5843
Update getting_started.md
omri374 Aug 29, 2023
cf85101
revert conf to reduce PR size
omri374 Aug 31, 2023
5a4bb29
Simplified logic between spacy and trasnformers nlp engines
omri374 Aug 31, 2023
7e05bf0
flake8
omri374 Aug 31, 2023
865fae0
Update Pipfile
omri374 Aug 31, 2023
67bf43f
fixed wrong key name
omri374 Aug 31, 2023
01d227c
Merge remote-tracking branch 'origin/omri/new_transformers_engine' in…
omri374 Aug 31, 2023
3724847
line width
omri374 Aug 31, 2023
f063d05
Merge branch 'main' into omri/new_transformers_engine
omri374 Sep 12, 2023
53e0196
Updates to tests and docs
omri374 Sep 18, 2023
b194c92
updates to tests and docs
omri374 Sep 18, 2023
0604028
revert tests to separate PRs
omri374 Sep 18, 2023
35b99e0
revert code to separate PRs
omri374 Sep 18, 2023
7dbd2be
Updates to NlpEngine - tests (#1176)
omri374 Sep 18, 2023
5f9cab6
updates to Stanza NLP engine + tests
omri374 Sep 18, 2023
1c556c9
tests fix
omri374 Sep 19, 2023
b58a799
linting
omri374 Sep 19, 2023
62a9d93
Merge branch 'main' into omri/new_transformers_engine
omri374 Sep 19, 2023
3fb2494
added GPE to mapping
omri374 Sep 19, 2023
0200d1a
reverted installation.md
omri374 Sep 20, 2023
49b0562
reverted getting_started.md
omri374 Sep 20, 2023
023cc8f
Update spacy.yaml
omri374 Sep 20, 2023
f81e3c3
Update spacy_multilingual.yaml
omri374 Sep 20, 2023
5b59e87
Merge branch 'main' into omri/new_transformers_engine
omri374 Sep 20, 2023
2a81f87
changed alignment_model to expand
omri374 Sep 21, 2023
72e7c2b
Update ner_model_configuration.py
omri374 Sep 21, 2023
a338b99
minor changes after more testing
omri374 Sep 26, 2023
52cce16
revert recognizer name change, no need.
omri374 Sep 26, 2023
8238cbd
removed unnecessary field
omri374 Sep 26, 2023
d60315a
updates to docs for new NLP engine and in general
omri374 Sep 26, 2023
79434a2
newline
omri374 Sep 26, 2023
d4839c1
Merge branch 'omri/new_transformers_engine' into omri/new_transformer…
omri374 Sep 26, 2023
d46af19
Merge branch 'main' into omri/new_transformers_engine_docs
omri374 Oct 19, 2023
2940713
Merge branch 'main' into omri/new_transformers_engine_docs
omri374 Oct 22, 2023
e6df67e
Create spelling.yml
omri374 Oct 22, 2023
e3c31e7
Delete .github/spelling.yml
omri374 Oct 22, 2023
8feadb8
Updates to docs
omri374 Oct 22, 2023
88393d0
Merge branch 'omri/new_transformers_engine_docs' of github.com:micros…
omri374 Oct 22, 2023
b634221
removed "attachments"
omri374 Oct 22, 2023
1d80011
Update batch_processing.ipynb
omri374 Oct 22, 2023
4cc77fe
Update batch_processing.ipynb
omri374 Oct 22, 2023
5ab5c8d
Added line between pip and spacy
omri374 Oct 22, 2023
0cb1d9e
fixed markdown in notebook
omri374 Oct 22, 2023
b8955e3
added line between pip and spacy
omri374 Oct 22, 2023
c371f04
revert notebook change
omri374 Oct 22, 2023
cdf09ad
added a line between pip and spacy
omri374 Oct 22, 2023
98e631e
revert docstring change
omri374 Oct 22, 2023
43 changes: 33 additions & 10 deletions docs/analyzer/customizing_nlp_models.md
@@ -1,11 +1,11 @@
# Customizing the NLP models in Presidio Analyzer

Presidio uses NLP engines for two main tasks: NER based PII identification,
and feature extraction for custom rule based logic (such as leveraging context words for improved detection).
While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy),
it can be customized by leveraging other NLP models, either public or proprietary.
These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models),
[Stanza](https://github.com/stanfordnlp/stanza) and
# Customizing the NLP engine in Presidio Analyzer

Presidio uses NLP engines for two main tasks: NER based PII identification,
and feature extraction for downstream rule based logic (such as leveraging context words for improved detection).
While Presidio comes with an open-source model (the `en_core_web_lg` model from spaCy),
additional NLP models and frameworks could be plugged in, either public or proprietary.
These models can be trained or downloaded from existing NLP frameworks like [spaCy](https://spacy.io/usage/models),
[Stanza](https://github.com/stanfordnlp/stanza) and
[transformers](https://github.com/huggingface/transformers).

In addition, other types of NLP frameworks [can be integrated into Presidio](developing_recognizers.md#machine-learning-ml-based-or-rule-based).
@@ -63,9 +63,30 @@ Configuration can be done in two ways:
-
lang_code: es
model_name: es_core_news_md
ner_model_configuration:
labels_to_ignore:
- O
model_to_presidio_entity_mapping:
PER: PERSON
LOC: LOCATION
ORG: ORGANIZATION
AGE: AGE
ID: ID
DATE: DATE_TIME
low_confidence_score_multiplier: 0.4
low_score_entity_names:
- ID
- ORG
```

The default conf file is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:
The `ner_model_configuration` section contains the following parameters:

- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning.
- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types.
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.
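
As a rough illustration of how these four parameters interact, the sketch below applies them to a list of raw model predictions. It is a simplified stand-in written for this page, not Presidio's implementation: the `apply_ner_configuration` function and the tuple layout are hypothetical. Two behaviours are assumptions: `low_score_entity_names` is matched against the model's own label (as the `ORG` entry in the example above suggests), and labels missing from the mapping are kept unchanged.

```python
from typing import Dict, List, Tuple

# Hypothetical layout for one prediction: (text span, label, score)
Prediction = Tuple[str, str, float]


def apply_ner_configuration(
    predictions: List[Prediction],
    labels_to_ignore: List[str],
    model_to_presidio_entity_mapping: Dict[str, str],
    low_confidence_score_multiplier: float,
    low_score_entity_names: List[str],
) -> List[Prediction]:
    """Filter, rescale and rename raw model predictions (illustrative only)."""
    results = []
    for text, label, score in predictions:
        # 1. Drop labels the user is not interested in, e.g. "O"
        if label in labels_to_ignore:
            continue
        # 2. Reduce the score of labels marked as low confidence
        #    (assumption: matched against the model label, not the mapped one)
        if label in low_score_entity_names:
            score = score * low_confidence_score_multiplier
        # 3. Translate the model label into a Presidio entity type
        #    (assumption: unmapped labels are kept as-is)
        entity = model_to_presidio_entity_mapping.get(label, label)
        results.append((text, entity, score))
    return results


raw = [("Ana", "PER", 0.9), ("the", "O", 0.99), ("Acme", "ORG", 0.5)]
processed = apply_ner_configuration(
    raw,
    labels_to_ignore=["O"],
    model_to_presidio_entity_mapping={"PER": "PERSON", "ORG": "ORGANIZATION"},
    low_confidence_score_multiplier=0.4,
    low_score_entity_names=["ID", "ORG"],
)
print(processed)
```

With these inputs, the `O` prediction is dropped, `PER` becomes `PERSON` with its score untouched, and `ORG` becomes `ORGANIZATION` with its score multiplied by 0.4.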

The [default conf file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/default.yaml) is read during the default initialization of the `AnalyzerEngine`. Alternatively, the path to a custom configuration file can be passed to the `NlpEngineProvider`:

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
@@ -97,12 +118,14 @@ Configuration can be done in two ways:
c. pass requests in each of these languages.

!!! note "Note"
Presidio can currently use one NLP model per language.
Presidio can currently use one NER model per language via the `NlpEngine`. If multiple are required,
consider wrapping NER models as additional recognizers ([see sample here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py)).

## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection

In addition to the built-in spaCy/Stanza/transformers capabilities, it is possible to create new recognizers which serve as interfaces to other models.
For more information:

- [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb).
- [Flair recognizer example](../samples/python/flair_recognizer.py)

38 changes: 15 additions & 23 deletions docs/analyzer/developing_recognizers.md
@@ -7,7 +7,8 @@ Recognizers define the logic for detection, as well as the confidence a predicti

### Accuracy

Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system.
A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research Github repository](https://github.com/microsoft/presidio-research).

!!! note "Note"
@@ -22,7 +23,8 @@ Make sure your recognizer doesn't take too long to process text. Anything above

### Environment

When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint. In addition, make sure the license on the 3rd party dependency allows you to use it for any purpose.
When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies.
In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint.

## Recognizer Types

@@ -32,7 +34,7 @@ Generally speaking, there are three types of recognizers:

A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]`) to detect a "Title" entity.

See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.
See [this documentation](index.md#how-to-add-a-new-recognizer) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.
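
To illustrate the idea, here is a minimal deny-list matcher in plain Python. It does not use `PatternRecognizer`; the `find_deny_list_matches` helper and its return format are hypothetical and only show one way the matching logic could work.

```python
import re
from typing import List, Tuple


def find_deny_list_matches(text: str, deny_list: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every deny-list term found in text."""
    # Try longer terms first so that a longer term is preferred
    # over a shorter term that happens to be its prefix.
    terms = sorted(deny_list, key=len, reverse=True)
    # Escape each term so characters such as "." are matched literally.
    alternation = "|".join(re.escape(term) for term in terms)
    # The lookarounds prevent a term from matching inside a longer word.
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


titles = ["Mr.", "Mrs.", "Ms.", "Dr."]
matches = find_deny_list_matches("Dr. Lee met Mrs. Diaz.", titles)
print(matches)
```

Each match carries character offsets, which is the kind of information a recognizer needs in order to report where an entity was found.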

### Pattern Based

@@ -47,36 +49,26 @@ See some examples here:
### Machine Learning (ML) Based or Rule-Based

Many PII entities are undetectable using naive approaches like deny-lists or regular expressions.
In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. There are four options for adding ML and rule based recognizers:
In these cases, we would wish to utilize a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer.

#### Utilize SpaCy or Stanza
#### ML: Utilize SpaCy, Stanza or Transformers

Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), and [stanza](https://stanfordnlp.github.io/stanza/) as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy` or `stanza` over other tools if possible.
Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analysis and Named Entity Recognition (NER), with [stanza](https://stanfordnlp.github.io/stanza/) and [huggingface transformers](https://huggingface.co/docs/transformers/index) as alternatives. To avoid introducing new tools, it is recommended to first try to use `spaCy`, `stanza` or `transformers` over other tools if possible.
`spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance.
`spaCy` and `stanza` models could be trained from scratch, used in combination with pre-trained embeddings, or retrained to detect new entities.
When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned.

#### Utilize Scikit-learn or Similar

`Scikit-learn` models tend to be fast, but usually have lower accuracy than deep learning methods. However, for well defined problems with well defined features, they can provide very good results.
When integrating such a model into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created.
See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py).

#### Apply Custom Logic

In some cases, rule-based logic provides the best way of detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy`/`stanza` extracted features like lemmas, part of speech, dependencies and more to create your logic. When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.

#### Deep Learning Based Methods

Deep learning methods offer excellent detection rates for NER.
They are however more complex to train, deploy and tend to be slower than traditional approaches.
When creating a DL based method for PII detection, there are two main alternatives for integrating it with Presidio:

1. Create an external endpoint (either local or remote) which is isolated from the `presidio-analyzer` process. On the `presidio-analyzer` side, one would extend the [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) class and implement the network interface between `presidio-analyzer` and the endpoint of the model's container.
2. Integrate the model as an additional [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) within the `presidio-analyzer` flow.
In some cases, rule-based logic provides reasonable ways for detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic.
When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.
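
As a hedged sketch of what such logic could look like, the example below flags a proper noun that follows the lemmas "name" and "be" (as in "My name is David"). The `Token` dataclass is a hypothetical stand-in for the per-token features an NLP engine provides; it is not a Presidio or spaCy class, and the token values are written by hand rather than produced by a model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Hypothetical container for features extracted by an NLP model."""
    text: str
    lemma: str
    pos: str
    start: int  # character offset of the token in the original text


def find_names_after_intro(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for proper nouns preceded by 'name' + 'be'."""
    results = []
    for i in range(2, len(tokens)):
        prev2, prev1, current = tokens[i - 2], tokens[i - 1], tokens[i]
        # Lemmas make the rule robust to inflection ("is", "was", "names")
        if prev2.lemma == "name" and prev1.lemma == "be" and current.pos == "PROPN":
            end = current.start + len(current.text)
            results.append((current.start, end, current.text))
    return results


# Hand-written features for the sentence "My name is David"
tokens = [
    Token("My", "my", "PRON", 0),
    Token("name", "name", "NOUN", 3),
    Token("is", "be", "AUX", 8),
    Token("David", "David", "PROPN", 11),
]
names = find_names_after_intro(tokens)
print(names)
```

In an actual recognizer, the same rule would read these features from the artifacts passed in by the NLP engine instead of a hand-built list.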

!!! attention "Considerations for selecting one option over another"

- Accuracy.
- Ease of integration.
- Runtime considerations (For example if the new model requires a GPU).
- 3rd party dependencies of the new model vs. the existing `presidio-analyzer` package.
37 changes: 1 addition & 36 deletions docs/analyzer/index.md
@@ -14,42 +14,7 @@ Named Entity Recognition and other types of logic to detect PII in unstructured

## Installation

=== "Using pip"

!!! note "Note"
Consider installing the Presidio python packages on a virtual environment like venv or conda.

To get started with Presidio-analyzer,
download the package and the `en_core_web_lg` spaCy model:

```sh
pip install presidio-analyzer
python -m spacy download en_core_web_lg
```

=== "Using Docker"

!!! note "Note"
This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).

```sh
# Download image from Dockerhub
docker pull mcr.microsoft.com/presidio-analyzer

# Run the container with the default port
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
```

=== "From source"

First, clone the Presidio repo. [See here for instructions](../installation.md#install-from-source).

Then, build the presidio-analyzer container:

```sh
cd presidio-analyzer
docker build . -t presidio/presidio-analyzer
```
See [Installing Presidio](../installation.md).

## Getting started

3 changes: 2 additions & 1 deletion docs/analyzer/languages.md
@@ -64,6 +64,7 @@ analyzer = AnalyzerEngine(

analyzer.analyze(text="My name is David", language="en")
```
Link to LANGUAGES_CONFIG_FILE=[languages-config.yml](https://github.com/microsoft/presidio/blob/main/docs/analyzer/languages-config.yml)

### Automatically install NLP models into the Docker container

@@ -73,4 +74,4 @@ update the [conf/default.yaml](https://github.com/microsoft/presidio/blob/main/p
the `docker build` phase and the models defined in it are installed automatically.

For `transformers` based models, the configuration [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/transformers.yaml).
In addition, make sure the Docker file contains the relevant packages for `transformers`, which are not loaded automatically with Presidio.
A docker file supporting transformers models [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/Dockerfile.transformers).
17 changes: 16 additions & 1 deletion docs/analyzer/nlp_engines/spacy_stanza.md
@@ -30,11 +30,26 @@ For the available models, follow these links: [spaCy](https://spacy.io/usage/mod
!!! tip "Tip"
For Person, Location and Organization detection, it could be useful to try out the transformers based models (e.g. `en_core_web_trf`), which use a more modern deep-learning architecture but are generally slower than the default `en_core_web_lg` model.


### Configure Presidio to use the pre-trained model

Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information.

## How NER results flow within Presidio
This diagram describes the flow of NER results within Presidio, and the relationship between the `SpacyNlpEngine` component and the `SpacyRecognizer` component:
```mermaid
sequenceDiagram
AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
SpacyNlpEngine->>spaCy: Call spaCy pipeline
spaCy->>SpacyNlpEngine: return entities and other attributes
Note over SpacyNlpEngine: Map entity names to Presidio's, <BR>update scores, <BR>remove unwanted entities <BR> based on NerModelConfiguration
SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens, scores etc.)
Note over AnalyzerEngine: Call all recognizers
AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]

```
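
The same flow can be approximated in a few lines of plain Python. All names below (`ToyArtifacts`, `fake_model`, `toy_nlp_engine`, `toy_recognizer`, `toy_analyzer`) are hypothetical stand-ins for `NlpArtifacts`, the spaCy pipeline, `SpacyNlpEngine`, `SpacyRecognizer` and `AnalyzerEngine`. The sketch only mirrors the order of the steps in the diagram and uses a trivial capitalization rule in place of a real model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Entity = Tuple[str, str, float]  # (text, label, score)


@dataclass
class ToyArtifacts:
    """Stand-in for the artifacts the NLP engine hands to recognizers."""
    tokens: List[str]
    entities: List[Entity]


def fake_model(text: str) -> Tuple[List[str], List[Entity]]:
    # Pretend NER: every capitalized word after the first one is a "PER"
    tokens = text.split()
    entities = [(tok, "PER", 0.85) for i, tok in enumerate(tokens) if i > 0 and tok[0].isupper()]
    return tokens, entities


def toy_nlp_engine(text: str, mapping: Dict[str, str]) -> ToyArtifacts:
    # Call the model, then map its labels to Presidio entity names
    tokens, raw_entities = fake_model(text)
    mapped = [(t, mapping[label], score) for t, label, score in raw_entities if label in mapping]
    return ToyArtifacts(tokens=tokens, entities=mapped)


def toy_recognizer(artifacts: ToyArtifacts, supported_entities: List[str]) -> List[Entity]:
    # Extract only the entities this recognizer is responsible for
    return [ent for ent in artifacts.entities if ent[1] in supported_entities]


def toy_analyzer(text: str) -> List[Entity]:
    # Process the text once, then pass the artifacts to the recognizer
    artifacts = toy_nlp_engine(text, mapping={"PER": "PERSON"})
    return toy_recognizer(artifacts, supported_entities=["PERSON"])


results = toy_analyzer("My name is David")
print(results)
```

The point of the split is that the model runs once per text in the engine, while the recognizer only reads the precomputed artifacts.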

## Training your own model

!!! note "Note"