This is a dockerized Flask app for anonymizing Personally Identifiable Information (PII) in text, such as person names, phone numbers, and credit card numbers. The docker image can be deployed either on premises or in the cloud (this repository contains example scripts for deploying to AWS).
The app uses the Presidio library to detect and anonymize PII. The supported entities can be found here. Currently, the app only supports PII anonymization for texts in English and Norwegian.
Run start_up.sh from the root directory, optionally passing two arguments: the image name and the port number. For example:
./start_up.sh pii-anonymizer 8989
This script does two things:
- Builds the docker image with the provided name. If no name is provided, the default is pii-anonymizer.
- Serves the docker app, listening on the provided port on the host machine. The default port is 8989.
Subsequently, you can invoke the API by running:
scripts/predict.sh data/input.json
The API expects text data in JSON format, as follows:
{"input":
[
{"text" : "Hello Paulo Santos. The latest statement for your credit card account 1111-0000-1111-0000 was mailed to 123 Any Street, Seattle, WA 98109.", "lang": "en"},
{"text" : "My phone number is 212-555-5555", "lang": "en"},
{"text": "Hello this is Jamie Clark calling", "lang": "en"}
],
"mode": "tagged_text"
}
The "lang" field specifies the language of the text. The currently supported values are "en", "no", and "unknown". Specifying the language saves time for the anonymizer, since it then does not need to load the language detection module and run the detector.
The "mode" field accepts the following values:
"tagged_text" returns the text with each PII entity masked by a tag such as <PERSON> or <PHONE_NUMBER>. For example:
{
"output": {
"output": [
{
"tagged_text": "Hello <PERSON>. The latest statement for your credit card account <CREDIT_CARD> was mailed to 123 Any Street, <LOCATION>, WA 98109."
},
{
"tagged_text": "My phone number is <PHONE_NUMBER>"
},
{
"tagged_text": "Hello this is <PERSON> calling"
}
]
}
}
"detailed_info" returns a detailed result for each detected PII entity, containing the start index, end index, entity type, score, and the entity text itself. For example:
{
"output": {
"output": [
{
"detailed_info": [
{
"entity_type": "PERSON",
"start": 6,
"end": 18,
"score": 0.85,
"entity": "Paulo Santos"
},
{
"entity_type": "LOCATION",
"start": 120,
"end": 127,
"score": 0.85,
"entity": "Seattle"
}
]
},
{
"detailed_info": [
{
"entity_type": "PHONE_NUMBER",
"start": 19,
"end": 31,
"score": 0.75,
"entity": "212-555-5555"
}
]
},
{
"detailed_info": [
{
"entity_type": "PERSON",
"start": 14,
"end": 25,
"score": 0.85,
"entity": "Jamie Clark"
}
]
}
]
}
}
- Push the image to ECR by running the following command from the /container directory:
./build_and_push.sh