# Annotating PII in a PDF

This sample takes a PDF as input, extracts the text, identifies PII using Presidio, and marks each finding with a highlight annotation.

## Prerequisites

Before getting started, make sure the following packages are installed. For detailed documentation, see the [installation docs](https://microsoft.github.io/presidio/installation).

Install from PyPI:

```bash
pip install presidio_analyzer
pip install presidio_anonymizer
python -m spacy download en_core_web_lg
pip install pdfminer.six
pip install pikepdf
```
```python
# For Presidio
from presidio_analyzer import AnalyzerEngine

# For extracting text and character positions
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTTextLine

# For updating the PDF
from pikepdf import Pdf, Name, Dictionary
```
## Analyze the text in the PDF

To extract the text from the PDF, we use the pdfminer.six library. We extract the text at the text container level, which is roughly equivalent to a paragraph.

We then use the Presidio Analyzer to identify the PII and its location in the text.

The Presidio Analyzer ships with predefined entity recognizers and also lets you create custom ones; a minimal sketch of a custom recognizer follows the analysis code below.
```python
analyzer = AnalyzerEngine()

analyzed_character_sets = []

for page_layout in extract_pages("./sample_data/sample.pdf"):
    for text_container in page_layout:
        if isinstance(text_container, LTTextContainer):

            # The element is an LTTextContainer, holding a paragraph of text.
            text_to_anonymize = text_container.get_text()

            # Analyze the text using the analyzer engine.
            analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')

            if not text_to_anonymize.isspace():
                print(text_to_anonymize)
                print(analyzer_results)

                # Grab the individual characters from the PDF,
                # keeping their bounding boxes.
                characters = []
                for text_line in filter(lambda t: isinstance(t, LTTextLine), text_container):
                    for character in filter(lambda t: isinstance(t, LTChar), text_line):
                        characters.append(character)

                # Slice out the characters that match the analyzer results.
                for result in analyzer_results:
                    analyzed_character_sets.append(
                        {"characters": characters[result.start:result.end], "result": result}
                    )
```
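The analysis above relies only on Presidio's predefined recognizers. If your documents contain domain-specific terms, you can register a custom recognizer on the same `analyzer` before calling `analyze`. The sketch below uses a deny-list `PatternRecognizer`; the `TITLE` entity and the word list are illustrative, not part of the original sample.

```python
from presidio_analyzer import PatternRecognizer

# Illustrative only: a deny-list recognizer for honorific titles.
# The "TITLE" entity name and word list are made up for this sketch.
titles_recognizer = PatternRecognizer(
    supported_entity="TITLE",
    deny_list=["Mr.", "Mrs.", "Ms.", "Dr."],
)

# Register it; later analyzer.analyze() calls will also return TITLE results.
analyzer.registry.add_recognizer(titles_recognizer)
```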
## Create phrase bounding boxes

The next task is to take the character data and inflate it into full phrase bounding boxes.

For example, for an email address, we turn the bounding box of each character in the address into one single bounding box covering the whole phrase.
```python
# Combine two bounding boxes into a single bounding box.
def combine_rect(rectA, rectB):
    a, b = rectA, rectB
    startX = min(a[0], b[0])
    startY = min(a[1], b[1])
    endX = max(a[2], b[2])
    endY = max(a[3], b[3])
    return (startX, startY, endX, endY)

analyzed_bounding_boxes = []

# For each character set, combine the character bounding boxes into a single box.
for analyzed_character_set in analyzed_character_sets:
    completeBoundingBox = analyzed_character_set["characters"][0].bbox

    for character in analyzed_character_set["characters"]:
        completeBoundingBox = combine_rect(completeBoundingBox, character.bbox)

    analyzed_bounding_boxes.append(
        {"boundingBox": completeBoundingBox, "result": analyzed_character_set["result"]}
    )
```
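As a quick sanity check of `combine_rect`, here it is applied to two made-up character boxes. pdfminer bounding boxes are `(x0, y0, x1, y1)` tuples in points, with the origin at the bottom-left of the page.

```python
# Two illustrative character bboxes from the same text line.
first_char = (100.0, 700.0, 106.0, 712.0)
last_char = (160.0, 700.0, 166.0, 712.0)

# The combined box spans from the leftmost to the rightmost edge.
print(combine_rect(first_char, last_char))  # (100.0, 700.0, 166.0, 712.0)
```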
## Add highlight annotations

Finally, we iterate through the analyzed bounding boxes and create a highlight annotation for each of them.
```python
pdf = Pdf.open("./sample_data/sample.pdf")

annotations = []

# Create a highlight annotation for each bounding box.
for analyzed_bounding_box in analyzed_bounding_boxes:

    boundingBox = analyzed_bounding_box["boundingBox"]

    # Create the annotation.
    # We could also create a redaction annotation if the downstream
    # workflow supports them (a sketch follows this cell).
    # QuadPoints lists the four corners of the highlighted region:
    # top-left, top-right, bottom-left, bottom-right.
    highlight = Dictionary(
        Type=Name.Annot,
        Subtype=Name.Highlight,
        QuadPoints=[boundingBox[0], boundingBox[3],
                    boundingBox[2], boundingBox[3],
                    boundingBox[0], boundingBox[1],
                    boundingBox[2], boundingBox[1]],
        Rect=[boundingBox[0], boundingBox[1], boundingBox[2], boundingBox[3]],
        C=[1, 0, 0],   # highlight color (red)
        CA=0.5,        # opacity
        T=analyzed_bounding_box["result"].entity_type,  # title: the entity type
    )

    annotations.append(highlight)

# Add the annotations to the PDF (the sample PDF has a single page).
pdf.pages[0].Annots = pdf.make_indirect(annotations)

# And save.
pdf.save("./sample_data/sample_annotated.pdf")
```
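The cell above mentions redaction annotations as an alternative. A minimal sketch of one is shown below; note that a `/Redact` annotation only marks the region for redaction, and the underlying text is removed only when a PDF tool applies the redaction.

```python
# Sketch only: a redaction annotation for the last boundingBox from
# the loop above. /Redact marks the region; applying the redaction
# (actually removing the text) is a separate step in a PDF editor.
redact = Dictionary(
    Type=Name.Annot,
    Subtype=Name.Redact,
    QuadPoints=[boundingBox[0], boundingBox[3],
                boundingBox[2], boundingBox[3],
                boundingBox[0], boundingBox[1],
                boundingBox[2], boundingBox[1]],
    Rect=[boundingBox[0], boundingBox[1], boundingBox[2], boundingBox[3]],
    IC=[0, 0, 0],  # fill color used once the redaction is applied
)
```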
## Result

The samples above produce a new PDF containing the original content, with text highlight annotations wherever PII was found.

Each highlight annotation carries the name of the entity that was found.
"## Note\n", | ||
"\n", | ||
"Before relying on this methodology to detect or markup PII from a PDF, please be aware of the following:\n", | ||
"\n", | ||
"### Text extraction\n", | ||
"\n", | ||
"We purposely use a different library specifically for extracting text from the PDF. This is because text extraction is hard to get right, and it's worth using a library specifically developed with the purpose in mind.\n", | ||
"\n", | ||
"For more details, see:\n", | ||
"\n", | ||
"[https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html](https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html)\n", | ||
"\n", | ||
"That said, even with a purpose built library, there may be occasions where PII is present and visible in a PDF, but it is not detected by the sample code.\n", | ||
"\n", | ||
"This includes, but is not limited to:\n", | ||
"\n", | ||
"- Text that cannot be reliable extracted to be analyzed. (e.g. incorrect spacing, wrong reading order)\n", | ||
"- Text present in previous iterations of the PDF which is hidden from text extraction. (See incremental editing)\n", | ||
"- Text present in images. (requires OCRing)\n", | ||
"- Text present in annotations." | ||
] | ||