# Annotating PII in a PDF

This sample takes a PDF as input, extracts the text, identifies PII using Presidio, and marks each finding with a highlight annotation.

## Prerequisites

Before getting started, make sure the following packages are installed. For detailed documentation, see the [installation docs](https://microsoft.github.io/presidio/installation).

Install from PyPI:

```bash
pip install presidio_analyzer
pip install presidio_anonymizer
python -m spacy download en_core_web_lg
pip install pdfminer.six
pip install pikepdf
```
```python
# For Presidio
from presidio_analyzer import AnalyzerEngine

# For extracting text and character positions
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTTextLine

# For updating the PDF
from pikepdf import Pdf, Name, Dictionary
```
## Analyze the text in the PDF

To extract the text from the PDF, we use the pdfminer.six library. We extract the text at the text container level, which is roughly equivalent to a paragraph.

We then use the Presidio Analyzer to identify the PII and its location in the text.

The Presidio Analyzer ships with predefined entity recognizers and also lets you create custom ones; a minimal sketch of a custom recognizer follows the analysis code below.
```python
analyzer = AnalyzerEngine()

analyzed_character_sets = []

for page_layout in extract_pages("./sample_data/sample.pdf"):
    for text_container in page_layout:
        if isinstance(text_container, LTTextContainer):

            # The element is an LTTextContainer, holding a paragraph of text.
            text_to_anonymize = text_container.get_text()

            # Analyze the text using the analyzer engine.
            analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')

            if not text_to_anonymize.isspace():
                print(text_to_anonymize)
                print(analyzer_results)

                # Grab the individual characters from the PDF,
                # keeping their bounding boxes.
                characters = []
                for text_line in filter(lambda t: isinstance(t, LTTextLine), text_container):
                    for character in filter(lambda t: isinstance(t, LTChar), text_line):
                        characters.append(character)

                # Slice out the characters that match the analyzer results.
                for result in analyzer_results:
                    analyzed_character_sets.append(
                        {"characters": characters[result.start:result.end], "result": result}
                    )
```
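The analysis above relies only on Presidio's predefined recognizers. If your documents contain domain-specific terms, you can register a custom recognizer on the same `analyzer` before calling `analyze`. The sketch below uses a deny-list `PatternRecognizer`; the `TITLE` entity and the word list are illustrative, not part of the original sample.

```python
from presidio_analyzer import PatternRecognizer

# Illustrative only: a deny-list recognizer for honorific titles.
# The "TITLE" entity name and word list are made up for this sketch.
titles_recognizer = PatternRecognizer(
    supported_entity="TITLE",
    deny_list=["Mr.", "Mrs.", "Ms.", "Dr."],
)

# Register it; later analyzer.analyze() calls will also return TITLE results.
analyzer.registry.add_recognizer(titles_recognizer)
```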
## Create phrase bounding boxes

The next task is to take the character data and inflate it into full phrase bounding boxes.

For example, for an email address, we turn the bounding box of each character in the address into one single bounding box covering the whole phrase.
```python
# Combine two bounding boxes into a single bounding box.
def combine_rect(rectA, rectB):
    a, b = rectA, rectB
    startX = min(a[0], b[0])
    startY = min(a[1], b[1])
    endX = max(a[2], b[2])
    endY = max(a[3], b[3])
    return (startX, startY, endX, endY)

analyzed_bounding_boxes = []

# For each character set, combine the character bounding boxes into a single box.
for analyzed_character_set in analyzed_character_sets:
    completeBoundingBox = analyzed_character_set["characters"][0].bbox

    for character in analyzed_character_set["characters"]:
        completeBoundingBox = combine_rect(completeBoundingBox, character.bbox)

    analyzed_bounding_boxes.append(
        {"boundingBox": completeBoundingBox, "result": analyzed_character_set["result"]}
    )
```
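As a quick sanity check of `combine_rect`, here it is applied to two made-up character boxes. pdfminer bounding boxes are `(x0, y0, x1, y1)` tuples in points, with the origin at the bottom-left of the page.

```python
# Two illustrative character bboxes from the same text line.
first_char = (100.0, 700.0, 106.0, 712.0)
last_char = (160.0, 700.0, 166.0, 712.0)

# The combined box spans from the leftmost to the rightmost edge.
print(combine_rect(first_char, last_char))  # (100.0, 700.0, 166.0, 712.0)
```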
## Add highlight annotations

Finally, we iterate through the analyzed bounding boxes and create a highlight annotation for each of them.
```python
pdf = Pdf.open("./sample_data/sample.pdf")

annotations = []

# Create a highlight annotation for each bounding box.
for analyzed_bounding_box in analyzed_bounding_boxes:

    boundingBox = analyzed_bounding_box["boundingBox"]

    # Create the annotation.
    # We could also create a redaction annotation if the downstream
    # workflow supports them (a sketch follows this cell).
    # QuadPoints lists the four corners of the highlighted region:
    # top-left, top-right, bottom-left, bottom-right.
    highlight = Dictionary(
        Type=Name.Annot,
        Subtype=Name.Highlight,
        QuadPoints=[boundingBox[0], boundingBox[3],
                    boundingBox[2], boundingBox[3],
                    boundingBox[0], boundingBox[1],
                    boundingBox[2], boundingBox[1]],
        Rect=[boundingBox[0], boundingBox[1], boundingBox[2], boundingBox[3]],
        C=[1, 0, 0],   # highlight color (red)
        CA=0.5,        # opacity
        T=analyzed_bounding_box["result"].entity_type,  # title: the entity type
    )

    annotations.append(highlight)

# Add the annotations to the PDF (the sample PDF has a single page).
pdf.pages[0].Annots = pdf.make_indirect(annotations)

# And save.
pdf.save("./sample_data/sample_annotated.pdf")
```
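The cell above mentions redaction annotations as an alternative. A minimal sketch of one is shown below; note that a `/Redact` annotation only marks the region for redaction, and the underlying text is removed only when a PDF tool applies the redaction.

```python
# Sketch only: a redaction annotation for the last boundingBox from
# the loop above. /Redact marks the region; applying the redaction
# (actually removing the text) is a separate step in a PDF editor.
redact = Dictionary(
    Type=Name.Annot,
    Subtype=Name.Redact,
    QuadPoints=[boundingBox[0], boundingBox[3],
                boundingBox[2], boundingBox[3],
                boundingBox[0], boundingBox[1],
                boundingBox[2], boundingBox[1]],
    Rect=[boundingBox[0], boundingBox[1], boundingBox[2], boundingBox[3]],
    IC=[0, 0, 0],  # fill color used once the redaction is applied
)
```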
## Result

The samples above produce a new PDF containing the original content, with text highlight annotations wherever PII was found.

Each highlight annotation carries the name of the entity that was found.
"## Note\n", | ||
"\n", | ||
"Before relying on this methodology to detect or markup PII from a PDF, please be aware of the following:\n", | ||
"\n", | ||
"### Text extraction\n", | ||
"\n", | ||
"We purposely use a different library specifically for extracting text from the PDF. This is because text extraction is hard to get right, and it's worth using a library specifically developed with the purpose in mind.\n", | ||
"\n", | ||
"For more details, see:\n", | ||
"\n", | ||
"[https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html](https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html)\n", | ||
"\n", | ||
"That said, even with a purpose built library, there may be occasions where PII is present and visible in a PDF, but it is not detected by the sample code.\n", | ||
"\n", | ||
"This includes, but is not limited to:\n", | ||
"\n", | ||
"- Text that cannot be reliable extracted to be analyzed. (e.g. incorrect spacing, wrong reading order)\n", | ||
"- Text present in previous iterations of the PDF which is hidden from text extraction. (See incremental editing)\n", | ||
"- Text present in images. (requires OCRing)\n", | ||
"- Text present in annotations." | ||
] | ||