in pdf document unnecessary redact or masking is coming #1355

Open
BhargaviRM opened this issue Apr 10, 2024 · 4 comments

Comments

@BhargaviRM

BhargaviRM commented Apr 10, 2024

Describe the bug
Unnecessary redaction (masking) appears in the output PDF document. The script used is below:

from presidio_analyzer import AnalyzerEngine
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTTextLine
from pikepdf import Pdf, Dictionary, Name

analyzer = AnalyzerEngine()

analyzed_bounding_boxes = []

for page_layout in extract_pages(r"input.pdf"):
    for text_container in page_layout:
        if isinstance(text_container, LTTextContainer):
            text_to_analyze = text_container.get_text()
            analyzer_results = analyzer.analyze(text=text_to_analyze, language='en')
            characters = [char for line in text_container if isinstance(line, LTTextLine) for char in line if isinstance(char, LTChar)]
            analyzed_bounding_boxes.extend({"boundingBox": char.bbox, "result": result} for result in analyzer_results for char in characters[result.start:result.end])

pdf = Pdf.open(r"input.pdf")
annotations = []

for analyzed_bounding_box in analyzed_bounding_boxes:
    bounding_box = analyzed_bounding_box["boundingBox"]
    annotation = Dictionary(
        Type=Name.Annot,
        Subtype=Name.Highlight,
        QuadPoints=[bounding_box[0], bounding_box[3], bounding_box[2], bounding_box[3], bounding_box[0], bounding_box[1], bounding_box[2], bounding_box[1]],
        Rect=[bounding_box[0], bounding_box[1], bounding_box[2], bounding_box[3]],
        C=[0, 0, 0],
        CA=0.5,
        T=analyzed_bounding_box["result"].entity_type
    )
    annotations.append(annotation)

for page_num, page in enumerate(pdf.pages):
    page.Annots = pdf.make_indirect(annotations)
pdf.save("result.pdf")

(screenshot attached in the original issue)

In this screenshot, the lower part of the page contains no sensitive information, yet it is still being masked unnecessarily. Please help with this.
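One thing that may contribute, judging only from the script above: `get_text()` on a pdfminer text container also returns layout-inserted characters (`LTAnno` objects such as spaces and line breaks), while the `characters` list keeps only `LTChar` objects. The offsets in the analyzer results index into the text, so they can drift relative to `characters` after every inserted character. The following is a minimal sketch of an aligned lookup. It uses plain `(text, bbox)` pairs as a stand-in for pdfminer objects, and `build_offset_boxes` and `boxes_for_span` are hypothetical helper names, not part of pdfminer or Presidio.

```python
from typing import List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]
# One glyph: its text and its bounding box (None for layout-inserted characters).
Glyph = Tuple[str, Optional[BBox]]


def build_offset_boxes(glyphs: Sequence[Glyph]) -> Tuple[str, List[Optional[BBox]]]:
    """Return the joined text and a list with exactly one box slot per text character."""
    parts: List[str] = []
    boxes: List[Optional[BBox]] = []
    for text, bbox in glyphs:
        parts.append(text)
        # A glyph may yield more than one character; repeat its box for each one.
        boxes.extend([bbox] * len(text))
    return "".join(parts), boxes


def boxes_for_span(boxes: Sequence[Optional[BBox]], start: int, end: int) -> List[BBox]:
    """Boxes of the characters in text[start:end], skipping characters without a box."""
    return [box for box in boxes[start:end] if box is not None]


# Assumed wiring with pdfminer (not executed here):
#   glyphs = [(obj.get_text(), obj.bbox if isinstance(obj, LTChar) else None)
#             for line in text_container if isinstance(line, LTTextLine)
#             for obj in line]
glyphs = [
    ("A", (0.0, 0.0, 5.0, 10.0)),
    (" ", None),                    # inserted space, no box
    ("B", (10.0, 0.0, 15.0, 10.0)),
    ("\n", None),                   # inserted line break, no box
    ("C", (0.0, -12.0, 5.0, -2.0)),
]
text, boxes = build_offset_boxes(glyphs)
span_boxes = boxes_for_span(boxes, 2, 3)  # the span covering "B"
```

With the box-only list from the original script, the same span `2:3` would select the box of "C" instead, because the two characters without boxes are missing from that list.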

@omri374
Contributor

omri374 commented Apr 12, 2024

Can you please print the output of analyzed_bounding_boxes?

@omri374
Contributor

omri374 commented Apr 12, 2024

Also, is this a scanned pdf or a digital one?

@BhargaviRM
Author

Printing the bounding boxes gives the result below. This PDF is a digital one.

[{'boundingBox': (144.576, 269.57, 150.66, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (150.576, 269.57, 156.66, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (156.66, 269.57, 161.292, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (161.364, 269.57, 167.448, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (167.364, 269.57, 173.448, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (173.448, 269.57, 178.08, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (178.15200000000002, 269.57, 184.23600000000002, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (184.15200000000002, 269.57, 190.23600000000002, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (190.15200000000002, 269.57, 196.23600000000002, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (196.23600000000002, 269.57, 202.32000000000002, 281.57), 'result': type: DATE_TIME, start: 16, end: 26, score: 0.95}, {'boundingBox': (193.044, 245.81, 199.12800000000001, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (199.12800000000001, 245.81, 205.21200000000002, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (205.27200000000002, 245.81, 211.35600000000002, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (211.49, 245.81, 215.162, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (215.09, 245.81, 221.174, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (221.21, 245.81, 227.294, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (227.33, 245.81, 231.002, 257.81), 'result': type: 
US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (230.93, 245.81, 237.014, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (237.014, 245.81, 243.098, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (243.15800000000002, 245.81, 249.24200000000002, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (249.15800000000002, 245.81, 255.24200000000002, 257.81), 'result': type: US_SSN, start: 25, end: 36, score: 0.85}, {'boundingBox': (204.42, 198.14, 210.81599999999997, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (210.77999999999997, 198.14, 216.52799999999996, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (216.52799999999996, 198.14, 219.28799999999995, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (219.28799999999995, 198.14, 222.04799999999994, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (222.04799999999994, 198.14, 225.70799999999994, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (225.41999999999993, 198.14, 231.74399999999994, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (231.74399999999994, 198.14, 235.93199999999993, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (235.96799999999993, 198.14, 242.26799999999994, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (242.31599999999995, 198.14, 245.07599999999994, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (245.07599999999994, 198.14, 250.82399999999993, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (250.82399999999993, 198.14, 253.53599999999992, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, 
{'boundingBox': (253.58399999999992, 198.14, 259.9799999999999, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (259.9439999999999, 198.14, 262.7039999999999, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (262.7039999999999, 198.14, 266.7239999999999, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (266.6639999999999, 198.14, 272.09999999999985, 210.14), 'result': type: LOCATION, start: 27, end: 42, score: 0.85}, {'boundingBox': (204.42, 198.14, 210.81599999999997, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (210.77999999999997, 198.14, 216.52799999999996, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (216.52799999999996, 198.14, 219.28799999999995, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (219.28799999999995, 198.14, 222.04799999999994, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (222.04799999999994, 198.14, 225.70799999999994, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (225.41999999999993, 198.14, 231.74399999999994, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (231.74399999999994, 198.14, 235.93199999999993, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (235.96799999999993, 198.14, 242.26799999999994, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (242.31599999999995, 198.14, 245.07599999999994, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (245.07599999999994, 198.14, 250.82399999999993, 210.14), 'result': type: IN_PAN, start: 27, end: 37, score: 0.05}, {'boundingBox': (155.84400000000002, 174.26, 158.71200000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': 
(158.71200000000002, 174.26, 164.46, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (164.36400000000003, 174.26, 170.66400000000004, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (170.71200000000002, 174.26, 176.68800000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (176.68800000000002, 174.26, 179.71200000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (179.71200000000002, 174.26, 184.40400000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (184.40400000000002, 174.26, 193.99200000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (193.872, 174.26, 196.632, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (196.632, 174.26, 200.65200000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (200.592, 174.26, 206.89200000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (206.94000000000003, 174.26, 217.66800000000003, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (217.62000000000003, 174.26, 223.59600000000003, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (223.38000000000002, 174.26, 228.57600000000002, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (228.42000000000002, 174.26, 234.168, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (234.168, 174.26, 243.756, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (243.756, 174.26, 250.056, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (250.12800000000001, 174.26, 252.888, 
186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (252.888, 174.26, 258.86400000000003, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (258.86400000000003, 174.26, 261.88800000000003, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (261.88800000000003, 174.26, 266.96400000000006, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (266.808, 174.26, 273.132, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (273.132, 174.26, 282.72, 186.26), 'result': type: EMAIL_ADDRESS, start: 17, end: 39, score: 1.0}, {'boundingBox': (155.84400000000002, 174.26, 158.71200000000002, 186.26), 'result': type: URL, start: 17, end: 24, score: 0.5}, {'boundingBox': (158.71200000000002, 174.26, 164.46, 186.26), 'result': type: URL, start: 17, end: 24, score: 0.5}, {'boundingBox': (164.36400000000003, 174.26, 170.66400000000004, 186.26), 'result': type: URL, start: 17, end: 24, score: 0.5}, {'boundingBox': (170.71200000000002, 174.26, 176.68800000000002, 186.26), 'result': type: URL, start: 17, end: 24, score: 0.5}, {'boundingBox': (176.68800000000002, 174.26, 179.71200000000002, 186.26), 'result': type: URL, start: 17, end: 24, score: 0.5}, {'boundingBox': (179.71200000000002, 174.26, 184.40400000000002, 186.26), 'result': type: URL, start: 17, end: 24, score: 0.5}, {'boundingBox': (184.40400000000002, 174.26, 193.99200000000002, 186.26), 'result': type: URL, start: 17, end: 24, score: 0.5}, {'boundingBox': (217.62000000000003, 174.26, 223.59600000000003, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (223.38000000000002, 174.26, 228.57600000000002, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (228.42000000000002, 174.26, 234.168, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (234.168, 174.26, 
243.756, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (243.756, 174.26, 250.056, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (250.12800000000001, 174.26, 252.888, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (252.888, 174.26, 258.86400000000003, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (258.86400000000003, 174.26, 261.88800000000003, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (261.88800000000003, 174.26, 266.96400000000006, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (266.808, 174.26, 273.132, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (273.132, 174.26, 282.72, 186.26), 'result': type: URL, start: 28, end: 39, score: 0.5}, {'boundingBox': (154.99200000000002, 150.5, 158.62800000000001, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (158.62800000000001, 150.5, 164.71200000000002, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (164.592, 150.5, 170.67600000000002, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (170.67600000000002, 150.5, 176.76000000000002, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (176.82000000000002, 150.5, 180.45600000000002, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (180.45600000000002, 150.5, 183.168, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (183.168, 150.5, 189.252, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (189.18, 150.5, 195.264, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (195.264, 150.5, 201.348, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, 
score: 0.75}, {'boundingBox': (201.53, 150.5, 205.202, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (205.25, 150.5, 211.334, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (211.25, 150.5, 217.334, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (217.334, 150.5, 223.418, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (223.478, 150.5, 229.562, 162.5), 'result': type: PHONE_NUMBER, start: 15, end: 29, score: 0.75}, {'boundingBox': (163.644, 79.104, 169.728, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (169.728, 79.104, 175.812, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (175.872, 79.104, 181.95600000000002, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (181.872, 79.104, 187.95600000000002, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (187.95600000000002, 79.104, 194.04000000000002, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (193.99200000000002, 79.104, 200.07600000000002, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (200.07600000000002, 79.104, 206.16000000000003, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (206.22000000000003, 79.104, 212.30400000000003, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (212.30400000000003, 79.104, 218.38800000000003, 91.104), 'result': type: DATE_TIME, start: 17, end: 26, score: 0.85}, {'boundingBox': (163.644, 79.104, 169.728, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (169.728, 79.104, 175.812, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (175.872, 79.104, 
181.95600000000002, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (181.872, 79.104, 187.95600000000002, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (187.95600000000002, 79.104, 194.04000000000002, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (193.99200000000002, 79.104, 200.07600000000002, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (200.07600000000002, 79.104, 206.16000000000003, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (206.22000000000003, 79.104, 212.30400000000003, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (212.30400000000003, 79.104, 218.38800000000003, 91.104), 'result': type: US_BANK_NUMBER, start: 17, end: 26, score: 0.4}, {'boundingBox': (163.644, 79.104, 169.728, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (169.728, 79.104, 175.812, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (175.872, 79.104, 181.95600000000002, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (181.872, 79.104, 187.95600000000002, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (187.95600000000002, 79.104, 194.04000000000002, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (193.99200000000002, 79.104, 200.07600000000002, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (200.07600000000002, 79.104, 206.16000000000003, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (206.22000000000003, 79.104, 212.30400000000003, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (212.30400000000003, 79.104, 
218.38800000000003, 91.104), 'result': type: US_PASSPORT, start: 17, end: 26, score: 0.05}, {'boundingBox': (163.644, 79.104, 169.728, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (169.728, 79.104, 175.812, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (175.872, 79.104, 181.95600000000002, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (181.872, 79.104, 187.95600000000002, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (187.95600000000002, 79.104, 194.04000000000002, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (193.99200000000002, 79.104, 200.07600000000002, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (200.07600000000002, 79.104, 206.16000000000003, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (206.22000000000003, 79.104, 212.30400000000003, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (212.30400000000003, 79.104, 218.38800000000003, 91.104), 'result': type: US_DRIVER_LICENSE, start: 17, end: 26, score: 0.01}, {'boundingBox': (107.02800000000002, 681.82, 112.88400000000001, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (112.88400000000001, 681.82, 122.47200000000001, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (122.47200000000001, 681.82, 125.23200000000001, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (125.23200000000001, 681.82, 127.99200000000002, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (127.99200000000002, 681.82, 133.42800000000003, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (133.428, 
681.82, 136.14, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (136.17600000000002, 681.82, 140.00400000000002, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (140.00400000000002, 681.82, 146.32800000000003, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (146.25600000000003, 681.82, 152.55600000000004, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (152.604, 681.82, 158.90400000000002, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (158.952, 681.82, 163.644, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (163.644, 681.82, 169.96800000000002, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (169.872, 681.82, 176.17200000000003, 693.82), 'result': type: PERSON, start: 6, end: 19, score: 0.85}, {'boundingBox': (191.808, 586.54, 194.56799999999998, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (194.45999999999998, 586.54, 199.884, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (199.73999999999998, 586.54, 205.71599999999998, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (205.71599999999998, 586.54, 208.42799999999997, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (208.48799999999997, 586.54, 214.78799999999998, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (214.83599999999998, 586.54, 220.58399999999997, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (220.35599999999997, 586.54, 225.79199999999997, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (225.63599999999997, 586.54, 230.32799999999997, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 
0.85}, {'boundingBox': (230.32799999999997, 586.54, 232.97999999999996, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (232.97999999999996, 586.54, 235.69199999999995, 598.54), 'result': type: DATE_TIME, start: 21, end: 31, score: 0.85}, {'boundingBox': (110.28, 348.65, 117.228, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (117.22800000000001, 348.65, 123.528, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (123.58800000000001, 348.65, 127.77600000000001, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (127.77600000000001, 348.65, 130.536, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (130.536, 348.65, 133.296, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (133.18800000000002, 348.65, 135.9, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (135.936, 348.65, 142.02, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (142.02, 348.65, 148.104, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (148.16400000000002, 348.65, 151.16400000000002, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (151.044, 348.65, 153.756, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (153.79200000000003, 348.65, 159.87600000000003, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (159.87600000000003, 348.65, 165.96000000000004, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (165.91200000000003, 348.65, 171.99600000000004, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 0.85}, {'boundingBox': (171.99600000000004, 348.65, 178.08000000000004, 360.65), 'result': type: DATE_TIME, start: 7, end: 21, score: 
0.85}, {'boundingBox': (72.024, 324.77, 75.852, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (75.852, 324.77, 82.176, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (82.212, 324.77, 88.512, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (88.56, 324.77, 94.86, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (94.8, 324.77, 97.512, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (97.548, 324.77, 103.056, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (103.056, 324.77, 112.644, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (112.644, 324.77, 115.40400000000001, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (115.404, 324.77, 119.42399999999999, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (119.388, 324.77, 125.688, 336.77), 'result': type: PERSON, start: 0, end: 10, score: 0.85}, {'boundingBox': (158.916, 301.01, 165.312, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (165.312, 301.01, 171.06, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (171.06, 301.01, 173.82, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (173.82, 301.01, 176.57999999999998, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (176.58, 301.01, 180.24, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (179.94, 301.01, 186.264, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (186.264, 301.01, 190.452, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (190.38, 301.01, 196.68, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 
0.85}, {'boundingBox': (196.728, 301.01, 199.488, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (199.488, 301.01, 205.236, 313.01), 'result': type: LOCATION, start: 18, end: 28, score: 0.85}, {'boundingBox': (211.13, 301.01, 216.17, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (216.17, 301.01, 222.494, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (222.494, 301.01, 227.186, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (227.09, 301.01, 229.802, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (229.838, 301.01, 236.786, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (236.786, 301.01, 243.086, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (243.14600000000002, 301.01, 248.798, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (248.666, 301.01, 254.642, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (254.642, 301.01, 257.402, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (257.402, 301.01, 263.378, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (263.414, 301.01, 268.106, 313.01), 'result': type: LOCATION, start: 30, end: 41, score: 0.85}, {'boundingBox': (72.024, 253.37, 78.42, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (78.384, 253.37, 84.132, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (84.132, 253.37, 86.89200000000001, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (86.892, 253.37, 89.652, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (89.652, 253.37, 93.312, 265.37), 'result': type: LOCATION, start: 
0, end: 15, score: 0.85}, {'boundingBox': (93.024, 253.37, 99.348, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (99.348, 253.37, 103.536, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (103.572, 253.37, 109.872, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (109.92, 253.37, 112.68, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (112.68, 253.37, 118.42800000000001, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (118.428, 253.37, 121.14, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (121.188, 253.37, 127.584, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (127.548, 253.37, 130.308, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (130.308, 253.37, 134.328, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (134.268, 253.37, 139.704, 265.37), 'result': type: LOCATION, start: 0, end: 15, score: 0.85}, {'boundingBox': (72.024, 253.37, 78.42, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (78.384, 253.37, 84.132, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (84.132, 253.37, 86.89200000000001, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (86.892, 253.37, 89.652, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (89.652, 253.37, 93.312, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (93.024, 253.37, 99.348, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (99.348, 253.37, 103.536, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (103.572, 253.37, 109.872, 265.37), 'result': type: IN_PAN, start: 0, end: 10, 
score: 0.05}, {'boundingBox': (109.92, 253.37, 112.68, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}, {'boundingBox': (112.68, 253.37, 118.42800000000001, 265.37), 'result': type: IN_PAN, start: 0, end: 10, score: 0.05}]
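The list above holds one entry per character, so every detection is repeated once for each character it covers. As a reading aid, here is a sketch that collapses consecutive entries sharing the same result object into one rectangle per detection. It assumes each detection sits on a single line (a detection spanning two lines would produce one oversized rectangle). `Detection` is a stand-in for the analyzer's result objects, and `merge_char_boxes` is a hypothetical helper.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List


@dataclass
class Detection:
    """Stand-in for an analyzer result (illustration only)."""
    entity_type: str
    start: int
    end: int
    score: float


def merge_char_boxes(entries: List[Dict]) -> List[Dict]:
    """Merge consecutive per-character entries that point at the same result object."""
    merged = []
    # Group by object identity: the script stores the same result object
    # in every per-character entry belonging to one detection.
    for _, group in groupby(entries, key=lambda entry: id(entry["result"])):
        items = list(group)
        x0s, y0s, x1s, y1s = zip(*(item["boundingBox"] for item in items))
        merged.append({
            "boundingBox": (min(x0s), min(y0s), max(x1s), max(y1s)),
            "result": items[0]["result"],
        })
    return merged


# First two character boxes of the first DATE_TIME detection in the output above.
date = Detection("DATE_TIME", 16, 26, 0.95)
entries = [
    {"boundingBox": (144.576, 269.57, 150.66, 281.57), "result": date},
    {"boundingBox": (150.576, 269.57, 156.66, 281.57), "result": date},
]
merged = merge_char_boxes(entries)
```

Printing `merged` instead of the raw list gives one line per detection, which makes overlapping detections on the same text (for example `LOCATION` and `IN_PAN`) easier to spot.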

@omri374
Contributor

omri374 commented Apr 14, 2024

Thanks. I'm not sure where the issue is, as I don't have access to the original PDF. As a first step, I would suggest limiting the results by passing a score_threshold when calling the .analyze method. This would remove the false positives with very low scores (e.g. predictions with a confidence of 0.01).
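To illustrate the effect of such a threshold, here is a small sketch using the four scores reported above for the span `17..26`. The `Detection` class and `filter_by_score` function are stand-ins written for this example, not Presidio code, and the threshold of 0.5 is an arbitrary value chosen for illustration. In the original script the equivalent change would be passing `score_threshold` to `analyzer.analyze`, as shown in the comment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Stand-in for an analyzer result (illustration only)."""
    entity_type: str
    start: int
    end: int
    score: float


def filter_by_score(results: List[Detection], score_threshold: float) -> List[Detection]:
    """Keep only detections whose score reaches the threshold."""
    return [result for result in results if result.score >= score_threshold]


# In the original script this would be, roughly:
#   analyzer.analyze(text=text_to_analyze, language="en", score_threshold=0.5)

# Scores taken from the printed output for the same span (start 17, end 26).
results = [
    Detection("DATE_TIME", 17, 26, 0.85),
    Detection("US_BANK_NUMBER", 17, 26, 0.4),
    Detection("US_PASSPORT", 17, 26, 0.05),
    Detection("US_DRIVER_LICENSE", 17, 26, 0.01),
]
kept = filter_by_score(results, score_threshold=0.5)
kept_types = [result.entity_type for result in kept]
```

Note that a threshold only removes low-confidence detections. It does not change where the boxes for the remaining detections are drawn.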

Could you please share your input.pdf file?
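Independent of the input file, one detail in the posted script may be worth checking: the final loop assigns the complete `annotations` list to every page, so on a multi-page PDF each highlight would be drawn on all pages at the same coordinates. Whether that explains the screenshot is not known without the file. A sketch of keeping a page index with each box and grouping by it follows. The `"page"` key and `group_by_page` are hypothetical additions, and the pikepdf wiring is shown only as comments.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_page(entries: List[dict]) -> Dict[int, List[dict]]:
    """Group bounding-box entries by the page index stored under the 'page' key."""
    by_page: Dict[int, List[dict]] = defaultdict(list)
    for entry in entries:
        by_page[entry["page"]].append(entry)
    return dict(by_page)


# Assumed change in the extraction loop (not executed here):
#   for page_index, page_layout in enumerate(extract_pages("input.pdf")):
#       ...
#       {"page": page_index, "boundingBox": char.bbox, "result": result}
#
# Assumed change in the writing loop (not executed here):
#   for page_index, page in enumerate(pdf.pages):
#       page_annotations = [...]  # built only from by_page.get(page_index, [])
#       page.Annots = pdf.make_indirect(page_annotations)

entries = [
    {"page": 0, "boundingBox": (107.028, 681.82, 112.884, 693.82)},
    {"page": 1, "boundingBox": (144.576, 269.57, 150.66, 281.57)},
    {"page": 0, "boundingBox": (112.884, 681.82, 122.472, 693.82)},
]
by_page = group_by_page(entries)
```

The page indices in the example are made up. The printed output above does not record which page each box came from.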
