Bug in multi-column page with dense text #1795

vikasr111 · 2024-11-25T06:25:49Z

Bug description

I am trying to use DocTR on a document that has dense text arranged in two columns. I noticed that the text detection is incorrect: it identifies multiple overlapping text blocks, so the text output is also incorrect.

Here's the original document:

Here's the OCR plot:

Here's the segmentation result:

How to address it?

Code snippet to reproduce the bug

from doctr.io import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")

Error traceback

No error but the output is incorrect

Environment

python 3.10

Deep Learning backend

Torch


felixdittrich92 · 2024-11-25T07:34:08Z

Hi @vikasr111 👋,

Thanks for reporting 👍

It's already planned to retrain all detection models with our new augmentation pipeline and an extended dataset for pretraining to make them more robust.

Could you please give "db_mobilenet_v3_large" a try as the detection arch? This model is already pretrained with our new augmentation pipeline.

Additionally, you can experiment with the bin_thresh and box_thresh values (lower value -> more detections / less accurate | higher value -> possibly fewer detections / more accurate):
https://mindee.github.io/doctr/using_doctr/using_models.html#advanced-options

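from doctr.io import DocumentFile
from doctr.models import ocr_predictor

doc = DocumentFile.from_pdf("path/to/your/doc.pdf")

predictor = ocr_predictor(
    det_arch="db_mobilenet_v3_large",
    reco_arch="parseq",
    pretrained=True,
    preserve_aspect_ratio=False,
    symmetric_pad=False,
)

# Override the detection post-processing thresholds
predictor.det_predictor.model.postprocessor.bin_thresh = 0.35
predictor.det_predictor.model.postprocessor.box_thresh = 0.3

result = predictor(doc)
result.show()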
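
To build some intuition for what the two thresholds control, here is a toy pure-Python sketch of DB-style post-processing. It is not docTR's actual implementation (which, as far as I know, works on contours and polygons of the probability map); `filter_regions` and `PROB_MAP` are hypothetical names used only for illustration. The assumption is that bin_thresh decides which pixels count as text, and box_thresh discards regions whose mean probability is too low.

```python
from collections import deque

# Hypothetical probability map: two text lines (rows 0 and 2) joined by one
# weak pixel (0.32), plus an isolated low-confidence pixel (0.4).
PROB_MAP = [
    [0.9, 0.9, 0.9, 0.0, 0.0],
    [0.0, 0.32, 0.0, 0.0, 0.4],
    [0.8, 0.8, 0.8, 0.0, 0.0],
]


def filter_regions(prob_map, bin_thresh, box_thresh):
    """Return (box, mean_score) pairs; box is (row_min, col_min, row_max, col_max)."""
    rows, cols = len(prob_map), len(prob_map[0])
    seen = [[False] * cols for _ in range(rows)]
    kept = []
    for r in range(rows):
        for c in range(cols):
            # Step 1: binarize. Pixels below bin_thresh are background.
            if seen[r][c] or prob_map[r][c] < bin_thresh:
                continue
            # Step 2: collect the 4-connected region that contains (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and prob_map[ny][nx] >= bin_thresh):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Step 3: keep the region only if its mean score reaches box_thresh.
            score = sum(prob_map[y][x] for y, x in pixels) / len(pixels)
            if score >= box_thresh:
                ys = [y for y, _ in pixels]
                xs = [x for _, x in pixels]
                kept.append(((min(ys), min(xs), max(ys), max(xs)), score))
    return kept


for bin_t, box_t in [(0.3, 0.5), (0.35, 0.5), (0.35, 0.3)]:
    boxes = [box for box, _ in filter_regions(PROB_MAP, bin_t, box_t)]
    print(f"bin_thresh={bin_t} box_thresh={box_t} -> {boxes}")
```

In this toy map, bin_thresh=0.3 lets the weak pixel bridge the two lines into a single box, while bin_thresh=0.35 separates them; lowering box_thresh to 0.3 additionally keeps the isolated low-confidence region. Something similar may be happening with dense text on a real page, which is why it is worth trying a few values.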

felixdittrich92 · 2024-11-25T07:43:29Z

CC @odulcy-mindee A good sign that the new augmentation pipeline improves our models ^^
Nevertheless, I think we need to expand the dataset a bit.

vikasr111 added the type: bug Something isn't working label Nov 25, 2024
