Bug in multi-column page with dense text #1795

Open
vikasr111 opened this issue Nov 25, 2024 · 2 comments
Labels
type: bug Something isn't working

Comments

@vikasr111

Bug description

I am trying to use docTR on a document whose text is arranged in two columns and is densely packed. The text detection is incorrect: it identifies multiple overlapping text blocks, so the text output is also incorrect.

Here's the original document:
Screenshot 2024-11-25 at 11 45 55 AM

Here's the OCR plot:
Screenshot 2024-11-25 at 11 45 15 AM

Here's the segmentation result:
Screenshot 2024-11-25 at 11 45 28 AM

How can I address this?

Code snippet to reproduce the bug

from doctr.io import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")

Error traceback

No error but the output is incorrect
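
Because no exception is raised, one way to quantify the problem is to count pairs of detected boxes that overlap. The following is a minimal sketch, not part of docTR. It assumes each box is given as ((xmin, ymin), (xmax, ymax)) in relative coordinates, which is how straight-box geometries are expected to appear in the exported result. The helper names `iou` and `overlapping_pairs` and the `min_iou` default are hypothetical.

```python
from itertools import combinations

# Assumed box layout: ((xmin, ymin), (xmax, ymax)) in relative coordinates.
Box = tuple[tuple[float, float], tuple[float, float]]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(boxes: list[Box], min_iou: float = 0.1) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose IoU is at least min_iou."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if iou(a, b) >= min_iou
    ]


# Hand-made example boxes (illustrative values, not taken from the report).
boxes = [
    ((0.10, 0.10), (0.40, 0.15)),
    ((0.30, 0.10), (0.60, 0.15)),  # overlaps the first box
    ((0.70, 0.50), (0.90, 0.55)),  # isolated
]
print(overlapping_pairs(boxes))  # [(0, 1)]
```

A large number of overlapping pairs on a page would be consistent with the duplicated text blocks described above. The pairwise check is quadratic in the number of boxes, which should be acceptable for a single page.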

Environment

Python 3.10

Deep Learning backend

Torch

vikasr111 added the type: bug label on Nov 25, 2024
@felixdittrich92
Contributor

felixdittrich92 commented Nov 25, 2024

Hi @vikasr111 👋,

Thanks for reporting 👍

It's already planned to retrain all detection models with our new augmentation pipeline and an extended dataset for pretraining to make them more robust.

Could you please give "db_mobilenet_v3_large" a try as the detection arch? This model is already pretrained with our new augmentation pipeline.

Additionally, you can experiment with the bin_thresh and box_thresh values (lower value -> more detections, less accurate | higher value -> possibly fewer detections, more accurate):
https://mindee.github.io/doctr/using_doctr/using_models.html#advanced-options

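from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load the document (same placeholder path as in the report)
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")

predictor = ocr_predictor(
    det_arch="db_mobilenet_v3_large",
    reco_arch="parseq",
    pretrained=True,
    preserve_aspect_ratio=False,
    symmetric_pad=False,
)

# Tune the detection post-processing thresholds
predictor.det_predictor.model.postprocessor.bin_thresh = 0.35
predictor.det_predictor.model.postprocessor.box_thresh = 0.3

result = predictor(doc)
result.show()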

Screenshot from 2024-11-25 08-33-47
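
To give an intuition for the two thresholds, here is a simplified 1-D toy and not docTR's actual post-processing. It assumes the usual DB-style scheme: the probability map is binarized at bin_thresh, connected regions become candidate boxes, and a candidate is kept only if its mean probability reaches box_thresh. The function name `detect_runs` and the sample probabilities are hypothetical.

```python
def detect_runs(probs, bin_thresh, box_thresh):
    """Return (start, end) index ranges of kept regions in a 1-D probability row.

    A region is a contiguous run of values >= bin_thresh; it is kept only if
    the mean probability over the run is >= box_thresh.
    """
    runs = []
    start = None
    # A -inf sentinel guarantees that a run reaching the end gets closed.
    for i, p in enumerate(list(probs) + [float("-inf")]):
        if p >= bin_thresh and start is None:
            start = i
        elif p < bin_thresh and start is not None:
            score = sum(probs[start:i]) / (i - start)
            if score >= box_thresh:
                runs.append((start, i))
            start = None
    return runs


# Two strong regions joined by a weak bridge (0.32), plus one faint blob (0.33).
row = [0.1, 0.5, 0.6, 0.32, 0.6, 0.5, 0.1, 0.33, 0.1]

print(detect_runs(row, bin_thresh=0.30, box_thresh=0.1))  # [(1, 6), (7, 8)]
print(detect_runs(row, bin_thresh=0.30, box_thresh=0.4))  # [(1, 6)]
print(detect_runs(row, bin_thresh=0.35, box_thresh=0.3))  # [(1, 3), (4, 6)]
```

In this toy, raising bin_thresh splits the merged region into two and discards the faint blob, while raising box_thresh only filters out low-scoring candidates. Real pages are 2-D and the detector applies further steps such as box expansion, so the best values still need to be found by experiment.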

@felixdittrich92
Contributor

CC @odulcy-mindee A good sign that the new augmentation pipeline improves our models ^^
Nevertheless, I think we need to expand the dataset a bit.
