I have provided the image from which I am trying to extract text using Tesseract OCR.
Along with it, I have also provided the text extracted from the image.
As can be seen from the images, the extracted text is not very accurate: negative signs have been omitted, and some unwanted characters appear in the output. (I have marked some of the incorrect results with blue boxes.)
I have tried to improve the results by preprocessing the images and by changing the model's parameters. Specifically, I have tried (a rough sketch follows the list):
binarizing the images
HDR processing of the images
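For reference, a minimal sketch of the binarization step and the kind of parameter changes I experimented with (the file name and exact values here are placeholders, not my actual ones):

```python
import cv2
import pytesseract

# Binarize the table image before handing it to Tesseract
# ("table.png" is a placeholder file name).
img = cv2.imread("table.png", cv2.IMREAD_GRAYSCALE)
_, binarized = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Parameter changes: LSTM engine (--oem 1) and page segmentation mode 6
# (treat the image as a single uniform block of text).
config = "--oem 1 --psm 6"
# A character whitelist can also be tried, though its support varies
# across Tesseract versions and engines:
# config += " -c tessedit_char_whitelist=0123456789.,-€"
text = pytesseract.image_to_string(binarized, config=config)
print(text)
```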
Even then, such inconsistencies remain.
How can I improve the detection and extraction of text in Tesseract? I have also tried PaddleOCR for the same task, but even there, symbols such as the euro sign and some negative signs are not detected.
@zdenop Thank you for your response. I tried every step mentioned in that documentation. Even then, some decimal points are omitted, e.g. 22.5 is read as 225. Moreover, some numbers are wrongly detected, e.g. -9 is extracted as "= )". Some negative signs are also omitted.
I have tried preprocessing the images and have implemented the following (a rough sketch of the pipeline is below the list):
noise removal
Canny edge detection
Hough line transform
binarization
HDR processing
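In case it helps to reproduce, this is roughly what that pipeline looks like (the file name and thresholds are placeholders, and the HDR step is omitted):

```python
import cv2
import numpy as np
import pytesseract

# "table.png" is a placeholder; thresholds below are illustrative values.
img = cv2.imread("table.png", cv2.IMREAD_GRAYSCALE)

# Noise removal
denoised = cv2.fastNlMeansDenoising(img, h=10)

# Binarization (Otsu)
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Canny edge detection followed by a probabilistic Hough line transform
# to locate the long table rulings.
edges = cv2.Canny(binary, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=100, maxLineGap=10)

# One possible use of the detected lines: paint the rulings white so they
# are not misread as minus signs or stray characters.
cleaned = binary.copy()
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(cleaned, (int(x1), int(y1)), (int(x2), int(y2)), 255, 3)

text = pytesseract.image_to_string(cleaned, config="--psm 6")
print(text)
```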
Please provide your guidance and help me resolve this issue.
And what did you learn about table recognition?
Which forum posts about table recognition have you read, and what do other issues say about table recognition? You should check these sources BEFORE posting the issue.