Increase multithreading processing capability #72

chenziliang0725 · 2024-04-28T08:16:24Z

Tesseract and Poppler currently only produce pages one by one. When a document has dozens of pages, processing is slow.
Can we add multithreaded processing to speed this up?

aborel · 2024-05-08T04:38:34Z

There could be ways to do this, at least for tesseract. I'll take a look.

stweil · 2024-05-08T06:06:22Z

Currently the code runs the Tesseract executable with a list of page images. Then Tesseract processes those images one by one which takes some time.

zotero-ocr could accelerate the recognition by running several parallel Tesseract processes, but that would increase the complexity because it would require an additional processing step to combine the results of the different Tesseract processes.
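To make the trade-off concrete, here is a minimal Python sketch of the "several parallel Tesseract processes" approach. It is not zotero-ocr code: the function names `ocr_page`, `ocr_pages_parallel`, and `combine_text` are hypothetical, and it assumes plain-text output only, with a `tesseract` executable on the PATH. Combining hOCR or PDF output would need a more involved merge step, which is the added complexity mentioned above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_page(image_path: str) -> str:
    """Run one Tesseract process on a single page image and return its text.

    Assumes the `tesseract` executable is on the PATH; passing `stdout`
    as the output base makes Tesseract write the result to standard output.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pages_parallel(
    image_paths: Sequence[str],
    workers: int = 4,  # hypothetical default; would be a user preference
    run_page: Callable[[str], str] = ocr_page,
) -> List[str]:
    """OCR several pages concurrently, one Tesseract process per page.

    Threads are sufficient here because each worker only waits on an
    external process. Executor.map yields results in input order, so the
    returned list follows the page order regardless of which process
    finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_page, image_paths))


def combine_text(page_texts: Sequence[str]) -> str:
    """The extra combining step: concatenate per-page text in page order."""
    return "".join(page_texts)
```

The `run_page` parameter exists only so the scheduling and combining logic can be exercised without Tesseract installed, for example by passing a stub function that returns a fixed string per page.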

I think it would be easier to add reasonable multithreading to the Tesseract code itself. The current multithreading in Tesseract is not helpful, but multithreading at the page level would bring a large benefit.

aborel · 2024-05-08T06:13:40Z

I agree. My plan was to investigate the current Tesseract situation before writing any code here, so thanks for this input.

aborel · 2024-05-09T08:29:13Z

The way I see it, the ideal situation for us would be if someone implemented Tesseract issue tesseract-ocr/tesseract#3750. Then we'd get the functionality at minimal cost (maybe just adding a preference for the number of threads).

Sadly, the Tesseract issue is 2 years old with no activity in sight, so I'm not really confident it will happen soon. I don't have the proper skill set to contribute on that side, unfortunately. However, I think the added complexity to implement this within the Zotero-OCR code might be manageable... I'd like to try.

aborel added the enhancement (New feature or request) label Apr 30, 2024

aborel self-assigned this May 8, 2024
