Increase multithreading processing capability #72

Open
chenziliang0725 opened this issue Apr 28, 2024 · 4 comments
Labels: enhancement (New feature or request)

Comments

@chenziliang0725

chenziliang0725 commented Apr 28, 2024

Tesseract and Poppler currently process pages one at a time. When a document has dozens of pages, this is slow.
Could multithreaded processing be added?

@aborel aborel added the enhancement New feature or request label Apr 30, 2024
@aborel
Collaborator

aborel commented May 8, 2024

There could be ways to do this, at least for tesseract. I'll take a look.

@aborel aborel self-assigned this May 8, 2024
@stweil
Member

stweil commented May 8, 2024

Currently the code runs the Tesseract executable with a list of page images. Tesseract then processes those images one by one, which takes some time.

zotero-ocr could accelerate the recognition by running several parallel Tesseract processes, but that would increase the complexity because it would require an additional processing step to combine the results of the different Tesseract processes.

I think it would be easier to add reasonable multithreading to the Tesseract code itself. The current multithreading in Tesseract is not helpful, but multithreading at the page level would bring a large benefit.
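
For illustration, the "several parallel Tesseract processes" option mentioned above could look roughly like the sketch below. This is not zotero-ocr code: it is a minimal Python sketch, and the function names `run_tesseract`, `ocr_pages_parallel` and `combine_text`, the worker count, and the newline separator are all assumptions made for the example. It only covers plain-text output, where the combining step is a simple join in page order; merging hOCR or PDF output would need more work.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_tesseract(image_path: str) -> str:
    """OCR a single page image with the tesseract CLI and return its text."""
    # Using "stdout" as the output base makes tesseract print the result
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pages_parallel(
    image_paths: Sequence[str],
    run_page: Callable[[str], str] = run_tesseract,  # hypothetical hook, eases testing
    workers: int = 4,  # assumed default, would likely become a user preference
) -> List[str]:
    """Run one OCR job per page, several at a time, and keep page order."""
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, even if pages finish out of order.
        return list(pool.map(run_page, image_paths))


def combine_text(page_texts: Sequence[str], separator: str = "\n") -> str:
    """The extra combining step: join per-page text in page order."""
    return separator.join(page_texts)
```

A caller would pass the page images in document order, for example `combine_text(ocr_pages_parallel(["page-1.png", "page-2.png"]))` (hypothetical file names), and the order of the result follows the order of the input list.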

@aborel
Collaborator

aborel commented May 8, 2024

I agree. My plan was to investigate the current Tesseract situation before writing any code here, so thanks for this input.

@aborel
Collaborator

aborel commented May 9, 2024

The way I see it, the ideal situation for us would be if someone implemented the Tesseract issue tesseract-ocr/tesseract#3750. Then we'd get the functionality at minimal cost (maybe just adding a preference for the number of threads).

Sadly, the Tesseract issue is 2 years old with no activity in sight, so I'm not really confident it will happen soon. I don't have the proper skill set to contribute on that side, unfortunately. However, I think the added complexity to implement this within the Zotero-OCR code might be manageable... I'd like to try.
