Tesseract and poppler currently process pages only one at a time. When a document has dozens of pages, this is slow.
Could multithreaded processing be added to speed this up?
Currently the code runs the Tesseract executable with a list of page images. Tesseract then processes those images one by one, which takes some time.
zotero-ocr could accelerate the recognition by running several parallel Tesseract processes, but that would increase the complexity because it would require an additional processing step to combine the results of the different Tesseract processes.
I think it would be easier to add reasonable multithreading to the Tesseract code itself. The current multithreading in Tesseract is not helpful, but multithreading at the page level would bring a large benefit.
The way I see it, the ideal situation for us would be if someone implemented this Tesseract issue: tesseract-ocr/tesseract#3750. Then we'd get the functionality at minimal cost (maybe just adding a preference for the number of threads).
Sadly, the Tesseract issue is 2 years old with no activity in sight, so I'm not really confident it will happen soon. I don't have the proper skill set to contribute on that side, unfortunately. However, I think the added complexity to implement this within the Zotero-OCR code might be manageable... I'd like to try.
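To make the "several parallel Tesseract processes plus a combine step" idea from this thread concrete, here is a minimal Python sketch. It is not zotero-ocr code: the function names `ocr_page` and `ocr_pages_parallel`, the `workers` parameter, and the page file names are all hypothetical. It assumes a `tesseract` executable on the PATH and only covers plain text output; merging per-page PDF or hOCR output would need extra work.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_page(image_path: str) -> str:
    """Run one Tesseract process on a single page image and return its text.

    Assumes `tesseract` is on the PATH; the special output base `stdout`
    makes Tesseract print the recognized text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pages_parallel(
    image_paths: Sequence[str],
    workers: int = 4,  # hypothetical "number of processes" preference
    recognize: Callable[[str], str] = ocr_page,
) -> List[str]:
    """Recognize pages concurrently and return the texts in page order."""
    # Threads are enough here: each one mostly waits on its own
    # external Tesseract process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, not completion order,
        # so the combine step reduces to collecting the list.
        return list(pool.map(recognize, image_paths))
```

With page images such as `page-1.png`, `page-2.png`, ..., a call like `"\f".join(ocr_pages_parallel(paths, workers=4))` would give one text with form feeds between pages. The `recognize` parameter exists so the scheduling and ordering logic can be tried without Tesseract installed.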