Tesseract 3 and Tika

Eric Pugh edited this page Oct 29, 2019 · 1 revision

Tesseract 4 is a major improvement over Tesseract 3, but not everyone has upgraded to it (including one of our customers), so we investigated using Tesseract 3 with Tika. It turns out that the hOCR support in Tesseract 3 is identical to that in Tesseract 4, which means that Tika doesn't mind running against the older version.

Want to try out Tesseract 3 inside of Tika? Check out the Docker image at https://github.com/o19s/pdf-discovery-demo/tree/master/tika-server-tesseract-3, which is based on the https://logicalspark.github.io/docker-tikaserver/ project.
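
Once the container is running, a quick way to check that OCR works is to send a scanned document to Tika Server and look at the text that comes back. The following is a minimal sketch, not something taken from the demo repository: it assumes the server is reachable at `http://localhost:9998` (Tika Server's default port; adjust to however you published the container's port), and the helper names `build_ocr_request` and `extract_text` are made up for illustration.

```python
import urllib.request

# Assumption: the container publishes Tika Server on its default port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(document_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika Server for plain-text extraction."""
    return urllib.request.Request(
        tika_url,
        data=document_bytes,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def extract_text(path, tika_url=TIKA_URL):
    """Send the file at `path` to Tika Server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_ocr_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("scan.png")` (a hypothetical file name) on a scanned image should return the recognized text if Tesseract is installed in the image; an empty result usually means OCR did not run.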