Environment
Tesseract Version: v4.0.0.20181030
Platform: Ubuntu16
Motivation Introduction
I have some articles containing both English and Greek letters that need to be OCRed, so I turned to Tesseract.
After installing Tesseract successfully, I opened a terminal and ran the command tesseract detector_sample_1.png result -l eng+grc, which produced result.txt.
The original image, detector_sample_1.png, is shown below.
The resulting result.txt is shown below as well.
I found that Tesseract works quite well, apart from the content in the red block(s).
Greek letters do not actually appear very often in these articles, so I came up with the idea of retraining/fine-tuning the existing eng.traineddata.
Therefore, I turned to your code.
Description of My Experiment Process
After reading your README.md, I concluded that I should first run 8-makedata_layernew.sh and then 9-layernew.sh (with some modifications, of course).
Since I need to fine-tune eng.traineddata with Greek letters, I prepared a training text, eng.anhao.training_text.txt. (I had to change the extension to .txt because I cannot upload a file with the extension .training_text.) The only change I made in 8-makedata_layernew.sh was to run cat ../langdata/eng/eng.training_text ../langdata/eng/eng.anhao.training_text >../langdata/eng/eng.layer.training_text. In addition, I prepared a new test file, eng.layertest.training_text.txt.
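For reference, the concatenation step can be sketched as a small shell function that also reports how many lines each input contributes to the combined training text. This is only an illustration: build_layer_text is a hypothetical helper name and is not part of the repository's scripts, and the line counts are just a quick way to see how small the Greek portion is relative to the base text.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): concatenate a base
# training text with an extra one and report the line counts of each.
build_layer_text() {
    local base="$1" extra="$2" out="$3"

    # Same concatenation as the cat command used in 8-makedata_layernew.sh.
    cat "$base" "$extra" > "$out" || return 1

    local n_base n_extra n_out
    n_base=$(wc -l < "$base")
    n_extra=$(wc -l < "$extra")
    n_out=$(wc -l < "$out")

    # $(( )) strips any padding that wc adds on some platforms.
    echo "base=$((n_base)) extra=$((n_extra)) combined=$((n_out))"
}
```

With the paths from this issue, the call would be build_layer_text ../langdata/eng/eng.training_text ../langdata/eng/eng.anhao.training_text ../langdata/eng/eng.layer.training_text.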
Then I ran ./8-makedata_layernew.sh and 9-layernew.sh, which produced eng_layer.traineddata.
Experiment Result
Disappointingly, the performance degraded, although eng_layer.traineddata can recognize some Greek letters.
Conclusion
I tried to extend the existing model eng.traineddata with Greek letters using your code, but the result is disappointing. I hope you can help me.