Performance Degradation After Finetuning #15

Open
AnhaoROMA opened this issue Apr 30, 2021 · 1 comment

Comments

@AnhaoROMA

Environment

Tesseract Version: v4.0.0.20181030
Platform: Ubuntu16

Motivation Introduction

I have some articles containing both English and Greek letters that need to be OCRed, so I turned to Tesseract.

After installing Tesseract successfully, I opened a terminal and ran the command tesseract detector_sample_1.png result -l eng+grc, which produced result.txt.
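
For anyone who wants to reproduce this step from a script, here is a minimal Python sketch of the same invocation. The helper names build_tesseract_command and run_ocr are hypothetical (they are not part of Tesseract or of this repository), and the sketch assumes that the tesseract executable is on PATH and that the eng and grc language data are installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, langs):
    """Return the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG1+LANG2."""
    return ["tesseract", str(image), str(output_base), "-l", "+".join(langs)]


def run_ocr(image, output_base, langs=("eng", "grc")):
    """Run Tesseract and return the recognized text (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, langs), check=True)
    # With plain-text output, Tesseract writes OUTPUT_BASE.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling run_ocr("detector_sample_1.png", "result") should then correspond to the command above and return the contents of result.txt.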

The original image, detector_sample_1.png, is shown below.
[Image: detector_sample_1]

The resulting result.txt is shown below as well.
[Image: result]

I found that Tesseract works quite well, apart from the content in the red box(es).
[Image: 1]

Greek letters do not appear very frequently in these articles, so I came up with the idea of retraining/fine-tuning the existing eng.traineddata.

Therefore, I turned to your code.

Description of My Experiment Process

After reading your README.md, I concluded that I should first run 8-makedata_layernew.sh and then 9-layernew.sh (with some modifications, of course).

Since I need to fine-tune eng.traineddata with Greek letters, I prepared a training text, eng.anhao.training_text.txt. (I had to change the extension to .txt because I cannot upload a file with the extension .training_text.)
The only change I made in 8-makedata_layernew.sh was cat ../langdata/eng/eng.training_text ../langdata/eng/eng.anhao.training_text >../langdata/eng/eng.layer.training_text.
In addition, I prepared a new test file, eng.layertest.training_text.txt.
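
As a sanity check on the combined training text, the following Python sketch mirrors the cat step and reports what fraction of the non-whitespace characters are Greek. The function names are hypothetical, the Greek test is a simple Unicode-block range check (so it also counts the few Coptic characters in that block), and nothing here is taken from the repository's scripts.

```python
from pathlib import Path


def is_greek(ch):
    """True for characters in the Greek and Coptic or Greek Extended Unicode blocks."""
    return "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"


def concatenate_texts(sources, destination):
    """Python equivalent of `cat SOURCES... > DESTINATION`; returns the combined text."""
    combined = "".join(Path(src).read_text(encoding="utf-8") for src in sources)
    Path(destination).write_text(combined, encoding="utf-8")
    return combined


def greek_share(text):
    """Fraction of non-whitespace characters that are Greek (0.0 for empty text)."""
    chars = [ch for ch in text if not ch.isspace()]
    return sum(is_greek(ch) for ch in chars) / len(chars) if chars else 0.0
```

Passing the two source paths from the cat command to concatenate_texts and then calling greek_share on the result gives a rough idea of how much Greek material the fine-tuning run actually sees.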

Then I ran ./8-makedata_layernew.sh and 9-layernew.sh, which produced eng_layer.traineddata.

Experiment Result

Disappointingly, the performance degraded, although eng_layer.traineddata can recognize some Greek letters.
[Image: 3]
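
To put a number on the degradation, one option is to compare each model's output against a hand-typed ground-truth transcription using the character error rate (CER). The sketch below is a plain Levenshtein-based CER in Python. It is an illustrative assumption, not the evaluation used by the training scripts, and the ground-truth transcription would have to be prepared separately.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Computing character_error_rate once for the eng+grc output and once for the eng_layer output, both against the same ground truth, would show how large the regression is and whether it comes mainly from the English or the Greek portions.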

Conclusion

I tried to extend the existing model eng.traineddata with Greek letters using your code, but the result is disappointing. I hope you can help me.

@ttbuffey

ttbuffey commented May 4, 2021

@Shreeshrii I have also come across a similar issue. Could you please help address it?
