Performance Degradation After Finetuning #15

AnhaoROMA · 2021-04-30T07:23:51Z

Environment

Tesseract Version: v4.0.0.20181030
Platform: Ubuntu16

Motivation Introduction

I have some articles containing both English and Greek letters that need to be OCRed, so I turned to Tesseract.

After installing Tesseract successfully, I opened a terminal and ran the command tesseract detector_sample_1.png result -l eng+grc, which produced result.txt.

The original image, detector_sample_1.png, is shown below.

The contents of result.txt are shown below as well.

I found that Tesseract works quite well, apart from the content in the red block(s).

Greek letters do not actually appear very often in these articles, so I came up with the idea of retraining/finetuning the existing eng.traineddata.

Therefore, I resorted to your code.

Description of My Experiment Process

After reading your README.md, I think I should first run 8-makedata_layernew.sh and then 9-layernew.sh (with some modifications, of course).

Since I need to finetune eng.traineddata with Greek letters, I prepared a training text, eng.anhao.training_text.txt. (I had to change the extension to .txt because I cannot upload a file with the extension .training_text.) In 8-makedata_layernew.sh, I only concatenate the two training texts: cat ../langdata/eng/eng.training_text ../langdata/eng/eng.anhao.training_text >../langdata/eng/eng.layer.training_text. In addition, I prepared a new test file, eng.layertest.training_text.txt.
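To make the concatenation step easier to inspect, here is a minimal Python sketch that merges training-text files the same way the cat command does and then counts how many Greek characters end up in the result. The function names merge_training_texts and greek_char_counts are my own, not part of the repository's scripts, and treating "Unicode name contains GREEK" as the definition of a Greek character is an assumption made for illustration.

```python
import unicodedata
from collections import Counter
from pathlib import Path


def merge_training_texts(paths, out_path):
    """Concatenate UTF-8 training-text files into out_path, similar to `cat a b > out`.

    Unlike plain cat, a newline is appended to any file that does not
    end with one, so lines from adjacent files are never joined.
    """
    parts = [Path(p).read_text(encoding="utf-8") for p in paths]
    merged = "".join(p if p.endswith("\n") else p + "\n" for p in parts)
    Path(out_path).write_text(merged, encoding="utf-8")
    return merged


def greek_char_counts(text):
    """Count every character whose Unicode name contains 'GREEK' (an assumed criterion)."""
    return Counter(ch for ch in text if "GREEK" in unicodedata.name(ch, ""))


# Example usage with the paths from this issue (not run here):
# merged = merge_training_texts(
#     ["../langdata/eng/eng.training_text",
#      "../langdata/eng/eng.anhao.training_text"],
#     "../langdata/eng/eng.layer.training_text",
# )
# print(greek_char_counts(merged).most_common(10))
```

If the counts for letters such as Θ are very small compared with the size of the merged file, that may be worth knowing before training, although this sketch says nothing about how many samples the training scripts actually need.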

Then I ran ./8-makedata_layernew.sh and 9-layernew.sh, which produced eng_layer.traineddata.

Experiment Result

Disappointingly, the performance degraded, although eng_layer.traineddata can recognize some Greek letters.
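One way to put a number on the degradation is the character error rate (CER): the edit distance between the OCR output and a hand-typed ground truth, divided by the length of the ground truth. The sketch below is a plain-Python illustration, not something taken from Tesseract or from the repository's scripts; the function names are hypothetical, and a ground-truth transcription of the test image is assumed to exist.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute all cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance divided by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Example usage (file names are placeholders, not run here):
# truth = open("ground_truth.txt", encoding="utf-8").read()
# for name in ("result_eng_grc.txt", "result_eng_layer.txt"):
#     ocr = open(name, encoding="utf-8").read()
#     print(name, character_error_rate(truth, ocr))
```

Running the same image through -l eng+grc and -l eng_layer and comparing the two CER values would show how large the regression is, and computing CER separately on lines with and without Greek letters would show where it comes from.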

Conclusion

I tried to extend the existing model eng.traineddata with Greek letters using your code, but the result is disappointing. I hope you can help me.

ttbuffey · 2021-05-04T02:03:13Z

@Shreeshrii I have also come across a similar issue. Could you please help address it?

This was referenced May 12, 2021

Greek alphabets like Θ can't be recognized tesseract-ocr/tesseract#3379 (Closed)

Performance Degradation After Finetuning tesseract-ocr/tesseract#3427 (Open)
