Determine available languages and provide a choice for them #8

zuphilip · 2019-11-23T11:01:03Z

Currently, we use a fixed language such as deu or eng for OCR with Tesseract. In many cases, however, it is better to choose script/Latin, or script/Fraktur for old texts. Other languages and scripts should also be available to choose from.

There are several things to consider here:

1. How can we find out which languages are available for the currently installed Tesseract? It is possible to run commands like tesseract --list-langs from the extension, but from Zotero we cannot access the output or pipe it anywhere. Should we just ship a one-line script (a shell script for Linux/macOS and a bat file for Windows) that calls the command above and redirects its output to a file, which we can then analyze? Other ideas?
2. We could have some general options and define a default model there. In the settings pane you could then change this model depending on the languages you have installed (see 1.).
3. We could analyze the language field of each Zotero entry to choose a different model. This would allow, for example, using the deu model for German texts and the eng model for English texts. However, it might not always be that simple. For older German texts one should probably use the script/Fraktur model instead, and the script/Latin model is quite often better for texts that include names in foreign languages.
4. Maybe it is better to ask before each call which language to use. You could then manually select all the entries that can be recognized with the same language. Moreover, one could possibly offer some more Tesseract options to toggle on and off. What do you think?
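
To make the language-field idea more concrete, here is a minimal Python sketch (not the extension's code). The mapping table, the function name pick_model, and the script/Latin fallback are assumptions chosen for illustration; a real table would need many more entries and probably special handling for Fraktur.

```python
# Hypothetical mapping from common language-field values to Tesseract model
# names. This table is an assumption for illustration and is far from complete.
FIELD_TO_MODEL = {
    "de": "deu", "ger": "deu", "deu": "deu", "german": "deu",
    "en": "eng", "eng": "eng", "english": "eng",
    "fr": "fra", "fre": "fra", "fra": "fra", "french": "fra",
}


def pick_model(language_field, available, fallback="script/Latin"):
    """Return a model for an item's language field, or the fallback."""
    # Language fields often look like "de-DE" or "en_US": keep the first part.
    key = language_field.strip().lower().replace("_", "-").split("-")[0]
    model = FIELD_TO_MODEL.get(key)
    # Only use the mapped model if it is actually installed.
    if model in available:
        return model
    return fallback
```

With only deu and eng installed, an entry tagged "de-DE" would get deu, while an unknown or uninstalled language would fall back to the script model.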

CC @stweil @luerhard


stweil · 2019-11-23T14:41:36Z

Keep it simple. I think it would be sufficient to have a user option (similar to the tesseract path option) for the language / script which is preset to eng (the default language which is always installed). The user would be responsible for installing and selecting the right models, otherwise Tesseract would simply fail with an error.

Latin (or script/Latin, depending on your installation) is a good choice for all texts based on Latin script. Some users might need Cyrillic, Greek, Arabic or other scripts. The user option would also allow setting Latin+Greek+Arabic, for example, so I see no need to ask each time.
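
If the list of installed models were known, a combined value such as Latin+Greek+Arabic could be checked before Tesseract is called. The following Python sketch only illustrates that check; the function name missing_models is hypothetical and the installed set is passed in by the caller.

```python
def missing_models(option, available):
    """Return the parts of a '+'-joined language option that are not installed."""
    # Tesseract accepts several models joined with '+', e.g. "Latin+Greek".
    requested = [part.strip() for part in option.split("+") if part.strip()]
    return [part for part in requested if part not in available]
```

An empty result means every requested model is installed; otherwise the returned names could be shown to the user instead of a bare Tesseract failure.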

luerhard · 2019-11-23T18:29:16Z

Regarding 1.
For Unix systems it would probably be enough to just run the command
tesseract --list-langs > /path/to/file.txt
to print all the available languages to a file.

If this works fine, one could implement a dropdown menu to just select the language. I think that would be enough.
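
The file written by the command above could then be parsed into a list for such a menu. This is a rough Python sketch under the assumption that the output consists of one header line ending in a colon followed by one model name per line, which is what recent Tesseract versions print; the function name parse_list_langs is made up for this example.

```python
def parse_list_langs(text):
    """Extract model names from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon
        # (assumed format, e.g. "List of available languages (3):").
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs
```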

zuphilip · 2019-11-26T21:47:27Z

A simple solution, a free text box in the new preferences as @stweil suggested, is now implemented.

I am aware of the command in tesseract to show all available languages, but I don't see a possibility to call this from Zotero and save its output somewhere. But yeah we could create a file with something like this.

Let us wait a little longer and see how well the simple solution already works in practice.

zettelberg · 2024-05-28T14:28:21Z

I had a related problem: I am not accustomed to typing "deu" and always use "de" in similar cases (which I should have verified by trying "tesseract --list-langs", of course), so it took me quite a long time to find the solution, also because the system sadly doesn't show any error message in that case. A dropdown box (or simply more examples!) would have helped a lot!
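
Short of a dropdown, a hint for mistyped values might already help. As a sketch only, assuming the installed models are known, Python's difflib can propose similar names; the function name suggest_model and the cutoff of 0.6 are arbitrary choices for this example.

```python
import difflib


def suggest_model(entered, available):
    """Return installed model names that resemble what the user typed."""
    if entered in available:
        return [entered]
    # get_close_matches ranks candidates by string similarity.
    return difflib.get_close_matches(entered, available, n=3, cutoff=0.6)
```

Typing "de" with deu, eng and osd installed would yield deu as a suggestion, which could be shown in place of a silent failure.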

zuphilip mentioned this issue Nov 24, 2019

Add preferences #10

Merged

zuphilip added the enhancement and help wanted labels Jan 23, 2021
