Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI: Unexpected handling of parameters using -c VAR=VALUE #4354

Closed
sdubinsky opened this issue Nov 17, 2024 · 27 comments
Closed

CLI: Unexpected handling of parameters using -c VAR=VALUE #4354

sdubinsky opened this issue Nov 17, 2024 · 27 comments

Comments

@sdubinsky
Copy link

sdubinsky commented Nov 17, 2024

Note by @amitdo. The original title was: Inconsistency Between --oem and -c "tessedit_ocr_engine_mode"

Current Behavior

As far as I can tell, these two options should be identical, however the attached image produces different output. With --oem it says 11, with the config var it says 1. I should add that with any --oem value other than 0, it returns '1' not '11'.

Full command:

tesseract -l eng --oem 0 --psm 7  -c "tessedit_char_whitelist=0123456789"  ramdisk/test_0.png  -
tesseract -l eng --psm 7  -c "tessedit_char_whitelist=0123456789" -c "tessedit_ocr_engine_mode=0" ramdisk/test_0.png  -

test_0

Expected Behavior

Both commands should return the same result, ideally 11.

Suggested Fix

They should either return the same information or the docs should be updated to clarify the difference.

tesseract -v

tesseract 5.4.1
leptonica-1.84.1
libgif 5.2.2 : libjpeg 6b (libjpeg-turbo 3.0.2) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.3.1.zlib-ng : libwebp 1.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libcurl/8.9.1 OpenSSL/3.2.2 zlib/1.3.1.zlib-ng libidn2/2.3.7 nghttp2/1.62.1

Operating System

No response

Other Operating System

Fedora 42.

uname -a

Linux deux-ex 6.11.7-300.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 8 19:23:10 UTC 2024 x86_64 GNU/Linux

Compiler

No response

CPU

Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz (12)

Virtualization / Containers

None.

Other Information

No response

@zdenop

This comment was marked as off-topic.

@sdubinsky

This comment was marked as off-topic.

@zdenop

This comment was marked as off-topic.

@sdubinsky

This comment was marked as off-topic.

@sdubinsky

This comment was marked as off-topic.

@zdenop
Copy link
Contributor

zdenop commented Nov 17, 2024

Which tessdata you use?

@sdubinsky
Copy link
Author

eng.traineddata

@sdubinsky
Copy link
Author

I'm not sure how to find the version

@zdenop
Copy link
Contributor

zdenop commented Nov 17, 2024

Check filesize of eng.traineddata and compare it with https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best, https://github.com/tesseract-ocr/tessdata_fast

  • make sure you use the latest tessract (5.5.0)

@sdubinsky
Copy link
Author

I am using the first version linked.

@zdenop
Copy link
Contributor

zdenop commented Nov 17, 2024

OK. I think I know ( ;-) ) the reason:
tessedit_ocr_engine_mode is INIT variable

, INT_INIT_MEMBER(tessedit_ocr_engine_mode, tesseract::OEM_DEFAULT,
"Which OCR engine(s) to run (Tesseract, LSTM, both)."
" Defaults to loading and running the most accurate"
" available.",
this->params())

e.g. it must be set during initialisation of API/language model (see enginemode variable in init ):

tesseract/src/tesseract.cpp

Lines 736 to 741 in 43b8d74

const int init_failed = api.Init(datapath, lang, enginemode, &(argv[arg_i]), argc - arg_i,
&vars_vec, &vars_values, false);
if (!SetVariablesFromCLArgs(api, argc, argv)) {
return EXIT_FAILURE;
}

But variables speficied via "-c" argument are setup by function SetVariablesFromCLArgs after init phase. E.g. -c "tessedit_ocr_engine_mode=0" has no effect and OEM_DEFAULT is used....

@stweil
Copy link
Contributor

stweil commented Nov 17, 2024

Yes, that's the right explanation.

@sdubinsky
Copy link
Author

Thanks!

@amitdo
Copy link
Collaborator

amitdo commented Nov 18, 2024

7dcfd02

@sdubinsky,
Congrats! You've just found a 11.5 years old bug :-)

@amitdo amitdo changed the title Inconsistency Between --oem and -c "tessedit_ocr_engine_mode" CLI: Wrong handling of parameters using -c VT=VALUE Nov 18, 2024
@amitdo amitdo changed the title CLI: Wrong handling of parameters using -c VT=VALUE CLI: Wrong handling of parameters using -c VAR=VALUE Nov 18, 2024
@zdenop
Copy link
Contributor

zdenop commented Nov 20, 2024

@amitdo : why do you claim it is a bug? Can you elaborate on this?

@amitdo
Copy link
Collaborator

amitdo commented Nov 20, 2024

@zdenop,
Is your question is about the referenced commit or in general?

@zdenop
Copy link
Contributor

zdenop commented Nov 20, 2024

@amitdo : My question is about your claims (bug, wrong handling of parameters).

@amitdo
Copy link
Collaborator

amitdo commented Nov 20, 2024

You already analyzed what's wrong in the code which makes any init parameter set with -c a no-op.

I don't see how this can not be classified as a bug.

@stweil
Copy link
Contributor

stweil commented Nov 20, 2024

Not any. It depends on the parameter. Many parameters are used after init, and these parameters work fine. Examples: debug output, thresholding, ...

@stweil stweil changed the title CLI: Wrong handling of parameters using -c VAR=VALUE CLI: Unexpected handling of parameters using -c VAR=VALUE Nov 20, 2024
@amitdo
Copy link
Collaborator

amitdo commented Nov 20, 2024

Yes, I know, but 'wrong handling of parameters' is still true, because the code wrongly defers the task of parsing the -c VAR=VALUE (and pushing them to vectors to be used by init()).

@stweil
Copy link
Contributor

stweil commented Nov 20, 2024

So ideally Tesseract should

  • Set all parameters to default values.
  • Optionally read parameter values from a global configuration file (missing implementation).
  • Optionally read parameter values from a user's configuration file (missing implementation).
  • Apply parameters which are given with -c on the command line. That's currently wrong.
  • Set parameters which are implicitly given with command line arguments like --psm or --dpi.
  • Read parameters given by config files on the command line.
  • Run init.

@amitdo
Copy link
Collaborator

amitdo commented Nov 20, 2024

Using:

tesseract in.tif out pdf

Works fine when the config file is in the right path.

@amitdo
Copy link
Collaborator

amitdo commented Nov 20, 2024

I mean, the config files are already given to init() in the current code.

@amitdo
Copy link
Collaborator

amitdo commented Nov 20, 2024

  • Set all parameters to default values.

For the parameters that are set with -c, this is already done by tesseract (with C macros).

@zdenop
Copy link
Contributor

zdenop commented Nov 26, 2024

@amitdo: If I understood you correctly, you suggest this kind of modification of tesseract:

diff --git a/src/tesseract.cpp b/src/tesseract.cpp
index 2c27d2a0..a6bd6405 100644
--- a/src/tesseract.cpp
+++ b/src/tesseract.cpp
@@ -273,32 +273,6 @@ static void PrintHelpMessage(const char *program) {
      program, program, program);
}

-static bool SetVariablesFromCLArgs(tesseract::TessBaseAPI &api, int argc, char **argv) {
-  bool success = true;
-  char opt1[256], opt2[255];
-  for (int i = 0; i < argc; i++) {
-    if (strcmp(argv[i], "-c") == 0 && i + 1 < argc) {
-      strncpy(opt1, argv[i + 1], 255);
-      opt1[255] = '\0';
-      char *p = strchr(opt1, '=');
-      if (!p) {
-        fprintf(stderr, "Missing = in configvar assignment\n");
-        success = false;
-        break;
-      }
-      *p = 0;
-      strncpy(opt2, strchr(argv[i + 1], '=') + 1, sizeof(opt2) - 1);
-      opt2[254] = 0;
-      ++i;
-
-      if (!api.SetVariable(opt1, opt2)) {
-        fprintf(stderr, "Could not set option: %s=%s\n", opt1, opt2);
-      }
-    }
-  }
-  return success;
-}
-
static void PrintLangsList(tesseract::TessBaseAPI &api) {
  std::vector<std::string> languages;
  api.GetAvailableLanguagesAsVector(&languages);
@@ -485,7 +459,16 @@ static bool ParseArgs(int argc, char **argv, const char **lang, const char **ima
      *print_fonts_table = true;
#endif  // ndef DISABLED_LEGACY_ENGINE
    } else if (strcmp(argv[i], "-c") == 0 && i + 1 < argc) {
-      // handled properly after api init
+      std::string argument(argv[i + 1]);
+      auto equal_pos = argument.find('=');
+      if (equal_pos == std::string::npos) {
+          throw std::invalid_argument("Missing '=' in configvar assignment");
+      }
+      // Extract key and value
+      std::string key = argument.substr(0, equal_pos);
+      std::string value = argument.substr(equal_pos + 1);
+      vars_vec->push_back(key);
+      vars_values->push_back(value);
      ++i;
    } else if (*image == nullptr) {
      *image = argv[i];
@@ -736,10 +719,6 @@ int main(int argc, char **argv) {
  const int init_failed = api.Init(datapath, lang, enginemode, &(argv[arg_i]), argc - arg_i,
                                   &vars_vec, &vars_values, false);

-  if (!SetVariablesFromCLArgs(api, argc, argv)) {
-    return EXIT_FAILURE;
-  }
-
  // SIMD settings might be overridden by config variable.
  tesseract::SIMDDetect::Update();

Right?

I am just wandering about that comment (handled properly after api init). Maybe some parameter should be set after init??? At least tesseract -l eng --psm 7 -c "tessedit_char_whitelist=0123456789" -c "tessedit_ocr_engine_mode=0" test_0.png - works after above modification.

@amitdo
Copy link
Collaborator

amitdo commented Nov 26, 2024

Zdenko,

About your patch - LGTM.

About that comment. I think you added it. You just didn't take into account the INIT parameters.

Maybe some parameter should be set after init???

I don't think there is a good reason to do this.

@amitdo
Copy link
Collaborator

amitdo commented Nov 26, 2024

BTW, the number of INIT parameters is very small.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants
@zdenop @sdubinsky @stweil @amitdo and others