CLI: Unexpected handling of parameters using -c VAR=VALUE #4354
Comments
Which tessdata do you use?
eng.traineddata
I'm not sure how to find the version.
Check the file size of eng.traineddata and compare it with https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best, https://github.com/tesseract-ocr/tessdata_fast
I am using the first version linked.
OK. I think I know ( ;-) ) the reason: tesseract/src/ccmain/tesseractclass.cpp, lines 119 to 123 in 43b8d74,
i.e. it must be set during initialisation of the API/language model (see lines 736 to 741 in 43b8d74).
But variables specified via the "-c" argument are set up by function
Yes, that's the right explanation.
Thanks!
@sdubinsky,
@amitdo : why do you claim it is a bug? Can you elaborate on this?
@zdenop,
@amitdo : My question is about your claims (bug, wrong handling of parameters).
You already analyzed what's wrong in the code, which makes any init parameter set with -c a no-op. I don't see how this cannot be classified as a bug.
Not any. It depends on the parameter. Many parameters are used after init, and these parameters work fine. Examples: debug output, thresholding, ...
Yes, I know, but 'wrong handling of parameters' is still true, because the code wrongly defers the task of parsing the -c VAR=VALUE arguments (and pushing them to the vectors used by init()).
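To make the distinction in this exchange concrete, here is a toy C++ model of a parameter that is consumed during init versus one that is read later. It is only a sketch under stated assumptions: `ToyEngine`, `engine_mode`, `debug`, `RunWithLateSet` and `RunWithEarlySet` are hypothetical names, and none of this is Tesseract's actual code.

```cpp
#include <map>
#include <string>

// Toy model, NOT Tesseract's code: every name here is hypothetical.
class ToyEngine {
public:
  // Stores a parameter value; may be called before or after Init().
  void SetVariable(const std::string &name, const std::string &value) {
    params_[name] = value;
  }

  // Init-time parameter: "engine_mode" is read exactly once, here.
  void Init() {
    auto it = params_.find("engine_mode");
    loaded_mode_ = (it != params_.end()) ? it->second : "default";
  }

  // Run-time parameter: "debug" is looked up on every call,
  // so a value set after Init() is still seen.
  std::string Recognize() const {
    auto it = params_.find("debug");
    const std::string debug = (it != params_.end()) ? it->second : "0";
    return "mode=" + loaded_mode_ + " debug=" + debug;
  }

private:
  std::map<std::string, std::string> params_;
  std::string loaded_mode_ = "uninitialized";
};

// Order described in the thread: Init() first, then the -c values.
std::string RunWithLateSet() {
  ToyEngine engine;
  engine.Init();
  engine.SetVariable("engine_mode", "0");  // too late: Init() already ran
  engine.SetVariable("debug", "1");        // still effective
  return engine.Recognize();
}

// Order argued for in the thread: values are known before Init().
std::string RunWithEarlySet() {
  ToyEngine engine;
  engine.SetVariable("engine_mode", "0");
  engine.SetVariable("debug", "1");
  engine.Init();
  return engine.Recognize();
}
```

In this model `RunWithLateSet()` yields `mode=default debug=1` and `RunWithEarlySet()` yields `mode=0 debug=1`: the late assignment is stored, but nothing reads it again, which is the sense in which an init parameter set after init is a no-op.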
So ideally Tesseract should
Using:
Works fine when the config file is in the right path.
I mean, the config files are already given to init() in the current code.
For the parameters that are set with
@amitdo: If I understood you correctly, you suggest this kind of modification of tesseract:

diff --git a/src/tesseract.cpp b/src/tesseract.cpp
index 2c27d2a0..a6bd6405 100644
--- a/src/tesseract.cpp
+++ b/src/tesseract.cpp
@@ -273,32 +273,6 @@ static void PrintHelpMessage(const char *program) {
       program, program, program);
 }

-static bool SetVariablesFromCLArgs(tesseract::TessBaseAPI &api, int argc, char **argv) {
-  bool success = true;
-  char opt1[256], opt2[255];
-  for (int i = 0; i < argc; i++) {
-    if (strcmp(argv[i], "-c") == 0 && i + 1 < argc) {
-      strncpy(opt1, argv[i + 1], 255);
-      opt1[255] = '\0';
-      char *p = strchr(opt1, '=');
-      if (!p) {
-        fprintf(stderr, "Missing = in configvar assignment\n");
-        success = false;
-        break;
-      }
-      *p = 0;
-      strncpy(opt2, strchr(argv[i + 1], '=') + 1, sizeof(opt2) - 1);
-      opt2[254] = 0;
-      ++i;
-
-      if (!api.SetVariable(opt1, opt2)) {
-        fprintf(stderr, "Could not set option: %s=%s\n", opt1, opt2);
-      }
-    }
-  }
-  return success;
-}
-
 static void PrintLangsList(tesseract::TessBaseAPI &api) {
   std::vector<std::string> languages;
   api.GetAvailableLanguagesAsVector(&languages);
@@ -485,7 +459,16 @@ static bool ParseArgs(int argc, char **argv, const char **lang, const char **ima
       *print_fonts_table = true;
 #endif // ndef DISABLED_LEGACY_ENGINE
     } else if (strcmp(argv[i], "-c") == 0 && i + 1 < argc) {
-      // handled properly after api init
+      std::string argument(argv[i + 1]);
+      auto equal_pos = argument.find('=');
+      if (equal_pos == std::string::npos) {
+        throw std::invalid_argument("Missing '=' in configvar assignment");
+      }
+      // Extract key and value
+      std::string key = argument.substr(0, equal_pos);
+      std::string value = argument.substr(equal_pos + 1);
+      vars_vec->push_back(key);
+      vars_values->push_back(value);
       ++i;
     } else if (*image == nullptr) {
       *image = argv[i];
@@ -736,10 +719,6 @@ int main(int argc, char **argv) {
   const int init_failed = api.Init(datapath, lang, enginemode, &(argv[arg_i]), argc - arg_i,
                                    &vars_vec, &vars_values, false);

-  if (!SetVariablesFromCLArgs(api, argc, argv)) {
-    return EXIT_FAILURE;
-  }
-
   // SIMD settings might be overridden by config variable.
   tesseract::SIMDDetect::Update();

Right? I am just wondering about that comment (
Zdenko, about your patch: LGTM. About that comment: I think you added it. You just didn't take into account the INIT parameters.
I don't think there is a good reason to do this.
BTW, the number of INIT parameters is very small.
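As a rough, self-contained illustration of parsing the -c arguments before init, the sketch below collects VAR=VALUE pairs into two parallel vectors. `CollectConfigVars` is a hypothetical helper: it is neither the patch above nor part of Tesseract, and unlike the patch it reports a missing '=' through its return value instead of throwing.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not Tesseract code): gathers every "-c VAR=VALUE"
// pair so that names and values can be handed to init together.
// Returns false if a "-c" argument has no '=' or an empty name.
bool CollectConfigVars(int argc, const char *const *argv,
                       std::vector<std::string> *vars_vec,
                       std::vector<std::string> *vars_values) {
  for (int i = 1; i < argc; ++i) {
    if (std::string(argv[i]) != "-c" || i + 1 >= argc) {
      continue;  // not a complete "-c VAR=VALUE" pair
    }
    const std::string assignment(argv[++i]);
    const std::string::size_type pos = assignment.find('=');
    if (pos == std::string::npos || pos == 0) {
      return false;
    }
    // Split at the first '=' only, so the value may itself contain '='.
    vars_vec->push_back(assignment.substr(0, pos));
    vars_values->push_back(assignment.substr(pos + 1));
  }
  return true;
}
```

With this shape, an argument such as `a=b=c` is stored as name `a` and value `b=c`, and `x=` is accepted with an empty value; whether an empty value should be allowed is a design choice this sketch does not settle.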
Note by @amitdo. The original title was:
Inconsistency Between --oem and -c "tessedit_ocr_engine_mode"
Current Behavior
As far as I can tell, these two options should be identical, however the attached image produces different output. With --oem it says 11, with the config var it says 1. I should add that with any --oem value other than 0, it returns '1' not '11'.
Full command:
Expected Behavior
Both commands should return the same result, ideally 11.
Suggested Fix
They should either return the same information or the docs should be updated to clarify the difference.
tesseract -v
tesseract 5.4.1
leptonica-1.84.1
libgif 5.2.2 : libjpeg 6b (libjpeg-turbo 3.0.2) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.3.1.zlib-ng : libwebp 1.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libcurl/8.9.1 OpenSSL/3.2.2 zlib/1.3.1.zlib-ng libidn2/2.3.7 nghttp2/1.62.1
Operating System
No response
Other Operating System
Fedora 42.
uname -a
Linux deux-ex 6.11.7-300.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 8 19:23:10 UTC 2024 x86_64 GNU/Linux
Compiler
No response
CPU
Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz (12)
Virtualization / Containers
None.
Other Information
No response