Challenges processing textract #187
Comments
I don't know why the local Docker installation does not pick up textract2page (assuming you ran […]).
If you do install textract2page, consider updating to slub/textract2page#23 – it should give the best results, but is still not 100% tested.
@bertsky Thanks for your reply! Indeed, I had been using the version on Dockerhub. Since I'm currently most interested in the textract2page conversion, I think I'll focus first on getting output from the textract2page utility, and once that works, I'll work on ocr-fileformat.
For both the master branch and your slub/textract2page#23 PR, I got the same result as above: a 0-byte XML file. To troubleshoot, I tried running textract2page on one of the included test images and got the same result:
textract2page on toplevel-reading-order [?] via 🐍 v3.11.7
❯ textract2page tests/workspace/textract_responses/18xx-Missio-EMU-0042.json tests/workspace/images/18xx-Missio-EMU-0042.jpg > test.xml
textract2page on toplevel-reading-order [?] via 🐍 v3.11.7
❯ ls -l test.xml
-rw-r--r--@ 1 joe staff 0 Aug 26 08:55 test.xml
Do you have any suggestions for coaxing textract2page to return results? I'd be happy to provide any info about my environment that would help. For starters, I'm on an Intel Mac running macOS 14.6.1, with Python 3.11 (miniconda3-3.11-24.1.2-0, installed via asdf).
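When the output is a 0-byte file with no error message, it can help to rule out the input first. The following is a minimal sketch, not part of textract2page: the helper names `summarize_textract` and `is_empty_output` are made up for illustration, and it assumes the JSON file is a saved Textract response with its results in a top-level `Blocks` list whose entries carry a `BlockType` field.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_textract(json_path):
    """Count the blocks in a saved Textract response by their BlockType."""
    response = json.loads(Path(json_path).read_text(encoding="utf-8"))
    # Assumption: results live in a top-level "Blocks" list.
    blocks = response.get("Blocks", [])
    return Counter(block.get("BlockType", "UNKNOWN") for block in blocks)


def is_empty_output(xml_path):
    """Return True when the converter wrote no bytes at all."""
    return Path(xml_path).stat().st_size == 0
```

Calling `summarize_textract("tests/workspace/textract_responses/18xx-Missio-EMU-0042.json")` and `is_empty_output("test.xml")` would then show whether the JSON contains any blocks at all. If it does and the XML is still empty, the problem is more likely in the installation than in the input.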
With an eye toward getting the current version of ocr-fileformat working locally, I went ahead and installed it using the instructions in the README. While the build steps appeared to work perfectly, I am getting an error when calling the utility.
❯ ~/.local/bin/ocr-transform -h
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 39: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 41: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 43: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 57: alto4.2__alto2.1: syntax error: invalid arithmetic operator (error token is ".2__alto2.1")
Usage:
ocr-transform [OPTIONS] <from> <to> [<infile> [<outfile>]] [-- <script-args>]
ocr-transform [OPTIONS] <from> <to> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help Show this help, and exit
ocr-transform [OPTIONS] -v|--version Show version, and exit
ocr-transform [OPTIONS] -L|--list List available from/to, and exit
Options:
--debug -d Increase debug level by 1, can be repeated
Transformations:
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 23: show_transformations: command not found
With the debug flag set:
❯ ~/.local/bin/ocr-transform -d -d
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 39: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 41: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 43: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 57: alto4.2__alto2.1: syntax error: invalid arithmetic operator (error token is ".2__alto2.1")
+ local from= to= infile=- outfile=- transformer
+ shift 2
+ [[ -z '' ]]
+ show_usage 'Must set '\''from'\'' parameter'
+ [[ 1 -gt 0 ]]
+ logerr 'Must set '\''from'\'' parameter'
+ local 'IFS=
'
+ for line in '$*'
+ echo -e '\033[0m[\033[1;31mERROR\033[0m] Must set '\''from'\'' parameter'
[ERROR] Must set 'from' parameter
+ echo 'Usage:
ocr-transform [OPTIONS] <from> <to> [<infile> [<outfile>]] [-- <script-args>]
ocr-transform [OPTIONS] <from> <to> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help Show this help, and exit
ocr-transform [OPTIONS] -v|--version Show version, and exit
ocr-transform [OPTIONS] -L|--list List available from/to, and exit
Options:
--debug -d Increase debug level by 1, can be repeated
'
Usage:
ocr-transform [OPTIONS] <from> <to> [<infile> [<outfile>]] [-- <script-args>]
ocr-transform [OPTIONS] <from> <to> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help Show this help, and exit
ocr-transform [OPTIONS] -v|--version Show version, and exit
ocr-transform [OPTIONS] -L|--list List available from/to, and exit
Options:
--debug -d Increase debug level by 1, can be repeated
+ echo -e '\n Transformations:'
Transformations:
+ show_transformations
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 23: show_transformations: command not found
+ sed 's/^/ /'
+ [[ 1 -gt 0 ]]
+ exit 1
I'd be happy to provide any info or perform any steps that could help shed light.
How did you install textract2page – in a local venv (as advised by the readme)? Also, if you did pull the PR: you probably have to specify […] What does […]
Regarding the installation of ocr-fileformat on macOS – I believe you first need to install bash (and probably other packages) via brew.
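The `declare: -A: invalid option` errors above are what Bash prints when it is older than 4.0, the release that introduced associative arrays; the system Bash on macOS is 3.2. A minimal sketch for checking this, assuming a `bash` executable is on the `PATH` (the function names are hypothetical, not part of ocr-fileformat):

```python
import re
import subprocess


def supports_associative_arrays(bash_version):
    """Return True if a version string such as '3.2.57(1)-release' is Bash 4.0 or newer."""
    match = re.match(r"(\d+)\.", bash_version.strip())
    if match is None:
        raise ValueError(f"unrecognized Bash version: {bash_version!r}")
    # `declare -A` (associative arrays) was introduced in Bash 4.0.
    return int(match.group(1)) >= 4


def current_bash_version(bash="bash"):
    """Ask the given bash executable for the value of its $BASH_VERSION."""
    result = subprocess.run(
        [bash, "-c", 'printf %s "$BASH_VERSION"'],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

`supports_associative_arrays(current_bash_version())` reports on whichever `bash` comes first on the `PATH`. Whether ocr-transform actually picks up that one depends on its shebang line, which is not shown in this thread.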
Oops, sorry – turns out textract2page recently introduced a packaging bug, which affected editable installs. Please pull from the PR again (tip should be at bf89b08)!
@bertsky Thank you! To make a long story short, your suggestion for building from your branch worked, and I was able to generate PAGE XML for both the included sample files and my own! Here's the final output, using the input file I mentioned in the original posting above. The only possible issue is that the HEAD of your branch is 4eb96ab, not bf89b08 as you just mentioned. But I've left my full answers to your questions, in case there is any useful info there.
First, as to how I first installed textract2page, I used Python 3.11 (miniconda3-3.11-24.1.2-0, which I'd installed via asdf), with the following commands:
pip install textract2page
(After installing I got a […]
pip install ocrd_modelfactory
... which allowed the utility to run without the same error, but it was still yielding 0-byte XML output.
Second, when installing from your branch, I had cloned the repo, switched to your branch, and built using the following (also suggested by ChatGPT in the same thread linked above):
pip install -r requirements.txt
Here's the output of the commands you asked about, under the original attempted installation of your branch:
❯ pip show textract2page
DEPRECATION: Loading egg at /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: textract2page
Version: 0.0.0
Summary: Convert AWS Textract JSON to PRImA PAGE XML
Home-page:
Author: Arne Rümmler
Author-email: [email protected]
License: Apache Software License
Location: /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg
Requires: click, ocrd, pillow
Required-by:
❯ python -c "import textract2page; print(textract2page.__version__)"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/joe/workspace/textract2page/textract2page/__init__.py", line 4, in <module>
from ._version import version
ModuleNotFoundError: No module named 'textract2page._version'
After installing as you suggested (via […]):
❯ pip show textract2page
DEPRECATION: Loading egg at /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: textract2page
Version: 0.0.0
Summary: Convert AWS Textract JSON to PRImA PAGE XML
Home-page:
Author: Arne Rümmler
Author-email: [email protected]
License: Apache Software License
Location: /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg
Requires: click, ocrd, pillow
Required-by:
❯ python -c "import textract2page; print(textract2page.__version__)"
0.3.dev15+g4eb96ab
And now, when I run textract2page on both the included test files and my own, it works! Thank you!
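In the output above, `pip show` reports version 0.0.0 from an egg in site-packages, while the import resolves to a different copy (the traceback points at the workspace checkout, and `__version__` prints 0.3.dev15+g4eb96ab). A minimal sketch for spotting this kind of mismatch, using only the standard library; `describe_install` is a hypothetical helper, not part of textract2page:

```python
import importlib.util
from importlib import metadata


def describe_install(module_name, dist_name=None):
    """Report where a module would be imported from and what version its metadata claims."""
    # find_spec locates a top-level module without executing it,
    # so a broken __init__.py does not get in the way.
    spec = importlib.util.find_spec(module_name)
    try:
        dist_version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        dist_version = None
    return {
        "import_origin": spec.origin if spec else None,
        "metadata_version": dist_version,
    }
```

If `describe_install("textract2page")` shows an `import_origin` outside the directory that `pip show` lists as the location, two installations are probably shadowing each other, and removing the stale one is a reasonable first step.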
Thank you for providing this great tool! I'm writing with two questions about transforming Textract output into PAGE XML.
First, when I use the version of the ocr-fileformat application hosted at digi.bib.uni-mannheim.de and try to transform Textract JSON to PAGE, the resulting XML file is empty.
Here's a sample textract input and resulting page output.
No errors - just an empty file.
For context, the AWS CLI command I used to produce this input was (with Bucket, Name, and Region obfuscated here):
Does anyone have tips for ways to ensure ocr-fileformat can process textract output? (If not, I could ask at the upstream project https://github.com/slub/textract2page/ - but I figured I'd start here.)
Second, when I try to process the output locally via Docker, I don't see the textract option. From https://digi.bib.uni-mannheim.de/ocr-fileformat/:
From version running locally in Docker:
Does anyone have suggestions for getting the textract input option to appear when running locally via Docker?
Thank you!