- About
- Installation
- Available Programs
  - hocr-check -- check the hOCR file for errors
  - hocr-combine -- combine pages in multiple hOCR files into a single document
  - hocr-eval -- compute number of segmentation and OCR errors
  - hocr-eval-geom -- compute over, under, and mis-segmentations
  - hocr-eval-lines -- compute OCR errors of hOCR output relative to text ground truth
  - hocr-extract-g1000 -- extract lines from Google 1000 book sample
  - hocr-extract-images -- extract the images and texts within all the ocr_line elements
  - hocr-lines -- extract the text within all the ocr_line elements
  - hocr-merge-dc -- merge Dublin Core meta data into the hOCR HTML header
  - hocr-pdf -- create a searchable PDF from a pile of hOCR and IMAGE
  - hocr-split -- split an hOCR file into individual pages
  - hocr-wordfreq -- calculate word frequency in an hOCR file
- Unit tests
hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
There is a Public Specification for the hOCR Format.
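To make the idea of "information embedded invisibly in standard HTML" concrete, here is a minimal sketch that reads bounding boxes out of a small hOCR fragment using only the Python standard library. It assumes the convention from the hOCR specification that elements carry a class such as ocr_page or ocr_line and a title attribute of the form "bbox x0 y0 x1 y1". The sample markup and the names parse_bbox and BBoxCollector are made up for illustration and are not part of hocr-tools.

```python
from html.parser import HTMLParser

# Hypothetical sample: one page containing one line, in the style of hOCR markup.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 45">Hello world</span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    # Properties in the title attribute are separated by semicolons.
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class BBoxCollector(HTMLParser):
    """Collect (class, bbox) for every element whose class starts with 'ocr'."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        if cls.startswith("ocr"):
            self.boxes.append((cls, parse_bbox(attrs.get("title") or "")))


collector = BBoxCollector()
collector.feed(SAMPLE)
for cls, bbox in collector.boxes:
    print(cls, bbox)
# ocr_page (0, 0, 800, 600)
# ocr_line (10, 20, 300, 45)
```

Because the layout data lives in ordinary attributes, a browser renders this fragment as plain text while a tool can still recover the geometry.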
Each command line program is self-contained; if you have Python 2.7 with the required packages installed, it should just work. (Unfortunately, that means some code duplication; we may revisit this issue in later revisions.)
You can install hocr-tools along with its dependencies from PyPI:
sudo pip install hocr-tools
On a Debian/Ubuntu system, install the dependencies from packages:
sudo apt-get install python-lxml python-reportlab python-pil \
python-beautifulsoup python-numpy python-scipy python-matplotlib
Or, to fetch dependencies from the cheese shop:
sudo pip install -r requirements.txt # basic
Then install the dist:
sudo python setup.py install
Alternatively, use a virtualenv. Set it up once:
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
Subsequently, activate it before running the tools:
source venv/bin/activate
./hocr-...
Included command line programs:
hocr-check [-h] [-o] file.hocr
Perform consistency checks on the hOCR file.
hocr-combine [-h] file1.html [file2.html ...]
Combine the OCR pages contained in each HTML file into a single document. The document metadata is taken from the first file.
hocr-eval-lines [-h] [-v] true-lines.txt actual.hocr
Evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in true-lines.txt and the ocr_line elements in actual.hocr agree (most ASCII output from OCR systems satisfies this requirement).
hocr-eval-geom [-h] [-e ELEMENT] [-o SIGNIFICANT_OVERLAP] [-c CLOSE_MATCH] truth.hocr actual.hocr
Compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.
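The following is a rough sketch of how over-, under-, and mis-segmentation could be counted from bounding boxes. It is not the algorithm used by hocr-eval-geom: the overlap measures, the decision rules, and the default thresholds (0.1 and 0.9) are assumptions chosen for illustration, loosely mirroring the SIGNIFICANT_OVERLAP and CLOSE_MATCH options.

```python
def area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection(a, b):
    """Area shared by two boxes given as (x0, y0, x1, y1)."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def overlap(a, b):
    """Intersection area relative to the smaller box (assumed measure)."""
    smaller = min(area(a), area(b))
    return intersection(a, b) / smaller if smaller else 0.0


def iou(a, b):
    """Intersection over union, used here as the 'close match' measure."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def significant_hits(box, others, significant):
    return [o for o in others if overlap(box, o) > significant]


def segmentation_errors(truth, actual, significant=0.1, close=0.9):
    """Return (over, under, mis) counts under the assumed rules."""
    over = under = mis = 0
    for t in truth:
        hits = significant_hits(t, actual, significant)
        if len(hits) > 1:
            over += 1  # one true element was split into several
        elif len(hits) == 1:
            back = significant_hits(hits[0], truth, significant)
            if len(back) == 1 and iou(t, hits[0]) < close:
                mis += 1  # one-to-one match, but the boxes differ too much
    for a in actual:
        if len(significant_hits(a, truth, significant)) > 1:
            under += 1  # one actual element covers several true ones
    return over, under, mis


truth = [(0, 0, 100, 20), (0, 30, 100, 50), (0, 60, 100, 80), (0, 90, 100, 110)]
actual = [
    (0, 0, 50, 20), (50, 0, 100, 20),  # line 1 split in two
    (0, 30, 100, 80),                  # lines 2 and 3 merged
    (0, 90, 60, 110),                  # line 4 cut short
]
print(segmentation_errors(truth, actual))  # (1, 1, 1)
```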
hocr-eval [-h] [-d] [-v] [-i IMGFILE] true.hocr actual.hocr
Evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors.
It works by aligning segmentation components geometrically, and for each segmentation component that can be aligned, computing the string edit distance of the text the segmentation component contains.
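The string edit distance mentioned above can be sketched as a standard Levenshtein distance. This is an illustrative implementation, not the code used by hocr-eval; the function name and the sample line pairs are assumptions, and the geometric alignment step is taken as already done.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Hypothetical (ground truth, OCR output) pairs for components already aligned.
aligned = [("hOCR tools", "h0CR tools"), ("second line", "secondline")]
total = sum(edit_distance(t, a) for t, a in aligned)
print(total)  # 2
```

Summing the per-component distances gives a character-level error count for everything that could be aligned; components that could not be aligned would be reported separately as segmentation errors.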
hocr-extract-g1000
Extract lines from the Google 1000 book sample.
hocr-extract-images [-h] [-b BASENAME] [-p PATTERN] [-e ELEMENT] file.hocr
Extract the images and texts within all the ocr_line elements within the hOCR file.
The BASENAME is the image directory, the default pattern is line-%03d.png, and the default element is ocr_line.
hocr-lines [-h] file.hocr
Extract the text within all the ocr_line elements within the hOCR file given by FILE. If called without any file, hocr-lines reads hOCR data from stdin.
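As a rough illustration of what extracting the text of ocr_line elements involves, the sketch below uses only the Python standard library. It is not the implementation of hocr-lines; the class LineText, the function extract_lines, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; ignored so they do not upset depth counting.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class LineText(HTMLParser):
    """Collect the text content of every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0  # nesting depth inside the current ocr_line; 0 = outside
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            self.lines.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_lines(hocr):
    parser = LineText()
    parser.feed(hocr)
    parser.close()
    return parser.lines


SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 45"><span class="ocrx_word">Hello</span> <span class="ocrx_word">world</span></span>
  <span class="ocr_line" title="bbox 10 50 300 75">Second line</span>
</div>
"""
print(extract_lines(SAMPLE))  # ['Hello world', 'Second line']
```

Passing the result of sys.stdin.read() to extract_lines would mimic the stdin behaviour described above.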
hocr-merge-dc [-h] dc.xml input.hocr > merge.hocr
Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.
hocr-pdf [-h] [-d dpi] [-e ext] [-f font] [-i images] [-o outfile] [-n] [-r] [-v] [-m] imgdir
Create a searchable PDF from a pile of hOCR and IMAGE.
Corresponding image and hOCR files must share the same base name and differ only in their file extension. All of these files should lie in one directory, which is passed as the imgdir argument. For example, to run the command in the current directory and save the output as out.pdf:
hocr-pdf -o out.pdf .
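The naming requirement can be checked before running hocr-pdf. The sketch below pairs files by base name and reports any that lack a partner. It is a hypothetical helper, not part of hocr-tools, and the extensions .jpg and .hocr are assumptions for the example.

```python
import os


def pair_by_basename(filenames, image_ext=".jpg", hocr_ext=".hocr"):
    """Return (pairs, unmatched) for image/hOCR files sharing a base name."""
    images, hocrs = {}, {}
    for name in filenames:
        base, ext = os.path.splitext(name)
        if ext.lower() == image_ext:
            images[base] = name
        elif ext.lower() == hocr_ext:
            hocrs[base] = name
    pairs = [(images[b], hocrs[b]) for b in sorted(images) if b in hocrs]
    # Base names present on only one side have no partner file.
    unmatched = sorted(set(images) ^ set(hocrs))
    return pairs, unmatched


files = ["page-001.jpg", "page-001.hocr", "page-002.jpg", "page-003.hocr", "notes.txt"]
pairs, unmatched = pair_by_basename(files)
print(pairs)      # [('page-001.jpg', 'page-001.hocr')]
print(unmatched)  # ['page-002', 'page-003']
```

In practice the list of names could come from os.listdir on the image directory.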
hocr-split [-h] file.hocr pattern
Split a multipage hOCR file into hOCR files containing one page each. The pattern should be something like "base-%03d.html".
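The pattern is a printf-style format string in which %03d stands for a page number padded with zeros to three digits. The short sketch below shows how such a pattern expands in Python; whether hocr-split starts counting at 0 or 1 is not stated here, so the starting number is an assumption.

```python
pattern = "base-%03d.html"

# Assumed numbering starting at 1, for illustration only.
names = [pattern % page for page in range(1, 4)]
print(names)  # ['base-001.html', 'base-002.html', 'base-003.html']

# Numbers wider than three digits are not truncated.
print(pattern % 1000)  # base-1000.html
```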
hocr-wordfreq [-h] [-i] [-s] [-y] [-n MAX] file.hocr
Outputs a list of the most frequent words in an hOCR file with their number of occurrences.
If called without any file, hocr-wordfreq reads hOCR data (for example from hocr-combine) from stdin. By default, the first 10 words are shown, but any number can be requested with -n. Use -i to ignore upper and lower case.
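The counting itself can be sketched with collections.Counter. This is an illustration of the idea, not the hocr-wordfreq implementation: the tokenization rule (runs of word characters) and the function name are assumptions, and the input is plain text rather than hOCR.

```python
import re
from collections import Counter


def word_frequencies(text, max_words=10, ignore_case=False):
    """Return the max_words most frequent words with their counts."""
    if ignore_case:
        text = text.lower()
    words = re.findall(r"\w+", text)  # assumed definition of a "word"
    return Counter(words).most_common(max_words)


text = "The cat saw the other cat. The end."
print(word_frequencies(text, max_words=2, ignore_case=True))
# [('the', 3), ('cat', 2)]
```

Without ignore_case, "The" and "the" are counted as different words, which is the distinction the -i option removes.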
The unit tests are written using the tsht framework. To run the full test suite:
./test/tsht
To run a single test, pass its path:
./test/tsht <path-to/unit-test.tsht>
For example:
./test/tsht test/hocr-pdf/test-hocr-pdf.tsht
To write a new test, see the documentation in the tsht repository and take a look at the existing unit tests. Then:
- Create a new directory under ./test
- Copy any test assets (images, hOCR files...) to this directory
- Create a file <name-of-your-test>.tsht starting from this template:
#!/usr/bin/env tsht
# adjust to the number of your tests
plan 1
# write your tests here
exec_ok "hocr-foo" "-x" "foo"
# remove any temporary files
# rm some-generated-file