Skip to content

Latest commit

 

History

History
240 lines (167 loc) · 9.29 KB

README.md

File metadata and controls

240 lines (167 loc) · 9.29 KB

OCR-D Manager

OCR-D Manager is a server that mediates between Kitodo and OCR-D. It resides on the site of the Kitodo installation (so the actual OCR server can be managed independently) but runs in its own container (so Kitodo can be managed independently).

Specifically, it gets called by Kitodo.Production or Kitodo.Presentation to handle OCR for a document, and in turn calls the OCR-D Controller for workflow processing.

For an integration as a service container, orchestrated with other containers (Kitodo+Controller+Monitor), see this meta-repo.

OCR-D Manager is responsible for

  • data transfer from Kitodo to Manager to Controller and back,
  • delegation to Controller,
  • signalling/reporting,
  • result validation,
  • result extraction (putting ALTO files in the process directory where Kitodo.Production expects them, or updating the METS for Kitodo.Presentation).

It is currently implemented as SSH login server with an installation of OCR-D core and an SSH client to connect to the Controller.

Usage

Building

Build or pull the Docker image:

make build # or docker pull ghcr.io/slub/ocrd_manager

Starting and mounting

Then run the container – providing a host-side directory for the volumes …

  • DATA: directory for data processing (including images or existing workspaces),
    defaults to current working directory
  • WORKFLOWS: directory for scripts (preconfigured workflows),
    defaults to ./workflows in current working directory

… but also files …

  • KEYS: public key credentials for log-in to the manager
  • PRIVATE: private key credentials for log-in to the controller …

… and (optionally) some environment variables

  • UID: numerical user identifier to be used by programs in the container
    (will affect the files modified/created); defaults to current user
  • GID: numerical group identifier to be used by programs in the container
    (will affect the files modified/created); defaults to current group
  • UMASK: numerical user mask to be used by programs in the container
    (will affect the files modified/created); defaults to 0002
  • PORT: numerical TCP port to expose the SSH server on the host side
    defaults to 9022 (for non-priviledged access)
  • CONTROLLER network address:port for the controller client (must be reachable from the container network)
  • ACTIVEMQ network address:port of ActiveMQ server listening to result status (must be reachable from the container network)
  • NETWORK name of the Docker network to use
    defaults to bridge (the default Docker network)

… thus, for example:

make run DATA=/mnt/workspaces WORKFLOWS=/mnt/workflows KEYS=~/.ssh/id_rsa.pub PORT=9022 PRIVATE=~/.ssh/id_rsa

(You can also run the service via docker-compose manually – just cp .env.example .env and edit to your needs.)

General management

Then you can log in as user ocrd from remote (but let's use manager in the following – without loss of generality):

ssh -p 9022 ocrd@manager bash -i

(Typically though, you will run a non-interactive script, see next section.)

Processing

In the Manager, you can run shell scripts that do

  • data management and validation via ocrd CLIs
  • OCR processing by running workflows in the controller via ssh ocrd@ocrd_controller log-ins

The data management will depend on which Kitodo context you want to integrate into (Production 2 / 3 or Presentation).

From image to ALTO files

For Kitodo.Production, there is a preconfigured script process_images.sh (or for_production.sh) which takes the following arguments:

SYNOPSIS:

process_images.sh [OPTIONS] DIRECTORY

where OPTIONS can be any/all of:
 --lang LANGUAGE    overall language of the material to process via OCR
 --script SCRIPT    overall script of the material to process via OCR
 --workflow FILE    workflow file to use for processing, default:
                    ocr-workflow-default.sh
 --no-validate      skip comprehensive validation of workflow results
 --img-subdir IMG   name of the subdirectory to read images from, default:
                    images
 --ocr-subdir OCR   name of the subdirectory to write OCR results to, default:
                    ocr/alto
 --proc-id ID       process ID to communicate in ActiveMQ callback
 --task-id ID       task ID to communicate in ActiveMQ callback
 --help             show this message and exit

and DIRECTORY is the local path to process. The script will import
the images from DIRECTORY/IMG into a new (temporary) METS and
transfer this to the Controller for processing. After resyncing back
to the Manager, it will then extract OCR results and export them to
DIRECTORY/OCR.

If ActiveMQ is used, the script will exit directly after initialization,
and run processing in the background. Completion will then be signalled
via ActiveMQ network protocol (using the proc and task ID as message).

ENVIRONMENT VARIABLES:

 CONTROLLER: host name and port of OCR-D Controller for processing
 ACTIVEMQ: URL of ActiveMQ server for result callback (optional)
 ACTIVEMQ_CLIENT: path to ActiveMQ client library JAR file (optional)

The workflow parameter is optional and defaults to the preconfigured script ocr-workflow-default.sh which contains a trivial workflow:

  • import of the images into a new OCR-D workspace
  • preprocessing, layout analysis and text recognition with a single Tesseract processor call
  • format conversion of the result from PAGE-XML to ALTO-XML

It can be replaced with the (path) name of any workflow script mounted under /workflows or /data.

For example (assuming testdata is a directory with image files mounted under /data):

ssh -T -p 9022 ocrd@manager process_images.sh --proc-id 1 --task-id 3 --lang deu --script Fraktur --workflow myocr.sh testdata

From METS to METS file

For Kitodo.Presentation, there is a preconfigured script process_mets.sh (or for_presentation.sh) which takes the following arguments:

SYNOPSIS:

process_mets.sh [OPTIONS] METS

where OPTIONS can be any/all of:
 --workflow FILE    workflow file to use for processing, default:
                    ocr-workflow-default.sh
 --no-validate      skip comprehensive validation of workflow results
 --pages RANGE      selection of physical page range to process
 --img-grp GRP      fileGrp to read input images from, default:
                    DEFAULT
 --ocr-grp GRP      fileGrp to write output OCR text to, default:
                    FULLTEXT
 --url-prefix URL   convert result text file refs from local to URL
                    and prefix them
 --help             show this message and exit

and METS is the path of the METS file to process. The script will copy
the METS into a new (temporary) workspace and transfer this to the
Controller for processing. After resyncing back, it will then extract
OCR results and copy them to METS (adding file references to the file
and copying files to the parent directory).

ENVIRONMENT VARIABLES:

 CONTROLLER: host name and port of OCR-D Controller for processing

For the workflow parameter, the same goes here as above.

For example (assuming testdata is a directory with image files mounted under /data):

ssh -T -p 9022 ocrd@manager process_mets.sh --lang deu --script Fraktur --workflow myocr.sh testdata/mets.xml

Data transfer

For sharing data between the Manager and Controller, it is recommended to transfer files explicitly (as this will make the costs more measurable and controllable).

(This is currently implemented via rsync.)

The data lifecycle should be:

  • on Controller: short-lived
  • on Manager: as long as process is active in Production

(This is currently not managed.)

Logging

All logs are accumulated on standard output, which can be inspected via Docker:

docker logs ocrd_manager

Logs for all services can also be viewed on the Monitor web server.

Testing

After building and starting, you can use the test target for a round-trip:

make test DATA=/mnt/workspaces

This will download sample data and run the default workflow on them. (All logging is still accumulated on the Docker output, so the shell itself will not print any. See above)

(If the Manager has been started externally already, make sure to pass the correct value for the NETWORK variable – the makefile will then attempt to use docker exec instead of ssh ocrd@localhost to connect.)

To clean up the results, use:

make clean-testdata

Maintainers

If you have any questions or encounter any problems, please do not hesitate to contact us.