Text Extraction Workflows
OCR runs as its own workflow after common accessioning has completed, in its own version of the object. For more information on the architectural choices, see https://github.com/sul-dlss-labs/engineering-design-adrs/blob/main/0022-derivative-generation.md
OCR itself is produced by ABBYY, which is Windows software installed on its own server. The ABBYY instance is managed by Ops and DPG and is also used by the Goobi system. Goobi is used by DPG for certain imaging workflows (e.g. scanning).
The ocrWF moves content from SDR preservation to shared mounts on the ABBYY server, which are monitored by ABBYY. The workflow also creates instructions for ABBYY in the form of an .XML "ticket" file. ABBYY discovers these files and performs the OCR, generating .XML "result" files and the actual OCR content. An ABBYY watcher process in common accessioning monitors for new files generated by ABBYY and then sets the appropriate workflow step to complete, allowing the workflow to continue. The new OCR files are then accessioned in their own version.
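To illustrate the watcher idea described above, here is a minimal polling sketch in Ruby. It is not the common-accessioning implementation: the `ResultWatcher` class, its `poll` method, and the assumption that result files are `*.xml` files sitting directly in one directory are all hypothetical, and the real watcher may detect files by a different mechanism.

```ruby
require 'set'

# Hypothetical sketch of a watcher that notices new ABBYY result files.
# The class name and directory layout are assumptions for illustration.
class ResultWatcher
  def initialize(result_dir)
    @result_dir = result_dir
    @seen = Set.new
  end

  # Returns the result XML paths that have appeared since the previous call.
  def poll
    current = Dir.glob(File.join(@result_dir, '*.xml')).sort
    fresh = current.reject { |path| @seen.include?(path) }
    @seen.merge(fresh)
    fresh
  end
end
```

In a sketch like this, each path returned by `poll` would be the trigger for setting the corresponding workflow step to complete, so that the workflow can continue.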
For information on what to do if there are outages with ABBYY or the watcher process, see https://github.com/sul-dlss/DevOpsDocs/blob/master/projects/common-accessioning/common-accessioning-ops-concerns.md
For additional information on the workcycle that produced the OCR code, see https://drive.google.com/drive/u/1/folders/13-mfhKMiWg7A8J7GLnAkH1jLRMXD6HVY
The workflow includes a step called "ocr-workspace-cleanup", which should remove most intermediate files in the shared ABBYY mount that are no longer needed once the workflow is complete. However, files may get left behind as a result of testing, mount issues, or images placed in the EXCEPTIONS folder. A rake task is provided to clean these up. If no options are provided, it defaults to a dry run mode that reports what would be deleted:
Dry run (default):
ROBOT_ENVIRONMENT=production bundle exec rake abbyy:cleanup
❗ Delete for real (pass true for should_perform_deletions):
ROBOT_ENVIRONMENT=production bundle exec rake 'abbyy:cleanup[true]'
Depending on the arguments passed, the task will either report on or actually perform the following:
- Delete empty INPUT folders
- Delete empty OUTPUT folders
- Delete XML ticket files older than 1 week
- Delete (success) XML result files older than 1 week
- Delete (failed) XML result files and images in the EXCEPTIONS folder older than 1 month
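The dry-run behavior and age thresholds listed above can be sketched in Ruby as follows. This is a simplified illustration, not the code behind `abbyy:cleanup`: the helper names (`stale_files`, `empty_subdirectories`, `cleanup`), the glob patterns, and the treatment of "1 month" as 30 days are all assumptions. Only the `should_perform_deletions` name and the one week and one month thresholds come from the task description.

```ruby
require 'fileutils'

ONE_WEEK  = 7 * 24 * 60 * 60   # seconds
ONE_MONTH = 30 * 24 * 60 * 60  # seconds; "1 month" approximated as 30 days (assumption)

# Returns files matching the glob whose modification time is older than max_age seconds.
def stale_files(glob, max_age, now: Time.now)
  Dir.glob(glob).select { |path| File.file?(path) && (now - File.mtime(path)) > max_age }
end

# Returns the immediate subdirectories of dir that contain no entries.
def empty_subdirectories(dir)
  Dir.children(dir)
     .map { |name| File.join(dir, name) }
     .select { |path| File.directory?(path) && Dir.empty?(path) }
end

# Reports each path, and deletes it only when should_perform_deletions is true.
def cleanup(paths, should_perform_deletions: false)
  paths.each do |path|
    if should_perform_deletions
      File.directory?(path) ? Dir.rmdir(path) : File.delete(path)
      puts "Deleted #{path}"
    else
      puts "Would delete #{path}"
    end
  end
  paths
end
```

Under these assumptions, failed result files and images in an EXCEPTIONS folder would be selected with `stale_files(pattern, ONE_MONTH)` while ticket and successful result files would use `ONE_WEEK`. Defaulting `should_perform_deletions` to false mirrors the task's dry-run default, so an accidental invocation only prints a report.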