deborahgu/abbyy-to-epub3

ABBYY XML to EPUB3

Introduction

This module transforms ABBYY XML documents, generated by ABBYY FineReader 10, into primitively accessible ePub 3. The code is optimized for ABBYY XML documents created by the Internet Archive, though it may work for other ABBYY XML as well.

Features

  1. Unicode-compliant
  2. Can handle left-to-right and right-to-left text.
  3. Attempts to recognize running headers, footers, and decimal or page numbers. The confidence level for fuzzy matching can be fine-tuned in config.ini. Errs on the side of minimizing false positives.
  4. Will use Kakadu image libraries if present, otherwise will fall back to Pillow.
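
As a rough illustration of the fuzzy matching mentioned in feature 3, the sketch below compares the first line of a page against first lines seen on earlier pages and treats a close match as a running header. This is not the module's actual implementation: the function name, the use of difflib, and the threshold value are assumptions made for the example, and the real setting names in config.ini may differ. A high threshold keeps false positives low, in line with the stated design goal.

```python
from difflib import SequenceMatcher


def is_running_header(candidate, previous_headers, threshold=0.9):
    """Return True if candidate closely matches any previously seen first line.

    threshold is a hypothetical stand-in for a confidence level that would
    be read from a configuration file.
    """
    normalized = candidate.strip().lower()
    for seen in previous_headers:
        ratio = SequenceMatcher(None, normalized, seen.strip().lower()).ratio()
        if ratio >= threshold:
            return True
    return False


seen_headers = ["THE HISTORY OF ROME"]
# An OCR error ("0" for "O") still scores above the threshold.
print(is_running_header("THE HISTORY OF R0ME", seen_headers))
# Unrelated text scores well below it.
print(is_running_header("Chapter the Second", seen_headers))
```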

Limitations

  1. Accessibility is inherently limited by the input ABBYY FineReader documents. If they are marked up with headings and other semantic markup, that structure will be incorporated into the ePub.
  2. There is currently no functionality for image description.
  3. The module can also transform ABBYY XML documents generated by ABBYY FineReader 6. However, those documents are not marked up with headings, so there is no structural navigation for accessibility.

Requirements

  • Python 3
  • If running epubcheck, a Java Runtime environment
  • If running DAISY Ace, NodeJS version >= 6.4.0
  • If using Kakadu, install the binaries and add them to your PATH and LD_LIBRARY_PATH
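
To check whether the Kakadu binaries are visible on your PATH before running a conversion, something like the following sketch may help. It assumes the relevant binary is kdu_expand, Kakadu's decompression tool. The function name and the returned labels are hypothetical and are not part of this package.

```python
import shutil


def choose_image_backend(kakadu_binary="kdu_expand"):
    """Return 'kakadu' if the given binary is found on PATH, else 'pillow'.

    The default binary name is an assumption; adjust it to match your
    Kakadu installation.
    """
    if shutil.which(kakadu_binary) is not None:
        return "kakadu"
    return "pillow"


print(choose_image_backend())
```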

Usage

From within a Python program:

from abbyy_to_epub3 import create_epub
book = create_epub.Ebook('docname')  # See *Assumptions* below.
book.craft_epub()

From the shell:

abbyy2epub docname     # See *Assumptions* below.

The available command line arguments are:

usage: abbyy2epub [-h] [-d] [--epubcheck level] [--ace level] docname

Process an ABBYY file into an EPUB

positional arguments:
  item_dir         The file path where this item's files are kept.
  item_identifier  The unique ID of this item.
  item_bookpath    The prefix to a specific book within an item. In a simple
                   book, usually the same as the item_identifier.

optional arguments:
  -h, --help   show this help message and exit
  -d, --debug  Show debugging information
  --epubcheck  Run EpubCheck on the newly created EPUB, given a severity level
  --ace        Run DAISY Ace on the newly created EPUB, given a severity level

System dependencies

Epubcheck: If you'd like to run epubcheck, there are certain system dependencies. Depending on your environment, these may need to be installed manually. On Ubuntu, I installed these with:

sudo apt-get install default-jre libpython3-dev

DAISY Ace: If you'd like to run Ace, there are certain system dependencies. Read the installation instructions, but in a nutshell:

  • Install NodeJS. Important: You need at least version 6.4.0, which is newer than the version in the package manager for many distributions (e.g., versions of Ubuntu before 17.10 Artful). If you have an older version on your system and you can't upgrade, consider running NodeJS in an isolated environment such as nodeenv.
  • Install Ace:
npm install @daisy/ace -g
  • Create a configuration file for the user account that will run the code, in ~/.config/DAISY Ace/. You can modify the configuration per the documentation (https://daisy.github.io/ace/docs/config/), but be sure to add this block:
{
    "cli": {
        "return-2-on-validation-error": true
    }
}
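
If a configuration file already exists, the required setting has to be merged into it rather than overwriting it. The sketch below shows one way to do that from Python. The helper name is hypothetical, and the file name config.json in the commented usage line is an assumption; check the Ace documentation for the exact file name on your platform.

```python
import json
from pathlib import Path


def ensure_ace_cli_setting(config_path):
    """Merge the required "cli" setting into an Ace config file.

    Creates the file (and its parent directory) if it does not exist,
    and preserves any other settings already present.
    """
    config_path = Path(config_path)
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("cli", {})["return-2-on-validation-error"] = True
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Assumption: the file is named config.json inside the directory named above.
# ensure_ace_cli_setting(Path.home() / ".config" / "DAISY Ace" / "config.json")
```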

Installation

This package can be installed on your local system. From the directory containing setup.py:

pip install -r requirements.txt
python setup.py develop
pip install .

You can rebuild the documentation, which is generated with Sphinx.

cd docs
make html

Deploying at the Internet Archive

Before deploying, make sure you bump the version of the package in __init__.py. Then, run the upload.sh script in the root of the repository and enter the appropriate Internet Archive credentials when prompted.

You can test that the package has been installed correctly by going to https://devpi.archive.org or by running $ pip3 install --upgrade -i https://petaboxdevpi:{PASSWORD}@devpi.archive.org/books/formats abbyy_to_epub3.

Note that petaboxdevpi:{PASSWORD} is not needed inside the IA network.

Testing

Run py.test from the top-level app directory. Create new tests in the tests subdirectory.

Assumptions

An item may contain one or more books, so an item_dir and item_identifier alone are not sufficient to isolate a specific book. We therefore require a third identifier, the item_bookpath, which acts as a prefix to the files of a specific book. Given a datanode and the item_dir of an item, all the constituent files for a book can be located using item_identifier and item_bookpath. The three values are:

  • The item_identifier is the unique ID of this item
  • The item_dir is the file path where this item's files are kept
  • The item_bookpath is the name of the particular book file, often the same as item_identifier

The structure is assumed to be:

  • scandata.xml describes the structure of the book (metadata, page numbers)
  • docname_abbyy.gz unzips to docname_abbyy, an XML file generated by ABBYY.
  • docname_jp2.zip unzips to a directory called docname_jp2, which includes a number of documents in the format docname_####.jp2.
  • The scandata has hopefully marked up one leaf as 'Cover'. Failing that, we will use the first leaf marked 'Title', and failing that, the first leaf marked 'Normal'.
  • There is a single global metadata manifest file for the entire item named {item_identifier}_meta.xml.
  • All of the other book-specific files follow the form {item_bookpath}_{file}, e.g. {item_bookpath}_abbyy.gz
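
The naming rules and the cover fallback described above can be sketched as follows. This is an illustration only, not the package's code: both function names and the dictionary keys are hypothetical, and the page types are assumed to have been read from the scandata already as a list of strings, one per leaf.

```python
from pathlib import Path


def book_files(item_dir, item_identifier, item_bookpath):
    """Build the expected paths for one book inside an item."""
    item_dir = Path(item_dir)
    return {
        # One global metadata manifest per item.
        "meta": item_dir / f"{item_identifier}_meta.xml",
        # Book-specific files are prefixed with item_bookpath.
        "abbyy": item_dir / f"{item_bookpath}_abbyy.gz",
        "images": item_dir / f"{item_bookpath}_jp2.zip",
    }


def pick_cover_leaf(page_types):
    """Return the index of the leaf to use as the cover, or None.

    Prefers the first 'Cover' leaf, then the first 'Title' leaf,
    then the first 'Normal' leaf.
    """
    for wanted in ("Cover", "Title", "Normal"):
        if wanted in page_types:
            return page_types.index(wanted)
    return None


files = book_files("/items/example", "example", "example")
print(files["meta"].name, files["abbyy"].name, files["images"].name)
print(pick_cover_leaf(["Normal", "Title", "Normal"]))
```

For a simple item containing a single book, item_identifier and item_bookpath are usually the same string, so all three file names share one prefix.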

Further Reading

Module documentation is available at Read The Docs.

Contribute
