DAISY2.02 audiobooks with TOC PageList conversion to Readium WebPub Manifest

Daniel Weck edited this page Aug 4, 2023 · 12 revisions

Prerequisites:

  1. node --version => 18 (or greater)
  2. npm --version => 9 (or greater)

https://nodejs.org

Installation:

  1. npm install json-diff --global
  2. npm install r2-shared-js --global
  3. npm install r2-streamer-js --global

In case of filesystem permission failures, try with sudo on Linux and Mac, or on Windows try opening the shell with "run as administrator" (sometimes --unsafe-perm=true helps too).

Verify installed "binaries" (i.e. globally-available NodeJS scripts):

  1. which r2-shared-js-cli => /usr/local/bin/r2-shared-js-cli (for example, on Mac)
  2. which r2-streamer-js-server => /usr/local/bin/r2-streamer-js-server (for example, on Mac)

Note that a future revision of the CLI utilities will include a UNIX "shebang" at the top of the JS file in order to automatically invoke the Node executable. See below for an example of how to start the scripts.

Test the "r2-streamer-js" server

Assuming some EPUB files are present inside a folder path (replace PATH_TO_EPUB_FOLDER with your own filesystem location, which can be absolute or relative to the current pwd folder):

  • DEBUG=r2:* node /usr/local/bin/r2-streamer-js-server PATH_TO_EPUB_FOLDER (note that DEBUG=r2:* is optional, but useful to display runtime information in the console ... for even more verbosity, use DEBUG=*)
  • Open a web browser with URL http://127.0.0.1:3000 (as indicated in the console)
  • Click on any blue link at the top of the page (each link corresponds to an EPUB file discovered inside the folder, but note that subfolders are not scanned by this simple server demo / test CLI)
  • Click on the ./manifest.json/show/all link, this will display the Readium WebPub Manifest with clickable links to resources (images, CSS, HTML, etc.)
  • Note that the http://127.0.0.1:3000/pub/_ID_/manifest.json URL endpoint (without /show/all) serves the raw JSON resource, which is probably what a real world deployment would use. The /show/all URL is here to facilitate debugging / exploration of Readium WebPub Manifest JSON.
  • In the above URL, the _ID_ token represents the "unique identifier" of the publication served by the streamer software component. This is not dc:identifier / ISBN / UUID, etc., this is in fact the base64 encoding of the publication's filepath.
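To make the _ID_ token more concrete, here is a minimal sketch of how a filepath could be turned into such an identifier and back. The helper names are hypothetical, and the exact base64 variant and URL escaping used by r2-streamer-js are assumptions here, so treat this as an illustration rather than the streamer's actual code.

```typescript
import { Buffer } from "buffer";

// Hypothetical helper: encodes a publication filepath as standard base64.
// (Assumption: the real streamer may use a different variant or extra escaping.)
function encodePublicationId(filePath: string): string {
    return Buffer.from(filePath, "utf8").toString("base64");
}

// Hypothetical helper: recovers the filepath from the identifier.
function decodePublicationId(id: string): string {
    return Buffer.from(id, "base64").toString("utf8");
}

const id = encodePublicationId("/books/BOOK.epub");

// Standard base64 can contain "/", "+" and "=", so the identifier is
// percent-encoded before it is placed in the URL path.
const manifestUrl = `http://127.0.0.1:3000/pub/${encodeURIComponent(id)}/manifest.json`;

console.log(manifestUrl);
console.log(decodePublicationId(id));
```

Because the identifier is derived from the filepath alone, moving or renaming a publication changes its URL.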

A production deployment of the r2-streamer-js would typically not use the built-in CLI as-is (i.e. https://github.com/readium/r2-streamer-js/blob/develop/src/http/server-cli.ts ), but instead a smarter CLI should be implemented to meet real-world needs. The core server runtime can be created with the following lines of code:

        const server = new Server({
            // options
        });
        server.preventRobots(); // for example
        server.addPublications(files); // <=== this can be called any time after the server starts (incremental add/remove of publications, cache management)
        const url = await server.start(0, false);

See: https://github.com/readium/r2-streamer-js/blob/41a8241cb05dc8fd2c385b09506a39c081005c9c/src/http/server-cli.ts#L122-L128

File watcher mode in the "r2-streamer-js" server

The CLI that ships with the r2-streamer-js NPM package now includes a "file watcher" mode. This is activated when the environment variable STREAMER_WATCH is set to value 1. For example on macOS: DEBUG=r2:* STREAMER_WATCH=1 node /usr/local/bin/r2-streamer-js-server PATH_TO_EPUB_FOLDER. When "watch" mode is activated, publication files added inside the PATH_TO_EPUB_FOLDER folder will automatically be served by the streamer instance. In other words, there is no need to restart the server when the contents of the PATH_TO_EPUB_FOLDER folder change. Removed files will also be reflected automatically in the streamer instance. Note that renaming a file effectively consists of removing the previous filename and adding the new filename. In fact, the same remove+add sequence occurs when a file is updated, so that any cached publication manifest in the streamer is invalidated before being served again. The console debug messages (text printed to the standard shell output) include useful information to verify that file adding/removing is detected correctly by the live streamer process.

Note that if the server is started with STREAMER_WATCH=1, then the file watcher can subsequently be paused/resumed simply by changing the process environment variable at runtime (i.e. STREAMER_WATCH=0 to pause file detection, STREAMER_WATCH=1 to resume). Naturally, if files are removed from the filesystem while the watcher is in "paused" state, these files are not removed from the active streamer instance and consequently any further attempt to access the publications' resources will result in failure. In other words, resuming the watcher with STREAMER_WATCH=1 will not magically synchronise the state of the filesystem with the internal streamer state.

There is special treatment for filenames that end in _manifest.json (e.g. BOOK.epub_manifest.json), as these indicate that they are associated with the original publication filename (e.g. BOOK.epub_manifest.json ==> BOOK.epub). In this example, when BOOK.epub is deleted from or renamed in the PATH_TO_EPUB_FOLDER folder, BOOK.epub_manifest.json (if it exists) is also automatically removed from the streamer instance (i.e. it is not served anymore). Conversely, if BOOK.epub re-appears in the publication folder, and there is a BOOK.epub_manifest.json file present in the folder, then both are added to the streamer instance. Finally, when a file named BOOK.epub_manifest.json appears in the folder, it is only added to the streamer instance if its associated publication BOOK.epub exists on the filesystem (the same check is made on initial server startup, to avoid serving "orphan" externalised _manifest.json files). At first glance, it would seem redundant to serve both the original/untouched publication and its externalised manifest.json, but in fact this could be useful to separately stream an original EPUB as well as its tweaked external manifest.json (e.g. modified metadata).

Note that for technical and historical design reasons, the CLI utility shipped with the r2-streamer-js NPM package supports only a "flat" publication folder (i.e. not deep / recursive) when the server is initially started. However, the file watcher itself will detect adding/removing/renaming/changing files deep inside the publication folder. Any file that ends with a supported extension (currently: .epub, .epub3, .cbz, .audiobook, .lcpaudiobook, .lcpa, .divina, .lcpdivina) or with the special _manifest.json suffix will be included in the watcher detection. Files that end with extension .zip or .daisy are not directly supported by the streamer for "just in time" processing, as they require expensive "ahead of time" transformation using the r2-shared-js CLI utility.

Finally, note that removing publication files causes an attempt to dispose of filesystem resources (e.g. close the underlying ZIP file handle). Existing HTTP connections to the publication resources will therefore fail (with streaming audio/video media, the client-side behaviour will depend on the state of the playback buffer, on HTTP caching, on the sequence of HTTP 1.1 byte range requests, etc.).
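The file selection rules described above can be sketched as a few small predicates. This is not the watcher code from r2-streamer-js: the function names are hypothetical, the extension list is copied from the text above, and case-sensitive matching is an assumption.

```typescript
import { existsSync } from "fs";

// Extensions listed above as supported for "just in time" processing.
const SUPPORTED_EXTENSIONS = [
    ".epub", ".epub3", ".cbz", ".audiobook",
    ".lcpaudiobook", ".lcpa", ".divina", ".lcpdivina",
];
const MANIFEST_SUFFIX = "_manifest.json";

// Hypothetical: true when the watcher would consider this file at all.
// (Assumption: matching is case-sensitive.)
function isWatchedFile(filePath: string): boolean {
    return filePath.endsWith(MANIFEST_SUFFIX) ||
        SUPPORTED_EXTENSIONS.some((ext) => filePath.endsWith(ext));
}

// Hypothetical: BOOK.epub_manifest.json ==> BOOK.epub
function associatedPublication(manifestPath: string): string | undefined {
    if (!manifestPath.endsWith(MANIFEST_SUFFIX)) {
        return undefined;
    }
    return manifestPath.slice(0, -MANIFEST_SUFFIX.length);
}

// Hypothetical: an externalised manifest is only served when the original
// publication exists, which avoids "orphan" _manifest.json files.
function shouldServeManifest(
    manifestPath: string,
    exists: (p: string) => boolean = existsSync,
): boolean {
    const publication = associatedPublication(manifestPath);
    return publication !== undefined && exists(publication);
}

console.log(isWatchedFile("/pubs/BOOK.epub"));                        // true
console.log(isWatchedFile("/pubs/book.zip"));                         // false
console.log(associatedPublication("/pubs/BOOK.epub_manifest.json"));  // /pubs/BOOK.epub
```

The `exists` parameter defaults to a real filesystem check but can be swapped out, which keeps the orphan rule easy to test without touching the disk.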

Try the "r2-shared-js" CLI

Assuming some DAISY2.02 audio-only (see note below) publications are present inside a folder path (replace PATH_TO_DAISY_FOLDER with your own filesystem location, which can be absolute or relative to the current pwd folder):

NOTE: since v1.0.69, full synchronized text/audio publications are supported too (see https://github.com/readium/r2-shared-js/blob/develop/CHANGELOG.md#1069 )

  • DEBUG=r2:* node /usr/local/bin/r2-shared-js-cli PATH_TO_DAISY_FOLDER/book.zip PATH_TO_DAISY_FOLDER generate-daisy-audio-manifest-only (note that DEBUG=r2:* is optional, but useful to display runtime information in the console ... for even more verbosity, use DEBUG=*)

In the above example, PATH_TO_DAISY_FOLDER/book.zip refers to a zipped DAISY fileset, but the command works with exploded / unzipped contents too:

  • DEBUG=r2:* node /usr/local/bin/r2-shared-js-cli PATH_TO_DAISY_FOLDER/book/ PATH_TO_DAISY_FOLDER generate-daisy-audio-manifest-only

When the DEBUG flag is used, the console displays the following in case of success: DAISY audio only book => manifest-audio.json (generateDaisyAudioManifestOnly ***_manifest.json) and generateDaisyAudioManifestOnly OK: PATH/TO/*_manifest.json and DAISY-EPUB-RWPM done.

The Readium WebPub Manifest JSON files are named after the original DAISY filename, for example: book.zip_manifest.json, or book_manifest.json with the unzipped folder. This file naming convention is critical: the DAISY and JSON file names must be kept in sync.
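As an illustration of the naming convention, the following sketch computes the manifest path one would expect for a given DAISY input and output folder. The function name is hypothetical and the logic is inferred only from the two examples above, not from the r2-shared-js source.

```typescript
import * as path from "path";

// Hypothetical helper: derives the expected manifest location from the
// DAISY input (zip file or unzipped folder) and the CLI output folder.
function expectedManifestPath(daisyInput: string, outputFolder: string): string {
    // path.basename ignores a trailing separator, so "book/" yields "book".
    const baseName = path.basename(daisyInput);
    return path.join(outputFolder, `${baseName}_manifest.json`);
}

console.log(expectedManifestPath("PATH_TO_DAISY_FOLDER/book.zip", "PATH_TO_DAISY_FOLDER"));
console.log(expectedManifestPath("PATH_TO_DAISY_FOLDER/book/", "PATH_TO_DAISY_FOLDER"));
```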

Note that the generate-daisy-audio-manifest-only command line parameter cannot be used with text-only publications, for obvious reasons. When this CLI parameter is omitted, a full .webpub zipped publication is generated in the destination folder instead of just the JSON manifest. The full conversion process involves renaming files from the original DAISY fileset (notably, XML vs. HTML vs. XHTML file extensions), and other files are created too (notably, DAISY3 DTBOOK to XHTML, or SMIL to XHTML). With audio-only books, the generated .webpub archive can be unzipped to reveal both manifest.json (i.e. the default one which relies on EPUB3 Media Overlays SMIL in order to preserve the phrase-level DAISY navigation) and manifest-audio.json (i.e. the simplified audiobook with reading order, TOC, pagelist, but no phrase-level navigation). A .webpub file can directly be ingested by the streamer via server.addPublications().

Wrapping up

Now, simply start the "r2-streamer-js" test server inside the folder that contains the generated JSON files and original DAISY filesets, in order to demonstrate them working together: DEBUG=r2:* node /usr/local/bin/r2-streamer-js-server PATH_TO_DAISY_FOLDER. The CLI offers an easy way to test the server, but in a real-world scenario the server.addPublications(files) JavaScript function would be called after the server is started to enable the on-demand streaming of the Readium WebPub Manifest JSON. For example, after a call to server.addPublications([PATH_TO_JSON_FILE]), the streamer will automatically find the corresponding original DAISY book based on the common root filename.

Note that the current r2-streamer-js implementation does not provide an out-of-the-box caching / memory management solution. It is therefore recommended to write additional logic based on server.removePublications(files) or server.uncachePublication(file) in order to ensure that the streamer runtime does not allocate unnecessary memory, and does not keep filesystem handles open during access to zipped publications or unzipped folders. See: https://github.com/readium/r2-streamer-js/blob/a2faa6140074418fc354bca792023b387cb837a3/src/http/server.ts#L304-L332 and: https://github.com/readium/r2-streamer-js/blob/a2faa6140074418fc354bca792023b387cb837a3/src/http/server.ts#L338-L411

Point of interest: Thorium (the desktop app) is currently the most active user of the streamer software component, which powers the application's publication backend service. The server is started and killed automatically based on whether or not publications are opened. Publications are removed from the streamer's internal cache as soon as all opened windows are closed by the user. Naturally, this memory management strategy isn't applicable to a real network client/server context, but it shows versatility. A future revision of r2-streamer-js might include an off-the-shelf cache invalidation strategy, such as least-recently-used / time window. Suggestions welcome! :)
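As one possible shape for the least-recently-used strategy mentioned above, here is a minimal sketch of a tracker that evicts the oldest publication once a limit is exceeded. It is not part of r2-streamer-js: the class and interface names are hypothetical, and the only thing assumed about the server is that it exposes uncachePublication(file) as referenced earlier (its exact signature is an assumption).

```typescript
// Assumed shape of the relevant part of the streamer server (hypothetical).
interface UncacheTarget {
    uncachePublication(filePath: string): void;
}

class LruPublicationTracker {
    // A Set preserves insertion order, so the first entry is the least recently used.
    private readonly recent = new Set<string>();
    private readonly target: UncacheTarget;
    private readonly maxCached: number;

    constructor(target: UncacheTarget, maxCached: number) {
        this.target = target;
        this.maxCached = maxCached;
    }

    // Call this whenever a publication is requested.
    touch(filePath: string): void {
        this.recent.delete(filePath); // move to the "most recent" end
        this.recent.add(filePath);
        while (this.recent.size > this.maxCached) {
            const oldest = this.recent.values().next().value;
            if (oldest === undefined) {
                break;
            }
            this.recent.delete(oldest);
            this.target.uncachePublication(oldest);
        }
    }
}

// Usage with a stand-in target that only logs what would be uncached.
const tracker = new LruPublicationTracker(
    { uncachePublication: (f) => console.log(`uncache ${f}`) },
    2,
);
tracker.touch("a.epub");
tracker.touch("b.epub");
tracker.touch("c.epub"); // logs "uncache a.epub"
```

In a real deployment, touch() would be called from the request handling path, and a time-window policy could be layered on top by storing timestamps alongside the paths.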

Short-lived resource URLs (query parameter, signed expiry)

The streamer now supports a special opt-in mode, which is enabled by default when launching the server via the bundled command line interface. See the enableSignedExpiry option in the Server constructor:

https://github.com/readium/r2-streamer-js/blob/58b83ffa33da626faefa9d4f906e69ef27f935bd/src/http/server-cli.ts#L115

When enabled, this feature looks for a runtime environment variable in the NodeJS process called R2_STREAMER_URL_EXPIRE_SECRET. As its name implies, this provides a "secret" string of characters specific to the deployed and running streamer instance. This value should of course NEVER be committed to a public code repository or service documentation. Note that the environment variable is evaluated just-in-time at runtime and is not cached, so it can be changed while the server is running, which immediately "revokes" (i.e. breaks) all existing signatures / already-served URLs (thereby short-circuiting their auto-expiry). There are no constraints on the number of characters in the arbitrary secret string, as a fixed-length SHA256 hash digest is derived from it. This digest is used internally as a "salt" for further signature computations.

A second optional environment variable can be provided: R2_STREAMER_URL_EXPIRE_SECONDS, which will override the default expiry of 86400 seconds (= 24h). For example, the test server deployed to Heroku demonstrates a really short expiry of 5 seconds:

https://github.com/readium/r2-streamer-js/blob/develop/Procfile

(to test this, press F5 to refresh the web browser page every ticking second on any given publication resource, such as the cover image, an HTML file, etc.: https://readium2.herokuapp.com )
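The two environment variables can be read along the following lines. This is a sketch under assumptions rather than the streamer's code: the function and type names are hypothetical, and how r2-streamer-js handles a missing secret or an unparsable number of seconds is not described here, so those branches are guesses.

```typescript
import { createHash } from "crypto";

const DEFAULT_EXPIRY_SECONDS = 86400; // 24h, as stated above

interface SignedExpiryConfig {
    salt: string | undefined; // hex SHA256 digest of the secret
    expirySeconds: number;
}

// Hypothetical helper: evaluates the environment on every call, so a changed
// secret takes effect immediately (nothing is cached).
function readSignedExpiryConfig(
    env: Record<string, string | undefined> = process.env,
): SignedExpiryConfig {
    const secret = env["R2_STREAMER_URL_EXPIRE_SECRET"];
    // Any secret length maps to a fixed-length digest (64 hex characters).
    // Assumption: no secret means no salt.
    const salt = secret
        ? createHash("sha256").update(secret, "utf8").digest("hex")
        : undefined;

    const parsed = Number.parseInt(env["R2_STREAMER_URL_EXPIRE_SECONDS"] ?? "", 10);
    // Assumption: invalid or non-positive values fall back to the default.
    const expirySeconds = Number.isFinite(parsed) && parsed > 0
        ? parsed
        : DEFAULT_EXPIRY_SECONDS;

    return { salt, expirySeconds };
}

console.log(readSignedExpiryConfig({
    R2_STREAMER_URL_EXPIRE_SECRET: "example-secret",
    R2_STREAMER_URL_EXPIRE_SECONDS: "5",
}).expirySeconds); // 5
```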

When enabled, this optional "signed expiry" feature automatically adds a URL query parameter named r2tkn (short for "Readium2 token") to all href references present in any publication manifest.json served by the streamer. Note that any new HTTP request to a publication's manifest.json causes the short-lived tokens to be refreshed, but doesn't invalidate previous ones. Effectively, there is no server-side cache or database map of client sessions: the signed tokens simply auto-expire, or can be revoked en masse by changing the aforementioned secret environment variable R2_STREAMER_URL_EXPIRE_SECRET.

A typical use-case would be that the HTTP route / API endpoint to manifest.json would be authenticated (e.g. http://127.0.0.1:3000/pub/_ID_/manifest.json), whereas publication resources such as audio files would be openly streamed to "authorised" (but not authenticated) client terminals such as smart speakers (e.g. http://127.0.0.1:3000/pub/_ID_/path/to/audio.mp3?r2tkn=...). For an audio book, an expiry time window of 24h seems sufficient, but this can of course be configured via the aforementioned corresponding environment variable R2_STREAMER_URL_EXPIRE_SECONDS.

Note that only resources listed in a publication's manifest.json receive the special &r2tkn=... URL query parameter, and client consumers are expected to use these URLs directly. For example, this means that HTML resources in a publication's reading order can be referenced directly from the manifest.json, but files linked from the HTML document (e.g. CSS, Javascript, images, etc.) do not automatically inherit the additional URL query parameter, even if these resources are duly listed in the manifest.json (in other words, there is no full URL rewrite across all served documents, which would require parsing CSS, HTML, JS on-the-fly). This optional "signed expiry" feature is therefore suitable for a limited number of use-cases, notably the audio streaming example.

Also note that individual resources in a given publication's manifest.json all receive a different &r2tkn=... URL query parameter, as this is bound to the resource path in the publication (i.e. not just the timestamp at which the publication manifest is generated). For example, in the test Heroku server this can be observed by pressing F5 repeatedly for the manifest.json HTTP route (see how the token differs for every individual resource). This prevents "replay" attacks where URL pathnames could be guessed (e.g. /path/to/audio1.mp3, /path/to/audio2.mp3, etc.) and automatically "syphoned" / batch-downloaded if they all shared the same signed token within a given publication manifest.json.
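To illustrate why binding the token to the resource path defeats this kind of guessing, here is a generic sketch of a path-bound, expiring signature built on HMAC-SHA256. The actual r2tkn format and signing scheme of r2-streamer-js are not described on this page, so the token layout, function names and use of HMAC below are all assumptions made for illustration.

```typescript
import { createHash, createHmac, timingSafeEqual } from "crypto";

// Fixed-length key derived from an arbitrary-length secret.
function deriveKey(secret: string): Buffer {
    return createHash("sha256").update(secret, "utf8").digest();
}

function sign(key: Buffer, resourcePath: string, expiresAt: number): Buffer {
    return createHmac("sha256", key).update(`${resourcePath}\n${expiresAt}`).digest();
}

// Hypothetical token layout: "<expiry epoch seconds>.<hex signature>".
function createToken(secret: string, resourcePath: string, expirySeconds: number, nowMs: number = Date.now()): string {
    const expiresAt = Math.floor(nowMs / 1000) + expirySeconds;
    return `${expiresAt}.${sign(deriveKey(secret), resourcePath, expiresAt).toString("hex")}`;
}

// Stateless check: no session store, only the secret and the clock.
function verifyToken(secret: string, resourcePath: string, token: string, nowMs: number = Date.now()): boolean {
    const [expiresText, signatureHex] = token.split(".");
    const expiresAt = Number.parseInt(expiresText ?? "", 10);
    if (!signatureHex || !Number.isFinite(expiresAt) || expiresAt < Math.floor(nowMs / 1000)) {
        return false; // malformed or expired
    }
    const expected = sign(deriveKey(secret), resourcePath, expiresAt);
    const provided = Buffer.from(signatureHex, "hex");
    return provided.length === expected.length && timingSafeEqual(provided, expected);
}

const token = createToken("example-secret", "/path/to/audio1.mp3", 86400);
console.log(verifyToken("example-secret", "/path/to/audio1.mp3", token)); // true
console.log(verifyToken("example-secret", "/path/to/audio2.mp3", token)); // false: other path
console.log(verifyToken("changed-secret", "/path/to/audio1.mp3", token)); // false: revoked
```

Since verification needs only the secret and the current time, changing the secret invalidates every outstanding token at once, which matches the en masse revocation behaviour described above.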