DAISY2.02 audiobooks with TOC PageList conversion to Readium WebPub Manifest
- node --version => 18 (or greater)
- npm --version => 9 (or greater)
npm install json-diff --global
npm install r2-shared-js --global
npm install r2-streamer-js --global
In case of filesystem permission failures, try with sudo on Linux and Mac, or on Windows try opening the shell with "run as administrator" (sometimes --unsafe-perm=true helps too).
Verify installed "binaries" (i.e. globally-available NodeJS scripts):
- which r2-shared-js-cli => /usr/local/bin/r2-shared-js-cli (for example, on Mac)
- which r2-streamer-js-server => /usr/local/bin/r2-streamer-js-server (for example, on Mac)
Note that a future revision of the CLI utilities will include a UNIX "shebang" at the top of the JS file in order to automatically invoke the Node executable. See below for examples of how to start the scripts.
Assuming some EPUB files are present inside a folder path (replace PATH_TO_EPUB_FOLDER with your own filesystem location, which can be absolute or relative to the current pwd folder):
- DEBUG=r2:* node /usr/local/bin/r2-streamer-js-server PATH_TO_EPUB_FOLDER (note that DEBUG=r2:* is optional, but useful to display runtime information in the console; for even more verbosity, use DEBUG=*)
- Open a web browser with URL http://127.0.0.1:3000 (as indicated in the console)
- Click on any blue link at the top of the page (each link corresponds to an EPUB file discovered inside the folder, but note that subfolders are not scanned by this simple server demo / test CLI)
- Click on the ./manifest.json/show/all link; this will display the Readium WebPub Manifest with clickable links to resources (images, CSS, HTML, etc.)
- Note that the http://127.0.0.1:3000/pub/_ID_/manifest.json URL endpoint (without /show/all) serves the raw JSON resource, which is probably what a real-world deployment would use. The /show/all URL is here to facilitate debugging / exploration of Readium WebPub Manifest JSON.
- In the above URL, the _ID_ token represents the "unique identifier" of the publication served by the streamer software component. This is not dc:identifier / ISBN / UUID, etc.; it is in fact the base64 encoding of the publication's filepath.
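As a rough illustration of the last point, the snippet below derives an identifier from a file path and builds the matching manifest URL. It is only a sketch. It assumes the identifier is the plain base64 form of the UTF-8 path, percent-encoded for URL safety, and the exact encoding applied by the streamer may differ. The helper names and the sample path are hypothetical.

```javascript
// Sketch only: helper names and the sample path are hypothetical.
// Assumption: the ID is the base64 form of the UTF-8 file path,
// percent-encoded so that "/", "+" and "=" are safe in a URL segment.
function publicationId(filePath) {
  const b64 = Buffer.from(filePath, "utf8").toString("base64");
  return encodeURIComponent(b64);
}

// Reverse operation, handy when reading an ID back from a URL or a log line.
function filePathFromId(id) {
  return Buffer.from(decodeURIComponent(id), "base64").toString("utf8");
}

function manifestUrl(origin, filePath) {
  return `${origin}/pub/${publicationId(filePath)}/manifest.json`;
}

const examplePath = "/books/BOOK.epub";
console.log(manifestUrl("http://127.0.0.1:3000", examplePath));
console.log(filePathFromId(publicationId(examplePath)) === examplePath); // true
```

Because the identifier is derived from the path, moving or renaming a file changes its URL.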
A production deployment of r2-streamer-js would typically not use the built-in CLI as-is (i.e. https://github.com/readium/r2-streamer-js/blob/develop/src/http/server-cli.ts ). Instead, a smarter CLI should be implemented to meet real-world needs. The core server runtime can be created with the following lines of code:
const server = new Server({
// options
});
server.preventRobots(); // for example
server.addPublications(files); // <=== this can be called any time after the server starts (incremental add/remove of publications, cache management)
const url = await server.start(0, false);
The CLI that ships with the r2-streamer-js NPM package now includes a "file watcher" mode. This is activated when the environment variable STREAMER_WATCH is set to value 1. For example, on macOS:
DEBUG=r2:* STREAMER_WATCH=1 node /usr/local/bin/r2-streamer-js-server PATH_TO_EPUB_FOLDER
When "watch" mode is activated, publication files added inside the PATH_TO_EPUB_FOLDER folder will automatically be served by the streamer instance. In other words, there is no need to restart the server when the contents of the PATH_TO_EPUB_FOLDER folder change. Removed files will also be reflected automatically in the streamer instance. Note that renaming a file effectively consists of removing the previous filename and adding the new filename. In fact, the same remove+add sequence occurs when a file is updated, so that any cached publication manifest in the streamer is invalidated before being served again.
The console debug messages (text printed to the shell's standard output) include useful information to verify that file adding/removing is detected correctly by the live streamer process.
Note that if the server is started with STREAMER_WATCH=1, then the file watcher can subsequently be paused/resumed simply by changing the process environment variable at runtime (i.e. STREAMER_WATCH=0 to pause file detection, STREAMER_WATCH=1 to resume). Naturally, if files are removed from the filesystem while the watcher is in the "paused" state, these files are not removed from the active streamer instance, and consequently any further attempt to access the publications' resources will result in failure. In other words, resuming the watcher with STREAMER_WATCH=1 will not magically synchronise the state of the filesystem with the internal streamer state.
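The watcher behaviour described above can be sketched as follows. This is not the CLI's actual code: the event names ("added", "removed", "changed"), the onFileEvent function, and the in-memory streamer stand-in are all hypothetical, chosen only to show the remove+add sequence and what happens to events that occur while the watcher is paused.

```javascript
// Hypothetical stand-in for the streamer: it only records what is served.
const served = new Set();
const streamer = {
  addPublications: (files) => files.forEach((f) => served.add(f)),
  removePublications: (files) => files.forEach((f) => served.delete(f)),
};

// Re-evaluated on every event, so the variable can be flipped at runtime.
function watcherActive() {
  return process.env.STREAMER_WATCH === "1";
}

// `kind` is "added", "removed" or "changed" (names chosen for this sketch).
function onFileEvent(kind, file) {
  if (!watcherActive()) {
    return; // paused: the event is dropped, not queued for later
  }
  if (kind === "removed" || kind === "changed") {
    streamer.removePublications([file]); // discards any cached manifest
  }
  if (kind === "added" || kind === "changed") {
    streamer.addPublications([file]);
  }
}

process.env.STREAMER_WATCH = "1";
onFileEvent("added", "BOOK.epub");
process.env.STREAMER_WATCH = "0";
onFileEvent("removed", "BOOK.epub"); // missed while paused
process.env.STREAMER_WATCH = "1";
console.log(served.has("BOOK.epub")); // true: stale entry survives the resume
```

A deployment that needs to recover from a pause would have to rescan the folder itself and reconcile the result with the list of served publications.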
There is special treatment for filenames that end in _manifest.json (e.g. BOOK.epub_manifest.json), as these indicate that they are associated with the original publication filename (e.g. BOOK.epub_manifest.json ==> BOOK.epub). In this example, when BOOK.epub is deleted from or renamed in the PATH_TO_EPUB_FOLDER folder, BOOK.epub_manifest.json (if it exists) is also automatically removed from the streamer instance (i.e. it is not served anymore). Conversely, if BOOK.epub re-appears in the publication folder, and there is a BOOK.epub_manifest.json file present in the folder, then both are added to the streamer instance. Finally, when a file named BOOK.epub_manifest.json appears in the folder, it is only added to the streamer instance if its associated publication BOOK.epub exists on the filesystem (the same check is made on initial server startup, to avoid serving "orphan" externalised _manifest.json files). At first glance, it would seem redundant to serve both the original/untouched publication and its externalised manifest.json, but in fact this could be useful to separately stream an original EPUB as well as its tweaked external manifest.json (e.g. modified metadata).
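The association rule and the "orphan" check can be illustrated with a small helper that works on a plain list of filenames. This is a minimal sketch under that simplification, not the streamer's implementation, and both function names are hypothetical.

```javascript
const MANIFEST_SUFFIX = "_manifest.json";

// "BOOK.epub_manifest.json" -> "BOOK.epub"; returns null for any other file.
function associatedPublication(fileName) {
  if (!fileName.endsWith(MANIFEST_SUFFIX)) {
    return null;
  }
  return fileName.slice(0, -MANIFEST_SUFFIX.length);
}

// Keep every publication, and keep an externalised manifest only when the
// publication it points to is present in the same listing.
function withoutOrphanManifests(fileNames) {
  const present = new Set(fileNames);
  return fileNames.filter((name) => {
    const publication = associatedPublication(name);
    return publication === null || present.has(publication);
  });
}

console.log(withoutOrphanManifests([
  "BOOK.epub",
  "BOOK.epub_manifest.json",
  "GONE.epub_manifest.json", // orphan: GONE.epub is not in the listing
]));
// [ 'BOOK.epub', 'BOOK.epub_manifest.json' ]
```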
Note that for technical and historical design reasons, the CLI utility shipped with the r2-streamer-js NPM package supports only a "flat" publication folder (i.e. not deep / recursive) when the server is initially started. However, the file watcher itself will detect adding/removing/renaming/changing files deep inside the publication folder. Any file that ends with a supported extension (currently: .epub, .epub3, .cbz, .audiobook, .lcpaudiobook, .lcpa, .divina, .lcpdivina) or with the special _manifest.json suffix will be included in the watcher detection. Files that end with extension .zip or .daisy are not directly supported by the streamer for "just in time" processing, as they require expensive "ahead of time" transformation using the r2-shared-js CLI utility.
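A filter equivalent to the rule above might look like the following sketch. The extension list is copied from the paragraph. The function name is hypothetical, and the case-insensitive comparison is an assumption made for this example.

```javascript
// Extensions listed above as eligible for "just in time" streaming.
const SUPPORTED_EXTENSIONS = [
  ".epub", ".epub3", ".cbz", ".audiobook",
  ".lcpaudiobook", ".lcpa", ".divina", ".lcpdivina",
];
const EXTERNAL_MANIFEST_SUFFIX = "_manifest.json";

// Assumption: matching ignores letter case (e.g. "BOOK.EPUB" is accepted).
function isWatchedFile(fileName) {
  const lower = fileName.toLowerCase();
  return lower.endsWith(EXTERNAL_MANIFEST_SUFFIX) ||
    SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

console.log(isWatchedFile("BOOK.epub"));              // true
console.log(isWatchedFile("book.zip"));               // false: needs ahead-of-time conversion
console.log(isWatchedFile("book.zip_manifest.json")); // true: externalised manifest
```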
Finally, note that removing publication files causes an attempt to dispose of filesystem resources (e.g. closing the underlying ZIP file handle). Existing HTTP connections to the publication resources will therefore fail (with streaming audio/video media, the client-side behaviour will depend on the state of the playback buffer, on HTTP caching, on the sequence of HTTP 1.1 byte range requests, etc.).
Assuming some DAISY2.02 audio-only (see note below) publications are present inside a folder path (replace PATH_TO_DAISY_FOLDER with your own filesystem location, which can be absolute or relative to the current pwd folder):
NOTE: since v1.0.69, full synchronized text/audio publications are supported too (see https://github.com/readium/r2-shared-js/blob/develop/CHANGELOG.md#1069 )
- DEBUG=r2:* node /usr/local/bin/r2-shared-js-cli PATH_TO_DAISY_FOLDER/book.zip PATH_TO_DAISY_FOLDER generate-daisy-audio-manifest-only (note that DEBUG=r2:* is optional, but useful to display runtime information in the console; for even more verbosity, use DEBUG=*)
In the above example, PATH_TO_DAISY_FOLDER/book.zip refers to a zipped DAISY fileset, but the command works with exploded / unzipped contents too:
DEBUG=r2:* node /usr/local/bin/r2-shared-js-cli PATH_TO_DAISY_FOLDER/book/ PATH_TO_DAISY_FOLDER generate-daisy-audio-manifest-only
When the DEBUG flag is used, the console displays the following messages in case of success:
- DAISY audio only book => manifest-audio.json (generateDaisyAudioManifestOnly ***_manifest.json)
- generateDaisyAudioManifestOnly OK: PATH/TO/*_manifest.json
- DAISY-EPUB-RWPM done.
The Readium WebPub Manifest JSON files are named after the original DAISY filename, for example: book.zip_manifest.json for the zipped fileset, or book_manifest.json for the unzipped folder. This file naming convention is critical: the DAISY and JSON file names must be kept in sync.
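For scripting around the CLI, the naming convention can be expressed with Node's path module. This is a small sketch, and the function names are hypothetical. It also assumes that the JSON file is written into the destination folder passed as the second CLI argument.

```javascript
const path = require("path");

// "dir/book.zip" -> "book.zip_manifest.json"
// "dir/book/"    -> "book_manifest.json" (the trailing slash is ignored)
function daisyManifestName(daisyPath) {
  return `${path.basename(daisyPath)}_manifest.json`;
}

// Assumption: the manifest lands directly inside the destination folder.
function expectedManifestPath(daisyPath, destinationFolder) {
  return path.join(destinationFolder, daisyManifestName(daisyPath));
}

console.log(daisyManifestName("PATH_TO_DAISY_FOLDER/book.zip")); // book.zip_manifest.json
console.log(daisyManifestName("PATH_TO_DAISY_FOLDER/book/"));    // book_manifest.json
console.log(expectedManifestPath("PATH_TO_DAISY_FOLDER/book.zip", "PATH_TO_DAISY_FOLDER"));
```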
Note that the generate-daisy-audio-manifest-only command line parameter cannot be used with text-only publications, since they contain no audio. When this CLI parameter is omitted, a full .webpub zipped publication is generated in the destination folder instead of just the JSON manifest. The full conversion process involves renaming files from the original DAISY fileset (notably, XML vs. HTML vs. XHTML file extensions), and other files are created too (notably, DAISY3 DTBOOK to XHTML, or SMIL to XHTML). With audio-only books, the generated .webpub archive can be unzipped to reveal both manifest.json (i.e. the default one, which relies on EPUB3 Media Overlays SMIL in order to preserve the phrase-level DAISY navigation) and manifest-audio.json (i.e. the simplified audiobook with reading order, TOC, pagelist, but no phrase-level navigation). A .webpub file can directly be ingested by the streamer via server.addPublications().
Now, simply start the "r2-streamer-js" test server inside the folder that contains the generated JSON files and original DAISY filesets, in order to demonstrate them working together:
DEBUG=r2:* node /usr/local/bin/r2-streamer-js-server PATH_TO_DAISY_FOLDER
The CLI offers an easy way to test the server, but in a real-world scenario the server.addPublications(files) JavaScript function would be called after the server is started to enable the on-demand streaming of the Readium WebPub Manifest JSON. For example, with server.addPublications([PATH_TO_JSON_FILE]) the streamer will automatically find the corresponding original DAISY book based on the common root filename.
Note that the current r2-streamer-js implementation does not provide an out-of-the-box caching / memory management solution. It is therefore recommended to write additional logic based on server.removePublications(files) or server.uncachePublication(file) in order to ensure that the streamer runtime does not allocate unnecessary memory, and does not keep filesystem handles open during access to zipped publications or unzipped folders.
See:
https://github.com/readium/r2-streamer-js/blob/a2faa6140074418fc354bca792023b387cb837a3/src/http/server.ts#L304-L332
and:
https://github.com/readium/r2-streamer-js/blob/a2faa6140074418fc354bca792023b387cb837a3/src/http/server.ts#L338-L411
Point of interest: Thorium (the desktop app) is currently the most active user of the streamer software component, which powers the application's publication backend service. The server is started and killed automatically based on whether or not publications are opened. Publications are removed from the streamer's internal cache as soon as all opened windows are closed by the user. Naturally, this memory management strategy isn't applicable to a real network client/server context, but it shows versatility. A future revision of r2-streamer-js might include an off-the-shelf cache invalidation strategy, such as least-recently-used / time window. Suggestions welcome! :)
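Until such a strategy ships, a least-recently-used policy can be layered on top of the streamer. The class below is a minimal sketch of the bookkeeping only. It is a hypothetical helper, not part of r2-streamer-js, and it leaves the actual eviction to a callback in which a deployment would call server.uncachePublication(file) or server.removePublications([file]).

```javascript
// Tracks publication file paths in least-recently-used order and calls
// `onEvict` whenever the limit is exceeded. Hypothetical helper.
class PublicationLru {
  constructor(maxOpen, onEvict) {
    this.maxOpen = maxOpen;
    this.onEvict = onEvict;
    this.entries = new Map(); // a Map iterates in insertion order
  }

  // Call on every access to a publication (e.g. from an HTTP middleware).
  touch(filePath) {
    this.entries.delete(filePath); // re-insert to mark as most recent
    this.entries.set(filePath, Date.now());
    while (this.entries.size > this.maxOpen) {
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
      this.onEvict(oldest);
    }
  }
}

const evicted = [];
const lru = new PublicationLru(2, (file) => {
  // A real deployment would release the publication here, for example
  // with server.uncachePublication(file).
  evicted.push(file);
});
["a.epub", "b.epub", "a.epub", "c.epub"].forEach((f) => lru.touch(f));
console.log(evicted); // [ 'b.epub' ]
```

A time-window policy could reuse the stored timestamps by evicting entries older than a chosen age on a timer.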
The streamer now supports a special opt-in mode, which is enabled by default when launching the server via the bundled command line interface; see the enableSignedExpiry option in the Server constructor.
When enabled, this feature will look for a runtime environment variable in the NodeJS process called R2_STREAMER_URL_EXPIRE_SECRET. As its name implies, this provides a "secret" string of characters specific to the deployed and running streamer instance. This value should of course NEVER be committed to a public code repository or service documentation. Note that the environment variable is evaluated just-in-time at runtime and is not cached, so it can be changed while the server is running. Doing so immediately "revokes" (i.e. breaks) all existing signatures / already-served URLs, thereby short-circuiting their auto-expiry. There are no constraints on the number of characters in the arbitrary secret string, as a fixed-length SHA256 hash digest is derived from it. This digest is used internally as a "salt" for further signature computations.
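The fixed-length property can be checked with Node's built-in crypto module. The function below is a sketch with a hypothetical name, and the secret values are placeholders for demonstration only. It reads the environment variable on every call, which mirrors the just-in-time evaluation described above.

```javascript
const crypto = require("crypto");

// Reads the secret on every call, so a change to the environment variable
// takes effect immediately. Returns null when no secret is configured.
function currentSecretDigest() {
  const secret = process.env.R2_STREAMER_URL_EXPIRE_SECRET;
  if (!secret) {
    return null;
  }
  return crypto.createHash("sha256").update(secret, "utf8").digest("hex");
}

// Placeholder values, for demonstration only.
process.env.R2_STREAMER_URL_EXPIRE_SECRET = "short";
const first = currentSecretDigest();
process.env.R2_STREAMER_URL_EXPIRE_SECRET = "a much longer secret string, any length works";
const second = currentSecretDigest();

console.log(first.length, second.length); // 64 64 (hex characters)
console.log(first === second);            // false
```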
A second optional environment variable can be provided: R2_STREAMER_URL_EXPIRE_SECONDS, which will override the default expiry of 86400 seconds (= 24h). For example, the test server deployed to Heroku demonstrates a really short expiry of 5 seconds:
https://github.com/readium/r2-streamer-js/blob/develop/Procfile
(to test this, press F5 to refresh the web browser page every second on any given publication resource, such as the cover image, an HTML file, etc.: https://readium2.herokuapp.com )
When enabled, this optional "signed expiry" feature automatically adds a URL query parameter named r2tkn (short for "Readium2 token") to all href references present in any publication manifest.json served by the streamer. Note that any new HTTP request to a publication's manifest.json causes the short-lived tokens to be refreshed, but doesn't invalidate previous ones. Effectively, there is no server-side cache or database map of client sessions: the signed tokens simply auto-expire, or can be revoked en masse by changing the aforementioned secret environment variable R2_STREAMER_URL_EXPIRE_SECRET.
A typical use-case would be that the HTTP route / API endpoint to manifest.json would be authenticated (e.g. http://127.0.0.1:3000/pub/_ID_/manifest.json ), whereas publication resources such as audio files would be openly streamed to "authorised" (but not authenticated) client terminals such as smart speakers (e.g. http://127.0.0.1:3000/pub/_ID_/path/to/audio.mp3?r2tkn=... ). For an audio book, an expiry time window of 24h seems sufficient, but this can of course be configured via the aforementioned environment variable R2_STREAMER_URL_EXPIRE_SECONDS.
Note that only resources listed in a publication's manifest.json receive the special &r2tkn=... URL query parameter, and client consumers are expected to use these URLs directly. For example, this means that HTML resources in a publication's reading order can be referenced directly from the manifest.json, but files linked from the HTML document (e.g. CSS, JavaScript, images, etc.) do not automatically inherit the additional URL query parameter, even if these resources are duly listed in the manifest.json (in other words, there is no full URL rewrite across all served documents, which would require parsing CSS, HTML, JS on-the-fly). This optional "signed expiry" feature is therefore suitable for a limited number of use-cases, notably the audio streaming example.
Also note that individual resources in a given publication's manifest.json all receive a different &r2tkn=... URL query parameter, as this is bound to the resource path in the publication (i.e. not just the timestamp at which the publication manifest is generated). For example, in the test Heroku server this can be observed by pressing F5 repeatedly for the manifest.json HTTP route (see how the token differs for every individual resource). This prevents "replay" attacks where URL pathnames could be guessed (e.g. /path/to/audio1.mp3, /path/to/audio2.mp3, etc.) and automatically "syphoned" / batch-downloaded if they all shared the same signed token within a given publication manifest.json.
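To make the properties above concrete (per-resource tokens, auto-expiry without server-side state, revocation by changing the secret), here is a self-contained sketch built on HMAC-SHA256. It is an illustration of the general technique and not the algorithm used by r2-streamer-js. The token layout, the function names, and the secret values are all assumptions made for this example.

```javascript
const crypto = require("crypto");

// Key derived just in time from the secret, as a SHA256 digest.
function signingKey() {
  return crypto.createHash("sha256")
    .update(process.env.R2_STREAMER_URL_EXPIRE_SECRET || "", "utf8")
    .digest();
}

// Expiry window in seconds, falling back to the documented default of 24h.
function lifetimeSeconds() {
  const n = parseInt(process.env.R2_STREAMER_URL_EXPIRE_SECONDS || "", 10);
  return Number.isFinite(n) && n > 0 ? n : 86400;
}

function sign(resourcePath, expiresAt) {
  return crypto.createHmac("sha256", signingKey())
    .update(`${resourcePath}\n${expiresAt}`)
    .digest("hex");
}

// Hypothetical token layout: "<expiry in ms>.<signature over path + expiry>".
function createToken(resourcePath, nowMs) {
  const expiresAt = nowMs + lifetimeSeconds() * 1000;
  return `${expiresAt}.${sign(resourcePath, expiresAt)}`;
}

// Stateless check: nothing is stored server-side between requests.
function verifyToken(resourcePath, token, nowMs) {
  const [expiresAtText, signature] = String(token).split(".");
  const expiresAt = Number(expiresAtText);
  if (!Number.isFinite(expiresAt) || !signature || nowMs > expiresAt) {
    return false;
  }
  const expected = Buffer.from(sign(resourcePath, expiresAt), "hex");
  const actual = Buffer.from(signature, "hex");
  return actual.length === expected.length &&
    crypto.timingSafeEqual(actual, expected);
}

// Placeholder configuration, for demonstration only.
process.env.R2_STREAMER_URL_EXPIRE_SECRET = "example-only-secret";
process.env.R2_STREAMER_URL_EXPIRE_SECONDS = "5";
const now = Date.now();
const token = createToken("/path/to/audio1.mp3", now);

console.log(verifyToken("/path/to/audio1.mp3", token, now));        // true
console.log(verifyToken("/path/to/audio2.mp3", token, now));        // false: bound to another path
console.log(verifyToken("/path/to/audio1.mp3", token, now + 6000)); // false: expired
process.env.R2_STREAMER_URL_EXPIRE_SECRET = "rotated-secret";
console.log(verifyToken("/path/to/audio1.mp3", token, now));        // false: revoked
```

Because the resource path is part of the signed message, a token copied from one URL cannot be reused on a guessed sibling URL.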