How To Extend
This module was made with extensibility in mind, leveraging object-oriented programming.
To explain how to extend it, a bit of background on how the processing works is useful.
We break the ingest process into two steps:
- Preprocessing
- Ingest with derivative generation
Preprocessing consists of scanning the input to be ingested, and building up a queue in the database. The queue structure used consists of three tables:
- islandora_batch_queue: Describes the objects themselves and indicates a parent object which is also set to be ingested via batch (NULL if none). Currently, we serialize classes subclassing IslandoraBatchObject, which in turn subclasses Tuque's NewFedoraObject (though this will likely change in the near future), and keep track of their associated IDs.
- islandora_batch_state: Tracks the state of each object, based on its ID. The "state" is just a numeric value.
- islandora_batch_resources: Tracks resources with associated types for each ID. There is no strict idea of what a "resource" is, though it will probably be something like one of the filenames or a URL associated with the given object. This is not used inside the core batch code, but it can be very useful for diagnosis and recovery if/when things go wrong, or possibly leveraged to avoid ingesting the same file multiple times between different preprocessing runs.
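As a rough illustration of how the three tables relate, the sketch below models them in an in-memory SQLite database from Python. It is not the module's code (the module defines its tables through Drupal, in PHP). The column names, the `queue_object` helper, the use of JSON in place of PHP serialization, and the state value `1` are all assumptions made for illustration.

```python
import json
import sqlite3


def create_queue_tables(conn):
    """Create a simplified model of the three queue tables (assumed columns)."""
    conn.executescript(
        """
        CREATE TABLE islandora_batch_queue (
            id TEXT PRIMARY KEY,
            parent TEXT,          -- ID of a queued parent object, NULL if none
            data TEXT NOT NULL    -- serialized object
        );
        CREATE TABLE islandora_batch_state (
            id TEXT PRIMARY KEY,
            state INTEGER NOT NULL  -- just a numeric value
        );
        CREATE TABLE islandora_batch_resources (
            id TEXT NOT NULL,
            type TEXT NOT NULL,
            resource TEXT NOT NULL  -- e.g. a filename or URL
        );
        """
    )


def queue_object(conn, object_id, obj, state, resources=(), parent=None):
    """Hypothetical helper: write one object to all three tables."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO islandora_batch_queue (id, parent, data) VALUES (?, ?, ?)",
            (object_id, parent, json.dumps(obj)),
        )
        conn.execute(
            "INSERT INTO islandora_batch_state (id, state) VALUES (?, ?)",
            (object_id, state),
        )
        conn.executemany(
            "INSERT INTO islandora_batch_resources (id, type, resource) VALUES (?, ?, ?)",
            [(object_id, rtype, resource) for rtype, resource in resources],
        )


conn = sqlite3.connect(":memory:")
create_queue_tables(conn)
queue_object(conn, "demo:1", {"label": "Issue 1"}, state=1,
             resources=[("file", "/tmp/issue1/MODS.xml")])
queue_object(conn, "demo:2", {"label": "Page 1"}, state=1,
             resources=[("file", "/tmp/issue1/page1.tif")], parent="demo:1")
```

Keeping the resources in their own table is what makes later diagnosis possible: a failed object's source files can be looked up by ID without unserializing anything.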
In the preprocessing phase, it is expected that one would add all datastreams which contain files to one's instance of an IslandoraBatchObject subclass; this is not CPU intensive, as there should not be any derivatives created or communication with Fedora at this point. It is suggested that one avoid adding datastream content as strings (either using a datastream's content property, or the setContentFromString method) if it can be computed later, as strings make the serialized value in the database larger.
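To make the size argument concrete, here is a small Python model (not the module's PHP code) comparing the serialized size of an object that only references a file against one that embeds the same content as a string. `QueuedObject` and its fields are hypothetical stand-ins for an IslandoraBatchObject subclass and its datastreams, and JSON stands in for PHP serialization.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class QueuedObject:
    """Hypothetical stand-in for a queued batch object."""
    pid: str
    file_datastreams: dict = field(default_factory=dict)    # dsid -> file path
    inline_datastreams: dict = field(default_factory=dict)  # dsid -> content


# Pretend this is a large metadata document read from disk.
payload = "<mods>" + "x" * 100_000 + "</mods>"

by_reference = QueuedObject("demo:1", file_datastreams={"MODS": "/tmp/issue1/MODS.xml"})
by_value = QueuedObject("demo:1", inline_datastreams={"MODS": payload})

# Size of what would be stored in the queue table for each variant.
reference_size = len(json.dumps(asdict(by_reference)))
value_size = len(json.dumps(asdict(by_value)))

print(f"by reference: {reference_size} bytes, by value: {value_size} bytes")
```

The object stored by reference stays small regardless of the file's size, while the one stored by value grows with its content.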
There is an abstract base class for preprocessors which can help out, but if preprocessing for a given source is going to be particularly CPU intensive or otherwise take a long time, one might be better off skipping the base class and populating the queue directly. Discussion around this topic likely merits its own page.
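The general shape of a preprocessor built on an abstract base class might look like the following Python model. Everything here is an assumption for illustration: the `Preprocessor` and `DirectoryPreprocessor` classes, the `add_to_queue` method, the in-memory list standing in for the database queue, and the one-object-per-file rule are not taken from the module.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Preprocessor(ABC):
    """Hypothetical base class: subclasses decide what to queue."""

    def __init__(self):
        self.queue = []  # stand-in for the database queue

    def add_to_queue(self, obj, resources=(), parent=None):
        self.queue.append(
            {"object": obj, "resources": list(resources), "parent": parent}
        )

    @abstractmethod
    def preprocess(self):
        """Scan the input and queue objects; return how many were queued."""


class DirectoryPreprocessor(Preprocessor):
    """Queues one object per file found under a directory (assumed rule)."""

    def __init__(self, root):
        super().__init__()
        self.root = Path(root)

    def preprocess(self):
        for path in sorted(self.root.rglob("*")):
            if path.is_file():
                # Reference the file by path; do not read its content here.
                obj = {"label": path.stem, "datastreams": {"OBJ": str(path)}}
                self.add_to_queue(obj, resources=[("file", str(path))])
        return len(self.queue)


with tempfile.TemporaryDirectory() as root:
    (Path(root) / "a.tif").write_bytes(b"")
    (Path(root) / "b.tif").write_bytes(b"")
    preprocessor = DirectoryPreprocessor(root)
    queued = preprocessor.preprocess()
    labels = [entry["object"]["label"] for entry in preprocessor.queue]
```

A source that is slow to scan could bypass `preprocess()` entirely and write queue entries by whatever means suits it, which is the trade-off described above.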
The ingest process consists of grabbing objects from the queue which are ready to be ingested and calling the abstract batchProcess method on each, to allow whatever computation is necessary to produce a correct object, less anything implemented via derivatives. Current implementations use this method to produce MODS and crosswalk it to DC, as we are outside of the usual ingest context where we would have an ingest form with a DC transform.
After the basics are completed in batchProcess, we then just pass the object to islandora_add_object() to be added to Fedora. This works due to subclassing the IslandoraNewFedoraObject class. Derivatives are then generated normally through implementations of hook_islandora_derivatives().
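The ingest step described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the module's PHP implementation: the state values, the `BatchObject` and `ModsObject` classes, and the `add_object` callback (standing in for islandora_add_object()) are all hypothetical, and the MODS and DC strings are placeholders rather than valid records.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape

# Assumed numeric states; the real values may differ.
STATE_NOT_READY = 0
STATE_READY = 1
STATE_DONE = 3


class BatchObject(ABC):
    """Hypothetical stand-in for an IslandoraBatchObject subclass."""

    def __init__(self, pid):
        self.pid = pid
        self.datastreams = {}

    @abstractmethod
    def batch_process(self):
        """Do whatever is needed to make the object complete."""


class ModsObject(BatchObject):
    def __init__(self, pid, title):
        super().__init__(pid)
        self.title = title

    def batch_process(self):
        # Produce MODS, then crosswalk it to DC (both are placeholders here).
        title = escape(self.title)
        self.datastreams["MODS"] = f"<mods><title>{title}</title></mods>"
        self.datastreams["DC"] = f"<dc><title>{title}</title></dc>"


def ingest_ready(queue, states, add_object):
    """Process and add every object whose state is STATE_READY."""
    ingested = []
    for pid, obj in queue.items():
        if states[pid] != STATE_READY:
            continue
        obj.batch_process()
        add_object(obj)  # stand-in for handing the object to the repository
        states[pid] = STATE_DONE
        ingested.append(pid)
    return ingested


repository = {}
queue = {
    "demo:1": ModsObject("demo:1", "First"),
    "demo:2": ModsObject("demo:2", "Second"),
}
states = {"demo:1": STATE_READY, "demo:2": STATE_NOT_READY}
ingested = ingest_ready(
    queue, states, lambda obj: repository.update({obj.pid: obj})
)
```

Only `demo:1` is processed in this run; `demo:2` stays queued until its state changes.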
Hierarchical ingests work by indicating the parent of a given object while preprocessing. If desired, the initial state of parent objects can be set such that all children will be ingested first. This feature helps to avoid generating and regenerating derivatives on the parent which aggregate children (as a PDF on a newspaper issue or a book might aggregate all of the pages "contained" therein). The core of how hierarchies work is based on implementations of IslandoraBatchObject::getChildren() and preprocessors. The core "scan" batch currently handles this (though it should possibly be broken out to be more reusable without the idea of "scanning" a directory or ZIP file); this "scan" batch is reused by the Islandora Book Batch directly.
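The children-first ordering can be illustrated with a small Python model. It is a sketch only: the state values, the `ingest_order` function, and the rule that a parent becomes ready once all of its children are done are assumptions about how such a scheme could work, not a description of the module's code.

```python
# Assumed numeric states; the real values may differ.
STATE_READY = 1
STATE_PENDING_CHILDREN = 2
STATE_DONE = 3


def ingest_order(parents, states):
    """Return the order in which objects would be ingested.

    parents maps each ID to its parent ID (or None); states maps each ID to
    its current state and is updated in place.
    """
    order = []
    progressed = True
    while progressed:
        progressed = False
        # Promote parents whose children have all been ingested.
        for pid, state in states.items():
            if state == STATE_PENDING_CHILDREN:
                children = [c for c, p in parents.items() if p == pid]
                if all(states[c] == STATE_DONE for c in children):
                    states[pid] = STATE_READY
                    progressed = True
        # Ingest everything that is currently ready.
        for pid in list(states):
            if states[pid] == STATE_READY:
                order.append(pid)
                states[pid] = STATE_DONE
                progressed = True
    return order


parents = {"issue:1": None, "page:1": "issue:1", "page:2": "issue:1"}
states = {
    "issue:1": STATE_PENDING_CHILDREN,  # parent waits for its pages
    "page:1": STATE_READY,
    "page:2": STATE_READY,
}
order = ingest_order(parents, states)
print(order)
```

Because the issue is ingested only after both pages, an aggregate derivative such as an issue-level PDF would need to be generated once rather than after each page.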