Metadata lifecycle
- Needs to be refined during https://github.com/OP-TED/ted-rdf-conversion-pipeline/issues/554
The Operational Metadata is collected to enable Data Operations.
When is the Operational Metadata collected?
Metadata can be generated at different points in the process:
- Before the transformation: Capturing what is already known about the pipeline and mappings.
- During transformation: Logging performance metrics or key events.
- After transformation: Recording the results and outcomes of the job.
When to collect the metadata depends on the transformation method. For example, in a streaming scenario, part-whole relations and count statistics may be captured during the transformation (the second point above), while in a batch process these statistics may be captured after the transformation (the third point).
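As an illustration of the three collection points listed above, here is a minimal Python sketch. The function name `run_with_metadata`, the field names, and the `transform` callable are all hypothetical and are not part of the pipeline; they only show where each kind of metadata could be recorded.

```python
import time
from typing import Any, Callable, Dict, List


def run_with_metadata(
    batch_id: str,
    notice_ids: List[str],
    transform: Callable[[str], Any],
) -> Dict[str, Any]:
    """Collect metadata before, during, and after a transformation run."""
    # Before the transformation: record what is already known.
    metadata: Dict[str, Any] = {
        "batch_id": batch_id,
        "notice_count": len(notice_ids),
        "events": [],
    }

    # During the transformation: log one key event per notice.
    started = time.perf_counter()
    succeeded = 0
    for notice_id in notice_ids:
        try:
            transform(notice_id)
            succeeded += 1
            metadata["events"].append({"notice_id": notice_id, "status": "ok"})
        except Exception as exc:  # hypothetical: any error marks the notice as failed
            metadata["events"].append(
                {"notice_id": notice_id, "status": "failed", "error": str(exc)}
            )

    # After the transformation: record results and outcomes.
    metadata["duration_seconds"] = time.perf_counter() - started
    metadata["succeeded"] = succeeded
    metadata["failed"] = len(notice_ids) - succeeded
    return metadata
```

In a streaming setup the counts could be updated inside the loop as each notice arrives; in a batch setup they would typically be computed once at the end, as in the last block of the sketch.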
Data operations happen at two levels (see: Data Operations#Granularity): batches and notices. The simplest approach is to maintain one JSON document for each batch and one for each notice, each containing the corresponding Operational Metadata.
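A minimal sketch of what such per-batch and per-notice documents could look like, written in Python. The field names (`type`, `id`, `notices`, `batch`) are assumptions chosen for illustration; the actual schema is not defined here.

```python
import json
from typing import Any, Dict, List


def batch_document(batch_id: str, notice_ids: List[str]) -> Dict[str, Any]:
    """One metadata document per batch, with part-whole links to its notices."""
    return {
        "type": "batch",
        "id": batch_id,
        "notices": list(notice_ids),
        "notice_count": len(notice_ids),
    }


def notice_document(notice_id: str, batch_id: str) -> Dict[str, Any]:
    """One metadata document per notice, pointing back to its batch."""
    return {"type": "notice", "id": notice_id, "batch": batch_id}


def build_documents(batch_id: str, notice_ids: List[str]) -> List[Dict[str, Any]]:
    """Return the batch document followed by one document per notice."""
    docs = [batch_document(batch_id, notice_ids)]
    docs.extend(notice_document(n, batch_id) for n in notice_ids)
    return docs


if __name__ == "__main__":
    for doc in build_documents("batch-1", ["notice-1", "notice-2"]):
        print(json.dumps(doc))
```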
Note: The metadata can be stored in the current databases, which simplifies https://github.com/OP-TED/ted-rdf-conversion-pipeline/issues/553. Alternatively, metadata can be logged as quads in a TriG file, using named graphs.
It should be possible to query the metadata to support Data Operations. Additionally, the stored metadata of batches and notices must be accessible to downstream systems through a URL, so that it is easy to consume. Later on, the metadata can be transformed into RDF to be linked to or included in a Data Catalog.
There exists only one metadata document for each Batch or Notice. A proposed approach for updating metadata is as follows:
- On Success: The metadata document is upserted, ensuring that the most recent information is always available.
- On Failure: Failure events are appended to the existing metadata document. This maintains a history of failures until the job succeeds.
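The update rules above can be sketched as follows, using an in-memory dictionary as a stand-in for the document store. The function names, the `failures` field, and the choice to keep the failure history after a success are assumptions made for this example; the text does not say whether the history is cleared once the job succeeds.

```python
from datetime import datetime, timezone
from typing import Any, Dict

Store = Dict[str, Dict[str, Any]]


def record_failure(store: Store, doc_id: str, error: str) -> None:
    """On failure: append a failure event to the single document for doc_id."""
    doc = store.setdefault(doc_id, {"id": doc_id, "failures": []})
    doc["status"] = "failed"
    doc["failures"].append(
        {"error": error, "at": datetime.now(timezone.utc).isoformat()}
    )


def record_success(store: Store, doc_id: str, metadata: Dict[str, Any]) -> None:
    """On success: upsert the document so it holds the most recent metadata."""
    previous = store.get(doc_id, {})
    store[doc_id] = {
        **metadata,
        "id": doc_id,
        "status": "success",
        # Assumption: earlier failures are kept for auditing.
        "failures": previous.get("failures", []),
    }
```

For example, two failed attempts followed by a successful one leave a single document for the batch with `status` set to `success` and two entries in `failures`. If the history should be dropped on success instead, the `failures` value in `record_success` can be replaced with an empty list.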