Operational Metadata

Cristian Vasquez edited this page Oct 21, 2024 · 14 revisions

What is Metadata?

Metadata refers to data that provides information about other data. In this context, operational metadata will include information related to data sources, job processes, and outcomes of transformations. Capturing this metadata is crucial for enabling effective data operations, tracking data lineage, and ensuring that downstream systems receive accurate and timely updates.

Operational Metadata

Operational metadata is a broad concept, but on this page we use it to mean the metadata that supports the reprocessing of data in response to pipeline changes. It complements catalog-related metadata and includes several key components:

  • Source Metadata: This type links to the resources being transformed. If the resource is a specific section of a document or package, the metadata must include selectors to identify the relevant part. It aids in replicating transformations, tracking errors at the source, and notifying upstream systems about data quality issues.
  • Job Metadata: This captures the details of the processes that transform data batches, whether they originate from the periodic transformation of packages in a folder or bucket, periodic fetching from an API, or any other custom set of sources.
  • Partition Metadata: This refers to a collection of notices grouped as a "Dataset" according to a specific batch strategy during the transformation process.
  • Notice Metadata: Each notice has its own metadata, regardless of whether it successfully produces RDF output. This metadata resides in a dedicated document and can be transformed into RDF within a specific named graph, potentially the same as the one used in a downstream triplestore.
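
The role of selectors in Source Metadata can be illustrated with a small sketch. It is not a proposed schema: the class name `SourceReference`, its fields, and the example URLs are assumptions made for illustration, since this page does not fix a selector syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceReference:
    """Points at the resource a transformation consumed (hypothetical names)."""
    resource_url: str               # e.g. a package in a folder or bucket
    selector: Optional[str] = None  # identifies one part of the resource, if needed

    def describe(self) -> str:
        # Without a selector, the whole resource is the source.
        if self.selector is None:
            return self.resource_url
        return f"{self.resource_url} [{self.selector}]"


# Placeholder values, not real locations.
whole = SourceReference("https://example.org/packages/2024-10-21.zip")
part = SourceReference(
    "https://example.org/packages/2024-10-21.zip",
    selector="notices/0001.xml",
)
```

Keeping the selector separate from the URL means an error can be reported either against the whole package or against the single part that failed.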

Levels of Granularity

Operational metadata will be available at two levels of granularity:

  • Batch level: Metadata describing sources, jobs, and data outcomes (partitions).
  • Individual level: Metadata specific to a single notice.

Initial list of fields (draft)

Batch metadata

For simplicity, we will treat Source Metadata, Job Metadata, and Partition Metadata as a unified set.

Access URL: Similar to the URL of the Airflow run.

  • Transformation System identifier (docker image)
  • Source metadata:
    • URL to the TEDMON file (if applicable).
    • URL to the run configuration (if applicable).
    • URL to the query used to trigger the batch (if applicable).
  • Job metadata
    • Link to the Airflow run (same as named-graph)
    • Links to job failures and other key event logs (if applicable)
    • Event start timestamp
    • Event end timestamp
  • Partition metadata
    • Total number of triples
    • Number of raw data instances
    • Number of successfully transformed data instances
    • Number of data instances with errors
    • Number of data instances skipped due to unavailable mappings
  • Metadata named-graph (the URL of the Airflow run)
  • Metadata creation timestamp (when this metadata was created/updated)
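
As a rough illustration, the draft batch fields above could be held in a record like the one below. All class and attribute names are hypothetical, the values are placeholders, and the consistency check rests on an assumption this page does not state: that every raw data instance ends up either transformed, in error, or skipped.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PartitionCounts:
    total_triples: int
    raw_instances: int
    transformed: int
    with_errors: int
    skipped_no_mapping: int

    def is_consistent(self) -> bool:
        # Assumption: each raw instance has exactly one of three outcomes.
        return self.raw_instances == (
            self.transformed + self.with_errors + self.skipped_no_mapping
        )


@dataclass
class BatchMetadata:
    transformation_system: str  # docker image identifier
    airflow_run_url: str        # also serves as the metadata named graph
    event_start: datetime
    event_end: datetime
    partition: PartitionCounts
    tedmon_file_url: Optional[str] = None        # "if applicable" fields
    run_configuration_url: Optional[str] = None
    trigger_query_url: Optional[str] = None
    event_log_urls: List[str] = field(default_factory=list)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def named_graph(self) -> str:
        # The draft uses the URL of the Airflow run as the named graph.
        return self.airflow_run_url


batch = BatchMetadata(
    transformation_system="example/transformer:1.0.0",
    airflow_run_url="https://airflow.example.org/runs/run-42",
    event_start=datetime(2024, 10, 21, 8, 0, tzinfo=timezone.utc),
    event_end=datetime(2024, 10, 21, 8, 30, tzinfo=timezone.utc),
    partition=PartitionCounts(
        total_triples=120000,
        raw_instances=100,
        transformed=90,
        with_errors=4,
        skipped_no_mapping=6,
    ),
)
```

The fields marked "if applicable" in the list are modelled as optional so that a batch triggered without a TEDMON file or query can still be recorded.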

Notice metadata

Access URL: Similar to the URL of the notice, possibly with a postfix (to be defined).

  • Notice Representation variants:
    • Links to HTML, PDF and XML notices in TED
  • Notice Identifiers:
    • Notice URI
    • Notice UUID
    • Publication number (if available)
    • OJS number (if available)
    • Notice version (used to replace older versions or for other URI handling)
  • XML metadata:
    • SDK version
    • Notice subtype
    • Procedure UUID
    • XML size (used to prioritize)
    • Publication date
  • Transformation:
    • Mapping package URI (preferably the URL of the specific commit in GitHub)
    • Link to the corresponding Batch metadata
    • Transformation status (success, failure, omitted)
    • Transformation duration
    • Notice transformation error logs (Airflow links + run ID), if applicable
    • Error type (to be defined)
  • RDF Output:
    • Target ontology
    • Target SHACL profile
    • Target ontology version
  • Destination:
    • Target Cellar named-graph
    • Target Cellar metadata named-graph
  • Ethics and Privacy:
    • Details of when private data was disclosed or removed
  • Metadata named-graph (The URI of the notice + postfix)
  • Metadata creation timestamp (when this metadata was created/updated)
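
A subset of the notice fields above can be sketched as a record to show how the named graph and the transformation status might fit together. This is only an illustration: the class names, the attribute names, and in particular the postfix value `/metadata` are assumptions, since the page leaves the postfix undefined. All values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

METADATA_POSTFIX = "/metadata"  # hypothetical; the postfix is still to be defined


class TransformationStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    OMITTED = "omitted"


@dataclass
class NoticeMetadata:
    notice_uri: str
    notice_uuid: str
    notice_version: str
    status: TransformationStatus
    batch_named_graph: str                     # link to the batch metadata
    mapping_package_uri: Optional[str] = None
    publication_number: Optional[str] = None   # "if available"
    error_log_url: Optional[str] = None        # only set when something failed

    @property
    def named_graph(self) -> str:
        # The draft defines this as the URI of the notice plus a postfix.
        return self.notice_uri + METADATA_POSTFIX

    def has_rdf_output(self) -> bool:
        # Metadata exists for every notice, even without RDF output.
        return self.status is TransformationStatus.SUCCESS


failed = NoticeMetadata(
    notice_uri="https://example.org/notice/abc",
    notice_uuid="00000000-0000-0000-0000-000000000000",
    notice_version="01",
    status=TransformationStatus.FAILURE,
    batch_named_graph="https://airflow.example.org/runs/run-42",
    error_log_url="https://airflow.example.org/runs/run-42/logs",
)
```

The failed notice still carries a complete metadata record and its own named graph, which is what allows errors to be traced back to the batch and the source.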

Metadata to support de-duplication (future)

Notice metadata may be linked to Metadata Repositories for further processing. For example, when a notice refers to multiple organizations that will require deduplication in the future, these need to be marked for re-processing when these organizations are definitively identified.
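
One possible way to mark notices for re-processing is a simple index from provisional organization identifiers to the notices that mention them. The sketch below assumes such identifiers exist. `ReprocessingIndex`, its methods, and the identifiers used are hypothetical and not part of any existing component.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ReprocessingIndex:
    # Maps a provisional organization identifier to the notices mentioning it.
    notices_by_org: Dict[str, Set[str]] = field(default_factory=dict)

    def mark(self, notice_uri: str, org_ids: List[str]) -> None:
        # Record that the notice depends on organizations not yet deduplicated.
        for org_id in org_ids:
            self.notices_by_org.setdefault(org_id, set()).add(notice_uri)

    def resolve(self, org_id: str) -> List[str]:
        # Once the organization is definitively identified, return the notices
        # to re-process (sorted for stable output) and clear the mark.
        return sorted(self.notices_by_org.pop(org_id, set()))


index = ReprocessingIndex()
index.mark("https://example.org/notice/1", ["org-a", "org-b"])
index.mark("https://example.org/notice/2", ["org-a"])

# "org-a" has now been identified; both notices need another pass.
to_reprocess = index.resolve("org-a")
```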