title |
---|
Background |
A Research Object (RO) provides a machine-readable mechanism to communicate the diverse set of digital and real-world resources that contribute to an item of research. The aim of an RO is to replace the traditional academic publication, a PDF with a couple of supplementary materials, with a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations.
This is of particular importance as all domains of research and science increasingly rely on computational analysis, yet we face a reproducibility crisis because key components are not sufficiently tracked, archived or reported.
Examples of items that should be included in a Research Object:
- Manuscripts and preprints
- Lab notebooks
- Data (raw and processed)
- Computational workflows and scripts
- Results (graphs, derived data)
- Slides
- Metadata
- Computational logs
The Research Object initiative has iteratively developed specifications for machine-readable formats for communicating Research Objects. The formalization of Research Objects combines existing Linked Data standards:
- W3C RDF, primarily as JSON-LD
- OAI-ORE for aggregating resources
- W3C Web Annotation Data Model for linking descriptive resources to the resources and the RO they describe
- W3C PROV and PAV to provide provenance, authorship and attribution for resources, the RO and the annotations
- Dublin Core Terms for common metadata (title, description, format)
- ORCID for identifying people
The Research Object ontologies were created to combine the above and add a few missing pieces, forming the combined vocabulary for describing ROs, but they do not themselves formalize how the Research Object is saved or transmitted.
The existing RO formats have been used for portal systems like RO Hub, using RDF as the common serialization across REST resources, and for the generation of RO Bundle ZIP files and BagIt archives by workflow systems like Apache Taverna and Common Workflow Language.
In all of these instantiations the RO consists of an outer manifest that lists:
- Identity of the Research Object
- List of aggregated resources
- List of annotations that further describe resources
- Basic provenance and typing of the RO and its resources
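
The four parts listed above can be sketched as a plain data structure. This is only an illustration: the key names (`id`, `aggregates`, `annotations`, `about`, `content`, `createdBy`, `createdOn`) are loosely modelled on the RO Bundle manifest and should be checked against the relevant specification, and the file names and person are made-up placeholders.

```python
import json


def build_manifest(ro_id, resources, annotations, created_by, created_on):
    """Assemble the four parts of an outer manifest as a plain dict.

    All key names are illustrative assumptions, not a normative format.
    """
    return {
        # 1. Identity of the Research Object
        "id": ro_id,
        # 4. Basic provenance of the RO itself
        "createdBy": created_by,
        "createdOn": created_on,
        # 2. Aggregated resources, each with basic typing (media type)
        "aggregates": [
            {"uri": uri, "mediatype": mediatype} for uri, mediatype in resources
        ],
        # 3. Annotations: which annotation file describes which resource
        "annotations": [
            {"about": about, "content": content} for about, content in annotations
        ],
    }


manifest = build_manifest(
    ro_id="/",
    resources=[
        ("analysis.py", "text/x-python"),  # placeholder resource
        ("data/results.csv", "text/csv"),  # placeholder resource
    ],
    annotations=[
        ("analysis.py", "annotations/analysis-description.ttl"),
    ],
    created_by={"name": "Alice Example"},  # placeholder person
    created_on="2019-01-01T00:00:00+00:00",
)

print(json.dumps(manifest, indent=2))
```
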
The Research Object manifest is saved in a resource called `.ro/manifest.rdf` (REST), `.ro/manifest.json` (Bundle) or `metadata/manifest.json` (BagIt), depending on the serialization.
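
A consumer that handles more than one serialization needs to know where to look for the manifest. A minimal sketch of that lookup follows; the three paths come from the text above, while the function name and the short serialization keys (`rest`, `bundle`, `bagit`) are hypothetical.

```python
from pathlib import PurePosixPath

# Manifest location per serialization, as listed in the text.
# The dictionary keys are labels chosen for this sketch.
MANIFEST_PATHS = {
    "rest": ".ro/manifest.rdf",
    "bundle": ".ro/manifest.json",
    "bagit": "metadata/manifest.json",
}


def manifest_path(serialization: str) -> PurePosixPath:
    """Return the manifest path, relative to the RO root, for a serialization."""
    key = serialization.strip().lower()
    if key not in MANIFEST_PATHS:
        known = ", ".join(sorted(MANIFEST_PATHS))
        raise ValueError(f"Unknown serialization {serialization!r}; expected one of: {known}")
    return PurePosixPath(MANIFEST_PATHS[key])


for name in ("REST", "Bundle", "BagIt"):
    print(name, "->", manifest_path(name))
```
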
As a separation of concerns, anything more detailed was delegated to separate annotation files linked from the manifest, allowing them their own provenance, format, vocabularies and scope. The role of the manifest in this scenario was thus to provide the glue between the resources, saying which annotations describe which resources, using which format.
For instance, a workflow engine like Apache Taverna can include its native workflow definition file as an aggregated resource, and generate a simplified wfdesc annotation file that shows the structure of the workflow; the provenance would show that the annotation file was generated by the software (and can thus be flagged for updating if the workflow was changed). Similarly, the RO Hub allows collaborative description of individual resources; saving these as separate annotation resources means the Research Object also keeps track of who described which resource, and when.
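
The glue role of the manifest can be illustrated with a small lookup over a manifest-like dictionary. This is a sketch under assumptions: the field names (`modified`, `createdBy`, `createdOn`, and the `type` values `software` and `person`) and all file names are hypothetical, chosen only to show how per-annotation provenance lets a tool find software-generated annotations that predate a change to the resource they describe.

```python
from datetime import datetime

# Hypothetical manifest fragment; field names are assumptions for illustration.
manifest = {
    "aggregates": [
        {"uri": "workflow.t2flow", "modified": "2019-03-02T10:00:00+00:00"},
    ],
    "annotations": [
        {
            "about": "workflow.t2flow",
            "content": "annotations/workflow.wfdesc.ttl",
            "createdBy": {"type": "software", "name": "workflow-engine"},
            "createdOn": "2019-03-01T09:00:00+00:00",
        },
        {
            "about": "workflow.t2flow",
            "content": "annotations/description.ttl",
            "createdBy": {"type": "person", "name": "Alice Example"},
            "createdOn": "2019-03-01T12:00:00+00:00",
        },
    ],
}


def annotations_about(manifest, resource_uri):
    """Return all annotation entries that describe the given resource."""
    return [a for a in manifest["annotations"] if a["about"] == resource_uri]


def stale_generated_annotations(manifest):
    """Return content paths of software-generated annotations that are
    older than the last modification of the resource they describe."""
    modified = {
        r["uri"]: datetime.fromisoformat(r["modified"])
        for r in manifest["aggregates"]
    }
    stale = []
    for ann in manifest["annotations"]:
        if ann["createdBy"]["type"] != "software":
            continue  # human-written descriptions are not regenerated
        target_modified = modified.get(ann["about"])
        if target_modified is None:
            continue  # annotation target is not an aggregated resource
        if datetime.fromisoformat(ann["createdOn"]) < target_modified:
            stale.append(ann["content"])
    return stale


print(stale_generated_annotations(manifest))
```

Here the human-written description is kept as-is, while the generated annotation is reported because the workflow file changed after the annotation was created.
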
The composition model above facilitates machine-generated ROs with rich provenance tracking and extraction of metadata from existing formats.
However, work on BDBag Research Objects highlighted that this separation approach becomes overly complicated for human-edited Research Objects, and advocated embedding the annotation content inside the manifest JSON.
Recent advances like schema.org and BioSchemas have simplified vocabularies for common metadata descriptions and made JSON-LD mainstream, negating some of the reasons for splitting such annotations into separate files.
The freedom of separate annotations means that consumers of ROs would not know what to expect inside their content, increasing the importance of formalizing Research Object profiles. Previous work such as carefully crafted Minimum Information Models and more recent work on RDF Shapes provide RO validation that takes annotations into account, but what remains is how to communicate to developers and end-users how to programmatically generate ROs that are consistent with a particular profile.
The recent DataCrate approach points out that the existing RO manifest forces too much structure for simpler use-cases of general who-what-where type metadata, and advocates a simpler approach based primarily on schema.org in a JSON-LD `CATALOG.json`, but also mandating a human-readable HTML `CATALOG.html`.
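
To give a feel for this style of who-what-where metadata, the sketch below builds a small schema.org description as JSON-LD and renders a matching human-readable page. It is not the actual DataCrate layout: the choice of schema.org properties (`name`, `description`, `creator`, `contentLocation`, `hasPart`), the helper names and the example values are all assumptions made for illustration.

```python
import json
from html import escape


def build_catalog(name, description, creator, location, files):
    """Describe a dataset with general who-what-where metadata using schema.org terms."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,                                             # what
        "description": description,
        "creator": {"@type": "Person", "name": creator},          # who
        "contentLocation": {"@type": "Place", "name": location},  # where
        "hasPart": [{"@type": "MediaObject", "name": path} for path in files],
    }


def render_html(catalog):
    """Render a minimal human-readable page from the same metadata."""
    items = "\n".join(
        f"    <li>{escape(part['name'])}</li>" for part in catalog["hasPart"]
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(catalog['name'])}</title></head>\n"
        "<body>\n"
        f"  <h1>{escape(catalog['name'])}</h1>\n"
        f"  <p>{escape(catalog['description'])}</p>\n"
        f"  <p>Creator: {escape(catalog['creator']['name'])}</p>\n"
        f"  <ul>\n{items}\n  </ul>\n"
        "</body></html>\n"
    )


# Placeholder values, invented for this example.
catalog = build_catalog(
    name="Soil samples <2018>",
    description="Measurements collected during field work.",
    creator="Alice Example",
    location="Example field site",
    files=["data/results.csv", "notes/notebook.txt"],
)

catalog_json = json.dumps(catalog, indent=2)  # would be saved as CATALOG.json
catalog_html = render_html(catalog)           # would be saved as CATALOG.html
print(catalog_json)
```

Generating both files from one dictionary keeps the machine-readable and human-readable views in step, which is the point of mandating the HTML alongside the JSON-LD.
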
Started as a community project, the RO-Crate specification evolved from DataCrate to add Research Object aspects and to further formalize the recommendations.