JSON and BIDS-Prov #146

cmaumet · 2024-11-22T09:59:15Z

Update proposal for BIDS Prov (BEP028)

Dissolve the "Justification for Separating Provenance from file JSON" section by allowing generatedBy to be specified in the corresponding .json file, potentially with further relaxations, such as not requiring an id (assumed to be unique), and overall have a clear schema which we could encode in our BIDS schema and validate.


yarikoptic · 2024-11-22T16:28:51Z

BTW, this would also be somewhat aligned with the effort of @surchs et al. in https://github.com/OpenNeuroDatasets-JSONLD/ where, instead of aiming to equip OpenNeuro BIDS datasets with .jsonld files "directly", they work on improving the description of phenotypic data within JSON files, so that improved (over original BIDS) .jsonld files can then be "derived" from the JSON files' metadata. In other words: they are concentrating on describing metadata in a more concise, to-the-point way, relying on @context etc. being "encoded" elsewhere, to then be able to render the enriched .jsonld files. I think that is the path which BEP028 should allow for as well: to provide a way to "enter" provenance in a concise, more "human accessible" form, to later be able to produce the "standard" prov.jsonld files. IMHO some additional aspects worth adding to the "design" here:

- *prov.jsonld across different levels (jsonld at any level #144, _prov.jsonld at different levels #145) would still be allowed but validator must be able to detect possible conflicts and incongruencies and warn or error
  - TODO: identify common scenarios like that (e.g. records with the same id but conflicting metadata -- different other fields etc)
- should be sufficiently expressive to accommodate current GeneratedBy ("Absorb"/migrate already defined in BIDS dataset_description.json GeneratedBy #148)
- schema for JSON records would be defined within BIDS schema (here is current for GeneratedBy) to be validatable by stock bids-validator
- develop good set of examples for already present common use-cases e.g.
  - conversion software(s) version specification -- e.g. for dcm2niix and heudiconv
  - additional manual annotation
- cool feature idea (good-for-hackathon style)
  - develop tool to automatically extract/enrich based on git history. Might already be in part addressed by https://docs.datalad.org/projects/metalad/en/latest/generated/datalad_metalad.extractors.runprov.html#module-datalad_metalad.extractors.runprov but I thought there was even more ...
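
To make the "same id but conflicting metadata" scenario concrete, here is a minimal sketch of the kind of check a validator could run over provenance records collected from several levels. It is only an illustration, not how bids-validator works: the file names, the `urn:example:` ids, the `Id`/`Label`/`Version` keys and the `find_conflicts` helper are all assumptions made up for this example.

```python
import json
from collections import defaultdict


def find_conflicts(records_by_file):
    """Map each record id to the sorted list of fields whose values disagree.

    records_by_file: {path: [record, ...]}, each record a dict with an "Id" key
    (hypothetical layout). A field that is absent at one level is not treated
    as a conflict; only differing values for the same field are reported.
    """
    # id -> field -> set of JSON-serialized values seen for that field
    values_seen = defaultdict(lambda: defaultdict(set))
    for records in records_by_file.values():
        for record in records:
            record_id = record.get("Id")
            if record_id is None:
                continue  # records without an id cannot clash by id
            for field, value in record.items():
                if field != "Id":
                    serialized = json.dumps(value, sort_keys=True)
                    values_seen[record_id][field].add(serialized)

    conflicts = {}
    for record_id, fields in values_seen.items():
        disagreeing = sorted(f for f, vals in fields.items() if len(vals) > 1)
        if disagreeing:
            conflicts[record_id] = disagreeing
    return conflicts


# Illustrative data: the same id described at dataset level and subject level.
records_by_file = {
    "prov/dataset_prov.jsonld": [
        {"Id": "urn:example:conversion-1", "Label": "dcm2niix", "Version": "v1.0.20220720"},
    ],
    "sub-01/sub-01_prov.jsonld": [
        {"Id": "urn:example:conversion-1", "Label": "dcm2niix", "Version": "v1.0.20211006"},
    ],
}

print(find_conflicts(records_by_file))
```

Whether such a clash should be a warning or an error, and whether a missing field at one level counts as a conflict, is exactly the kind of scenario the TODO asks to enumerate.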

edit 1: IMHO this issue is one of the most important aspects for facilitating adoption.

cmaumet · 2024-11-25T14:32:40Z

@yarikoptic -- may I ask you to help me keep one topic per issue? Otherwise it's hard for me to keep track 🙏

> dissolving "Justification for Separating Provenance from file JSON" section by allowing generatedBy to be specified in the corresponding .json file, potentially with further relaxations such as not demanding id (assume to be unique) and overall have a clear schema which we could encode in our BIDS schema and validate

"Justification for Separating Provenance from file JSON" is no longer present in the BIDS-Prov spec at https://bids.neuroimaging.io/bep028.

The discussion on how to fit with current BIDS "wasGeneratedBy" is in #148.

> BTW, this would also be somewhat aligned with the effort of @surchs et al in https://github.com/OpenNeuroDatasets-JSONLD/ where instead of aiming at directly equipping openneuro BIDS datasets with .jsonld files "directly", they work on improving description of phenotypic data within JSON files so then there could be improved (over original BIDS) .jsonld files "derived" from the JSON files metadata. In other words: they are concentrating on the description of metadata in more concise/to the point way, relying on @context etc to be "encoded" elsewhere, to then be able to render the enriched .jsonld files. I think that is the path which BEP028 should allow for as well: to provide a way to "enter" provenance in concise, more "human accessible" form, to later be able to produce the "standard" prov.jsonld files. IMHO some additional aspects worth adding to "design" here

If I understand correctly, this is very much related to #147; let's discuss over there? neurobagel is awesome, but those tools did require a large amount of effort. For BIDS-Prov, I think we need to balance the complexity of the standard with the complexity of the tooling required to read the files...

> schema for JSON records would be defined within BIDS schema (here is current for GeneratedBy) to be validatable by stock bids-validator

Yes, this is something I hope we can aim for in the current version of the BIDS-Prov standard. Note that, among all the possible ways to serialize JSON-LD graphs, the spec focuses on two specific ones: see "BIDS-Prov JSON-LD file" and "Alternative representation for file-level provenance JSON-LD". --> #149
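
For illustration only, this is roughly what a structural check of sidecar records could look like if they follow the shape of the existing BIDS `GeneratedBy` field (a list of objects with a required `Name` and optional `Version`, `Description`, `CodeURL`, `Container`). It is a hand-rolled sketch, not the bids-validator implementation or the BIDS schema itself; the `check_generated_by` helper and its messages are hypothetical, and tolerating unknown keys is an assumption.

```python
# Known GeneratedBy fields and the Python type each is expected to have.
KNOWN_FIELDS = {
    "Name": str,
    "Version": str,
    "Description": str,
    "CodeURL": str,
    "Container": dict,
}


def check_generated_by(value):
    """Return a list of human-readable problems; an empty list means no problem found."""
    if not isinstance(value, list):
        return ["GeneratedBy should be a list of objects"]
    problems = []
    for i, entry in enumerate(value):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected an object")
            continue
        if "Name" not in entry:
            problems.append(f"entry {i}: missing required field 'Name'")
        for key, val in entry.items():
            expected = KNOWN_FIELDS.get(key)
            # Assumption of this sketch: unknown keys are tolerated, not flagged.
            if expected is not None and not isinstance(val, expected):
                problems.append(f"entry {i}: field {key!r} should be {expected.__name__}")
    return problems


# Illustrative sidecar content; the second entry lacks the required Name.
sidecar = {
    "GeneratedBy": [
        {"Name": "dcm2niix", "Version": "v1.0.20220720"},
        {"Version": "1.3.0"},
    ]
}

for problem in check_generated_by(sidecar["GeneratedBy"]):
    print(problem)
```

A real rule set would live in the BIDS schema so that the stock validator picks it up; the point here is only that the record shape is simple enough to check mechanically.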

> develop good set of examples for already present common use-cases e.g.
> - conversion software(s) version specification -- e.g. for dcm2niix and heudiconv
> - additional manual annotation

See "Examples" in the spec, I think we have a good set of examples... In particular the SPM ones (AFNI and FSL will require more work): https://github.com/bids-standard/BEP028_BIDSprov/tree/master/examples/from_parsers/spm

Note that "BIDS-Prov is [...] limited to the capture of data processing, future considerations including other types of provenance are listed in section 'Future perspectives'", so I think "additional manual annotation" may be out of scope (depending on what this means; sorry if I misunderstood). We'll work on a DICOM to NIfTI conversion example with @bclenet when we can --> #150

So I'll close this issue and we can continue discussing the various questions in the dedicated issues. @yarikoptic: if there is a separate point we need to discuss here, let me know and we can open another specific issue.

yarikoptic · 2024-11-25T18:48:29Z

This might still be a separate issue overall: having a PROV record in .json sidecars/dataset_description.json, whereas

- About context #147 is about whether to repeat the context in every .json* file
- "Absorb"/migrate already defined in BIDS dataset_description.json GeneratedBy #148 is about migrating the existing GeneratedBy, which largely relies on this issue (JSON and BIDS-Prov #146) in its formulation but could also migrate into .jsonld files.

> See "Examples" in the spec, I think we have a good set of examples... In particular the SPM ones

BTW - the "reason" for allowing the record within the .json file: many (HPC or not; inode limits or "slower git checkout") would not appreciate it if every data file came with yet another .jsonld file. Absorbing all of them into a single file (per dataset or other level) would complicate locating the corresponding PROV record for a file and would require tooling. Hence I feel it is very important to allow a concise "high level" description ("pipeline/workflow level summary") in the JSON sidecar.
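
A rough sketch of that idea: the sidecar carries only a concise record, and a tool later expands it into a JSON-LD document by attaching a context kept elsewhere and filling in an id when none was given. Everything specific here is an assumption for illustration: the `CONTEXT_URL` placeholder, the `Records` and `Generated` keys, the id scheme and the `sidecar_to_jsonld` helper are not taken from the BIDS-Prov spec.

```python
import json

# Hypothetical placeholder; the real context would be published by the spec.
CONTEXT_URL = "https://example.org/bids-prov/context.json"


def sidecar_to_jsonld(data_file, sidecar):
    """Expand the concise GeneratedBy entries of one sidecar into a JSON-LD-like dict."""
    records = []
    for i, entry in enumerate(sidecar.get("GeneratedBy", [])):
        # If the author gave no id, derive one from the file path and position
        # (made-up scheme), so ids stay unique without being typed by hand.
        record = {"Id": entry.get("Id", f"bids::{data_file}#GeneratedBy-{i}")}
        record.update({k: v for k, v in entry.items() if k != "Id"})
        record["Generated"] = f"bids::{data_file}"
        records.append(record)
    return {"@context": CONTEXT_URL, "Records": records}


# Illustrative sidecar for a converted raw file.
sidecar = {"GeneratedBy": [{"Name": "dcm2niix", "Version": "v1.0.20220720"}]}
doc = sidecar_to_jsonld("sub-01/anat/sub-01_T1w.nii.gz", sidecar)
print(json.dumps(doc, indent=2))
```

With something like this, no extra file per data file is needed in the dataset itself; the expanded .jsonld can be produced on demand.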

yarikoptic · 2024-11-27T01:33:33Z

I will reopen this issue since, as I stated above, I think it best describes the specific aspect of allowing a PROV record within a regular sidecar .json file.

yarikoptic · 2024-11-27T02:04:53Z

> See "Examples" in the spec, I think we have a good set of examples...

Those are nice indeed! But they aim for BIDS derivative datasets. There, indeed, it might be worth making tools to just dump a big .jsonld per subject/session or above and "be done", without fear of abusing inodes on the cluster or of users needing to "tune" them later. But if we start talking about "raw" BIDS datasets, in my experience, even with automation like heudiconv etc., there is sometimes A LOT of curation going on to make them proper, sometimes with tools which might also like to add their PROV records.

> future considerations including other types of provenance are listed in section "Future perspectives" so I think "additional manual annotation" may be out of scope (depending on what this means sorry if I misunderstood).

where is that section? I failed to git grep -i 'perspectives' in this repo.

cmaumet · 2024-11-27T12:59:59Z

> those are nice indeed! But they aim for BIDS derivative datasets. There, indeed, might be worth making tools to just dump a big .jsonld per each subject/session or above and "be done" without fears to abuse inodes on the cluster, or that users would need to "tune" them later. But if we start talking about "raw" BIDS datasets, in my experience, even with automations like heudiconv etc, there some times A LOT of curation going on to make them proper. Some times with tools which might also like to add their PROV records.

The main focus of BIDS-Prov is indeed derived datasets. We'll have a look with @bclenet at your proposal to have an example of DICOM to NIfTI conversion (see #150), but let's see how feasible this is and how much we need to tweak the model for that.

> where is that section? I failed to git grep -i 'perspectives' in this repo.

The spec is in the google doc available at: https://bids.neuroimaging.io/bep028 :)

About #146 (comment): to me, the discussion about JSON and wasGeneratedBy is already in #151; can we use that issue instead of the current one (which overlaps with many ideas)?

yarikoptic · 2024-11-27T16:56:07Z

> About #146 (comment) To me the discussion about json and wasGeneratedBy is already in #151, can we use that issue instead of the current one (that overlaps many ideas?)

#151 is about "descriptions". Did you mean

"Absorb"/migrate already defined in BIDS dataset_description.json GeneratedBy #148 ?

IMHO those two are largely independent of this one, as they could potentially be solved by direct conversion into, or integration with, the .jsonld representation, whereas this one is about having a representation at the level of .json sidecar files.

cmaumet mentioned this issue Nov 22, 2024

Next steps for BIDS-Prov #125

Open

yarikoptic mentioned this issue Nov 22, 2024

"Absorb"/migrate already defined in BIDS dataset_description.json GeneratedBy #148

Open

4 tasks

yarikoptic mentioned this issue Nov 25, 2024

About context #147

Open

cmaumet mentioned this issue Nov 25, 2024

Dicom to nifti conversion usecase #150

Open

cmaumet closed this as completed Nov 25, 2024

yarikoptic reopened this Nov 27, 2024

yarikoptic mentioned this issue Nov 27, 2024

jsonld at any level #144

Open

yarikoptic mentioned this issue Nov 27, 2024

"Integrate" with already defined in BIDS descriptions and _desc entities #151

Open
