JSON and BIDS-Prov #146

Open
cmaumet opened this issue Nov 22, 2024 · 7 comments

@cmaumet
Collaborator

cmaumet commented Nov 22, 2024

Update proposal for BIDS Prov (BEP028)

By @yarikoptic in #125 (comment)

  • dissolving "Justification for Separating Provenance from file JSON" section by allowing generatedBy to be specified in the corresponding .json file, potentially with further relaxations such as not demanding id (assume to be unique) and overall have a clear schema which we could encode in our BIDS schema and validate
@yarikoptic
Contributor

yarikoptic commented Nov 22, 2024

BTW, this would also be somewhat aligned with the effort of @surchs et al. in https://github.com/OpenNeuroDatasets-JSONLD/ where, instead of equipping OpenNeuro BIDS datasets with .jsonld files "directly", they work on improving the description of phenotypic data within JSON files, so that improved (over original BIDS) .jsonld files can be "derived" from the JSON file metadata. In other words: they concentrate on describing metadata in a more concise, to-the-point way, relying on @context etc. being "encoded" elsewhere, and can then render the enriched .jsonld files. I think that is the path BEP028 should allow for as well: provide a way to "enter" provenance in a concise, more "human accessible" form, and later produce the "standard" prov.jsonld files. IMHO some additional aspects worth adding to the "design" here

edit 1: IMHO this issue is one of the most important aspects for facilitating adoption.
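
As a rough illustration of the "enter concisely, render .jsonld later" path described above, the sketch below expands a concise sidecar record into a JSON-LD-style document. Everything specific in it is an assumption made for illustration and is not taken from the BEP028 spec: the context URL, the `wasGeneratedBy` output key, the derived `@id` pattern, the example version string, and the function name `sidecar_to_jsonld` are all hypothetical.

```python
import json

# Hypothetical context location; a real one would be published with the spec.
CONTEXT_URL = "https://example.org/bids-prov/context.json"


def sidecar_to_jsonld(sidecar: dict, file_path: str) -> dict:
    """Expand concise GeneratedBy entries of a sidecar into a JSON-LD-style dict."""
    records = []
    for index, entry in enumerate(sidecar.get("GeneratedBy", [])):
        record = dict(entry)  # copy so the caller's sidecar is left untouched
        # Assumption: ids are optional in the concise form, so derive one if missing.
        record.setdefault("@id", f"{file_path}#generatedBy-{index}")
        records.append(record)
    return {"@context": CONTEXT_URL, "@id": file_path, "wasGeneratedBy": records}


# Illustrative values only; the version string is made up.
sidecar = {"GeneratedBy": [{"Name": "dcm2niix", "Version": "v1.0.20220720"}]}
document = sidecar_to_jsonld(sidecar, "sub-01/anat/sub-01_T1w.nii.gz")
print(json.dumps(document, indent=2))
```

The point of the design is that the sidecar stays short and human-editable, while the context and identifiers are supplied by tooling at render time.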

@cmaumet
Collaborator Author

cmaumet commented Nov 25, 2024

@yarikoptic -- may I ask you to help me keep one topic per issue? Otherwise it's hard for me to keep track 🙏

> dissolving "Justification for Separating Provenance from file JSON" section by allowing generatedBy to be specified in the corresponding .json file, potentially with further relaxations such as not demanding id (assume to be unique) and overall have a clear schema which we could encode in our BIDS schema and validate

"Justification for Separating Provenance from file JSON" is no longer present in the BIDS-Prov spec at https://bids.neuroimaging.io/bep028.

The discussion on how to fit with current BIDS "wasGeneratedBy" is in #148.

> BTW, this would also be somewhat aligned with the effort of @surchs et al. in https://github.com/OpenNeuroDatasets-JSONLD/ where, instead of equipping OpenNeuro BIDS datasets with .jsonld files "directly", they work on improving the description of phenotypic data within JSON files, so that improved (over original BIDS) .jsonld files can be "derived" from the JSON file metadata. In other words: they concentrate on describing metadata in a more concise, to-the-point way, relying on @context etc. being "encoded" elsewhere, and can then render the enriched .jsonld files. I think that is the path BEP028 should allow for as well: provide a way to "enter" provenance in a concise, more "human accessible" form, and later produce the "standard" prov.jsonld files. IMHO some additional aspects worth adding to the "design" here

If I understand correctly, this is closely related to #147; let's discuss it over there? Neurobagel is awesome, but those tools required a large amount of effort. For BIDS-Prov, I think we need to balance the complexity of the standard against the complexity of the tooling required to read the files...

> schema for JSON records would be defined within BIDS schema (here is current for GeneratedBy) to be validatable by stock bids-validator

Yes, this is something I hope we can aim for in the current version of the BIDS-Prov standard. Note that, among all the possible ways to serialize JSON-LD graphs, the spec focuses on two specific ones: see "BIDS-Prov JSON-LD file" and "Alternative representation for file-level provenance JSON-LD". --> #149
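
To make "validatable" concrete, here is a minimal sketch of the kind of check a schema for sidecar records would encode. It is written as plain Python rather than against the actual BIDS schema or bids-validator. The key names are an assumption modeled on the existing BIDS GeneratedBy object, and `validate_generated_by` is a hypothetical helper, not part of any existing tool.

```python
# Assumed field names, modeled on the existing BIDS GeneratedBy object.
REQUIRED_KEYS = {"Name"}
OPTIONAL_KEYS = {"Version", "Description", "CodeURL", "Container"}


def validate_generated_by(records) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(records, list):
        return ["GeneratedBy must be a list of objects"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        keys = set(record)
        for key in sorted(REQUIRED_KEYS - keys):
            problems.append(f"entry {index}: missing required key {key!r}")
        for key in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS):
            problems.append(f"entry {index}: unknown key {key!r}")
    return problems


print(validate_generated_by([{"Name": "heudiconv", "Version": "1.0"}]))
print(validate_generated_by([{"Version": "1.0"}]))
```

Whether unknown keys should be rejected or merely tolerated is a design decision for the spec; the sketch rejects them only to show where such a rule would live.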

> develop good set of examples for already present common use-cases e.g.
>   • conversion software(s) version specification -- e.g. for dcm2niix and heudiconv
>   • additional manual annotation

See "Examples" in the spec, I think we have a good set of examples... In particular the SPM ones (AFNI and FSL will require more work): https://github.com/bids-standard/BEP028_BIDSprov/tree/master/examples/from_parsers/spm Note that "BIDS-Prov is [...] limited to the capture of data processing, future considerations including other types of provenance are listed in section "Future perspectives" so I think "additional manual annotation" may be out of scope (depending on what this means; sorry if I misunderstood). @bclenet and I will work on a DICOM-to-NIfTI conversion example when we can. --> #150

So I'll close this issue and we can continue discussing the various questions in the dedicated issues. @yarikoptic: if there is a separate point we need to discuss here, let me know and we can open another, more specific issue.

cmaumet closed this as completed Nov 25, 2024
@yarikoptic
Contributor

This might still be a separate issue overall of having a PROV record in .json sidecars/dataset_description.json whenever

> See "Examples" in the spec, I think we have a good set of examples... In particular the SPM ones

BTW, the "reason" for allowing the record within the .json file: many users (on HPC or not; think inode limits or "slower git checkout") would not appreciate yet another .jsonld file for every data file. Absorbing all of them into a single file (per dataset or at another level) would complicate locating the PROV record corresponding to a file and would require tooling. Hence I feel it is very important to allow a concise "high level" description ("pipeline/workflow level summary") in the JSON sidecar.
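
To show why the sidecar placement keeps lookup simple, here is a minimal sketch that finds the provenance summary for a data file by opening the .json file next to it, with no index or extra tooling. It assumes the record lives under a `GeneratedBy` key and that the sidecar shares the data file's name up to the first dot. It ignores the BIDS inheritance principle, and both function names are hypothetical.

```python
import json
from pathlib import Path


def sidecar_path(data_file: Path) -> Path:
    """Return the .json sidecar next to a data file, e.g. x.nii.gz -> x.json."""
    stem = data_file.name.split(".", 1)[0]  # drop every extension, incl. ".nii.gz"
    return data_file.with_name(stem + ".json")


def read_generated_by(data_file: Path) -> list:
    """Read GeneratedBy from the sidecar; return [] if no sidecar or no key."""
    sidecar = sidecar_path(data_file)
    if not sidecar.is_file():
        return []
    with sidecar.open(encoding="utf-8") as handle:
        return json.load(handle).get("GeneratedBy", [])


print(sidecar_path(Path("sub-01/anat/sub-01_T1w.nii.gz")))
```

With a single aggregated .jsonld per dataset, the equivalent lookup would need to parse the whole graph and match on file identifiers, which is the tooling cost mentioned above.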

@yarikoptic
Contributor

I will reopen this issue since, as I stated above, I think it best describes the specific aspect of allowing a PROV record within a regular sidecar .json file.

@yarikoptic
Contributor

> See "Examples" in the spec, I think we have a good set of examples...

Those are nice indeed! But they aim for BIDS derivative datasets. There, indeed, it might be worth making tools that just dump one big .jsonld per subject/session or above and "be done", without fear of abusing inodes on the cluster or of users needing to "tune" them later. But if we start talking about "raw" BIDS datasets, in my experience, even with automation like heudiconv etc., there is sometimes A LOT of curation going on to make them proper, sometimes with tools which might also like to add their PROV records.

> future considerations including other types of provenance are listed in section "Future perspectives" so I think "additional manual annotation" may be out of scope (depending on what this means sorry if I misunderstood).

Where is that section? I failed to find it with git grep -i 'perspectives' in this repo.

@cmaumet
Collaborator Author

cmaumet commented Nov 27, 2024

> Those are nice indeed! But they aim for BIDS derivative datasets. There, indeed, it might be worth making tools that just dump one big .jsonld per subject/session or above and "be done", without fear of abusing inodes on the cluster or of users needing to "tune" them later. But if we start talking about "raw" BIDS datasets, in my experience, even with automation like heudiconv etc., there is sometimes A LOT of curation going on to make them proper, sometimes with tools which might also like to add their PROV records.

The main focus of BIDS-Prov is indeed derived datasets. @bclenet and I will look at your proposal to have an example of DICOM-to-NIfTI conversion (see #150), but let's see how feasible this is and how much we need to tweak the model for that.

> Where is that section? I failed to find it with git grep -i 'perspectives' in this repo.

The spec is in the google doc available at: https://bids.neuroimaging.io/bep028 :)

About #146 (comment): to me, the discussion about JSON and wasGeneratedBy is already in #151. Can we use that issue instead of the current one (which overlaps with many ideas)?

@yarikoptic
Contributor

> About #146 (comment): to me, the discussion about JSON and wasGeneratedBy is already in #151. Can we use that issue instead of the current one (which overlaps with many ideas)?

#151 is about "descriptions". Did you mean

IMHO those two are largely independent of this one, as they could potentially be solved by direct conversion into, or integration with, the .jsonld representation, whereas this one is about having a representation at the level of .json sidecar files.
