Minutes Data Working Group 17 Jun 2021
Brad edited this page Jun 20, 2021
- Issues tagged with Data WG Label
- Issue 857: Experiments Proposal: Customization, Reproducibility and Extensibility - Feedback Welcomed
- Discussions tagged with Data WG Label
- Discussion 2110: Covid19 segmentation
- Discussion 2079: Error with PersistentDataset in pytorch distributed setting
- Recent discussions from MONAI Advisory committee
- Discussion 2212: Proposal feedback wanted: CSV files to ingest image and patient data from TCIA
- Cardiology-specific needs (relates to call for society involvement)
- Pathology working group under way
- Real-time video
- Development updates relating to Data WG
-
Discussion 2212: Proposal feedback wanted: CSV files to ingest image and patient data from TCIA
- Planning is underway for how CSV usage will be implemented; the structure is a high priority
- Decision taken to use CSV to make adoption easier now, rather than standardize a structure on a formal ontology (like FHIR or DICOM)
- Question is, how do we map the fields from CSV to MONAI?
- Currently, a CSV column name is mapped into an object in code which can then be called (e.g., "image" and "image_full" can be retrieved)
- CSV poses a problem because there is no connection from a column in a CSV file to a real-world data representation; CSV also lacks any capability for strong typing or data validation
- Recommend considering a strongly-typed “business mapping file” that goes alongside the CSV file, with a proper definition of the columns in the CSV
- E.g., see this example: https://www.w3.org/2013/csvw/wiki/CSV-LD
- It should describe the field type, and also the ontology / best reference; e.g., reference the source of data as a DICOM tag, an HL7 field, a SNOMED or LOINC value, etc.
- This permits data validation; e.g., if the CSV references files that don’t exist, MONAI should fail gracefully when loading images.
- Data import from CSV should run some form of validation: check that each value matches its declared type (a URL, a string, a number, a file reference) and run the corresponding checks (e.g., is it a string? does the file exist? can I retrieve the URL?)
- Sources of data in a CSV file can be problematic; e.g., a DICOM reference (study UID) with IP address, port number, AE title repeated 75,000 times can be tedious
- Storing this in a config file somewhere is not enough; e.g., how do you specify the different locations of DICOM files (e.g., the DIMSE endpoint IP/port/AE title, or the DICOMweb endpoint)? What happens when a location changes (e.g., the data is replicated to a new site and the endpoint changes), or when data is in 2 different locations (e.g., a current PACS and a VNA for long-term archive, or a rad PACS and a cardio PACS)?
- A “business mapping file” (described above) could perhaps also provide hints on where to retrieve DICOM from
- How can multiple records of the same patient be “linked” in CSV? E.g., what if there are two images under one "subject" in a CSV file? How would joining and merging work?
- E.g. check out https://tadpole.grand-challenge.org/Data/ as an example
- This dataset has 12+ CSV files that are merged and linked somehow; there must be primary / foreign keys described in CSV
- This can become unwieldy to maintain as the files get more complicated
- At this stage, CSVs would need to be pre-processed before MONAI loads them; if there are multiple CSVs, they would need to be merged down to 1 CSV file that is then loaded into MONAI
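
The points above about a typed mapping file, validation on import, and merging several CSVs down to one can be illustrated with a minimal sketch. It uses only the Python standard library and is not MONAI code; `COLUMN_MAP`, `check_value`, `validate_rows`, and `merge_on_key` are hypothetical names chosen for illustration, and the mapping format is an assumption rather than the CSVW format or anything the WG agreed on.

```python
import csv
import io
import os
from urllib.parse import urlparse

# Hypothetical "business mapping": column name -> (declared type, source reference).
# The format is an assumption for illustration only.
COLUMN_MAP = {
    "subject_id": ("string", "DICOM (0010,0020) Patient ID"),
    "weight": ("number", "DICOM (0010,1030) Patient's Weight"),
    "image": ("file", "local file path"),
}


def check_value(value, type_name):
    """Return an error message, or None if the value matches its declared type."""
    if type_name == "string":
        return None if value.strip() else "empty string"
    if type_name == "number":
        try:
            float(value)
            return None
        except ValueError:
            return f"not a number: {value!r}"
    if type_name == "file":
        return None if os.path.isfile(value) else f"file not found: {value!r}"
    if type_name == "url":
        # Only checks the shape of the URL; it does not try to retrieve it.
        parsed = urlparse(value)
        ok = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        return None if ok else f"not a URL: {value!r}"
    return f"unknown type: {type_name!r}"


def validate_rows(rows, column_map):
    """Collect (row_number, column, message) for every failed check."""
    errors = []
    for number, row in enumerate(rows, start=1):
        for column, (type_name, _reference) in column_map.items():
            value = row.get(column)
            if value is None:
                errors.append((number, column, "missing value"))
                continue
            message = check_value(value, type_name)
            if message:
                errors.append((number, column, message))
    return errors


def merge_on_key(primary_rows, secondary_rows, key):
    """Left-join secondary_rows onto primary_rows using a shared key column."""
    lookup = {row[key]: row for row in secondary_rows}
    return [{**lookup.get(row[key], {}), **row} for row in primary_rows]


# Two small in-memory CSVs standing in for separate files linked by subject_id.
images = list(csv.DictReader(io.StringIO("subject_id,image\ns1,/no/such/file.nii.gz\n")))
patients = list(csv.DictReader(io.StringIO("subject_id,weight\ns1,abc\n")))

table = merge_on_key(images, patients, "subject_id")
for problem in validate_rows(table, COLUMN_MAP):
    print(problem)
```

Collecting all problems before loading lets the import report every bad row at once instead of failing on the first missing file. A real implementation would also need to handle several rows per subject, which this simple key lookup does not.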
-
Issue 857: Experiments Proposal: Customization, Reproducibility and Extensibility - Feedback Welcomed
- WG should review offline and add comments perhaps?
-
Discussion 2110: Covid19 segmentation
- This is coming together and will be released in the future, pending publication and discussion with collaborators
- One option - AWS open data might be a good way to work on this
- Format needs to be defined for usage
-
Discussion 2079: Error with PersistentDataset in pytorch distributed setting
- This issue is fixed, see issue 2086
- An open question remains: should dataset caching be done for re-use? Internal representations are not exchanged
-
Other Business
- Evaluation WG is reaching out to MICCAI organizers, to see if some samples of training / testing data, with the types of data / structures of data, can be shared
- The aim is to determine whether any preprocessing of the data is needed
- Maybe only a subset of data just to see what it looks like; e.g. handling multiple labels.