
Minutes Data Working Group 17 Jun 2021


Agenda

  1. Issues tagged with Data WG Label
    1. Issue 857: Experiments Proposal: Customization, Reproducibility and Extensibility - Feedback Welcomed
  2. Discussions tagged with Data WG Label
    1. Discussion 2110: Covid19 segmentation
    2. Discussion 2079: Error with PersistentDataset in pytorch distributed setting
  3. Recent discussions from MONAI Advisory committee
    1. Discussion 2212: Proposal feedback wanted: CSV files to ingest image and patient data from TCIA
    2. Cardiology-specific needs (relates to call for society involvement)
    3. Pathology working group under way
    4. Real-time video
  4. Development updates relating to Data WG

Notes

  • Discussion 2212: Proposal feedback wanted: CSV files to ingest image and patient data from TCIA

    • Work is underway now on planning how CSV usage will be implemented, and the structure is a high priority
    • The decision was taken to use CSV to make adoption easier now, rather than standardizing the structure on a formal ontology (such as FHIR or DICOM)
    • Question is, how do we map the fields from CSV to MONAI?
      • Currently, a CSV column name is mapped into an object in code which can then be called (e.g., "image" and "image_full" can be retrieved); see the first sketch after this list
    • CSV poses a problem because there is no connection from a column in a CSV file to a real-world data representation; it also lacks any capability for strong typing or data validation
    • Recommend considering a strongly-typed “business mapping file” that goes alongside the CSV file, with a proper definition of the columns in the CSV (see the second sketch after this list)
      • E.g., see this example : https://www.w3.org/2013/csvw/wiki/CSV-LD
      • It should describe the field type, and also the ontology / best reference; e.g., reference the source of data as a DICOM tag, an HL7 field, a SNOMED or LOINC value, etc.
      • This permits data validation; e.g., if the CSV references files that don’t exist, MONAI should fail gracefully when loading images.
      • Data import from CSV should run some form of validation against the declared type (e.g., a URL, a string, a number, a file reference) and perform the corresponding checks (e.g., is it a string? does the file exist? can the URL be retrieved?)
    • Sources of data in a CSV file can be problematic; e.g., a DICOM reference (study UID) with IP address, port number, AE title repeated 75,000 times can be tedious
      • It’s not enough if this is stored in a config file somewhere. For example, how do you specify the different locations of DICOM files (e.g., the DIMSE endpoint IP/port/AE title, or the DICOMweb endpoint)? What happens when the endpoint changes (e.g., the data is replicated to a new site), or when data lives in two different locations (e.g., a current PACS and a VNA for long-term archive, or a radiology PACS and a cardiology PACS, etc.)?
      • A “business mapping file” (described above) perhaps can also provide tips on where to retrieve DICOM from
    • How are multiple records of the same patient “linked” in CSV? E.g., if two images belong to the same "subject" in a CSV file, how would joining and merging work?
      • E.g. check out https://tadpole.grand-challenge.org/Data/ as an example
      • This dataset has 12+ CSV files that are merged and linked somehow; there must be primary / foreign keys described in CSV
      • This can become unwieldy to maintain as the files get more complicated
      • At this stage, CSVs would need to be pre-processed before MONAI loads them; if there are multiple CSVs, they would need to be merged / pre-processed down to a single CSV file that MONAI loads (see the third sketch after this list)
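
As a point of reference for the column-to-object mapping noted above, here is a minimal sketch (not the final proposal) of turning CSV rows into the data dictionaries MONAI's dictionary transforms consume. The file name "subjects.csv" is hypothetical, and it is assumed that the "image" and "image_full" columns hold image file paths.

```python
# Minimal sketch: each CSV row becomes a dict keyed by column name, which
# MONAI's dictionary transforms can then address by key.
import pandas as pd
from monai.data import Dataset
from monai.transforms import Compose, LoadImaged

df = pd.read_csv("subjects.csv")     # hypothetical file name
records = df.to_dict("records")      # one dict per row, e.g. {"image": ..., "image_full": ...}

ds = Dataset(
    data=records,
    transform=Compose([LoadImaged(keys=["image", "image_full"])]),
)
first = ds[0]                        # loads the images referenced by the first row
```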
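
A hedged sketch of the strongly-typed “business mapping file” idea discussed above: a sidecar definition (shown here as an inline Python dict; a JSON or CSV-LD document would serve the same purpose) that types each CSV column and points to its ontology reference, plus a validation pass that reports problems before training starts. All column names, references, and file names are illustrative and not part of MONAI.

```python
# Hypothetical sidecar mapping: column -> declared type and ontology reference.
from pathlib import Path

import pandas as pd

MAPPING = {
    "image":      {"type": "file",   "ref": "DICOM SOP Instance UID (0008,0018)"},
    "age":        {"type": "number", "ref": "LOINC 30525-0 (Age)"},
    "report_url": {"type": "url",    "ref": "HL7 FHIR DiagnosticReport"},
}

def validate(csv_path, mapping):
    """Return a list of human-readable problems instead of failing mid-training."""
    df = pd.read_csv(csv_path)
    problems = []
    for col, spec in mapping.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        for i, value in df[col].items():
            if spec["type"] == "file" and not Path(str(value)).exists():
                problems.append(f"row {i}: file not found: {value}")
            elif spec["type"] == "number" and pd.isna(pd.to_numeric(value, errors="coerce")):
                problems.append(f"row {i}: {col} is not numeric: {value}")
            elif spec["type"] == "url" and not str(value).startswith(("http://", "https://")):
                problems.append(f"row {i}: {col} does not look like a URL: {value}")
    return problems

issues = validate("subjects.csv", MAPPING)  # hypothetical CSV; inspect or fail fast on problems
```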
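
For the multi-CSV case (e.g., TADPOLE-style datasets), a small sketch of pre-processing several CSV files down to the single table MONAI would load, assuming hypothetical file names and a shared "subject_id" key:

```python
import pandas as pd

demographics = pd.read_csv("demographics.csv")  # hypothetical: subject_id, age, sex
scans = pd.read_csv("scans.csv")                # hypothetical: subject_id, visit, image

# Inner join keeps only subjects present in both tables; a left join would keep
# every scan and leave missing demographics as NaN.
merged = scans.merge(demographics, on="subject_id", how="inner")
merged.to_csv("merged_for_monai.csv", index=False)
```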
  • Issue 857: Experiments Proposal: Customization, Reproducibility and Extensibility - Feedback Welcomed

    • The WG should perhaps review this offline and add comments
  • Discussion 2110: Covid19 segmentation

    • This is coming together and will be released in the future; it is waiting on publication and discussion with collaborators
    • One option - AWS open data might be a good way to work on this
    • Format needs to be defined for usage
  • Discussion 2079: Error with PersistentDataset in pytorch distributed setting

    • This issue is fixed, see issue 2086
    • An open question remains: should dataset caching be done for re-use? Internal representations are not exchanged (see the sketch below)
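
For context on the caching question, a brief sketch of re-using a PersistentDataset cache between runs; the data list and cache directory are hypothetical, and the cached files hold MONAI-internal intermediate results, which is why they are intended for local re-use rather than exchange.

```python
from monai.data import PersistentDataset
from monai.transforms import Compose, LoadImaged

data = [{"image": "img_001.nii.gz"}]              # hypothetical records
transform = Compose([LoadImaged(keys="image")])

# Deterministic transform results are cached to cache_dir; a second run pointing
# at the same directory re-uses them instead of re-loading from scratch.
ds = PersistentDataset(data=data, transform=transform, cache_dir="./persist_cache")
sample = ds[0]
```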
  • Other Business

    • The Evaluation WG is reaching out to MICCAI organizers to see whether some samples of training / testing data, along with the types and structures of the data, can be shared
      • The goal is to determine whether any preprocessing of the data is needed
      • Perhaps only a subset of the data, just to see what it looks like (e.g., handling multiple labels)