This repository tracks issues and progress on the MIMIC-IV to OMOP ETL process.
It contains an abbreviated version of the MIMIC ETL and Extract processes. All SQL logic is included and can be run manually against a Databricks instance, but the Python code cannot be executed in this form without an array of helper functions.
The general approach to linking the structured EHR data with multimodal observational data, such as waveform and image files, is as follows.
To manage and ensure common unique IDs across EHR, waveform, and imaging files, each non-EHR file (e.g. an ECG file for a particular lead) is associated with a record in the Procedure_Occurrence table of the site's local OMOP instance, identified by a unique autogenerated Procedure_Occurrence_ID. Each Procedure_Occurrence record has a Visit_Detail_ID field, a foreign key that links the procedure occurrence to a distinct Visit_Detail record (e.g. an ICU admission). That Visit_Detail record sits within a visit (the period from hospital admission to discharge) for a distinct patient, and the visit has its own unique ID and record in the Visit_Occurrence table.
Each record in the Procedure_Occurrence table has both the local and the standard representations of the procedure that produced the associated non-EHR file. For example, a record will hold the CPT4 code for Intracerebral Electroencephalogram (1864009) and its associated Standard OMOP concept (4295432), together with the unique person_ID for the patient (autogenerated in the Person table for each patient during ETL) and the unique Procedure_Occurrence_ID (also autogenerated during ETL).
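
The foreign-key chain described above can be sketched with an in-memory SQLite database in Python. This is only an illustration under stated assumptions: the table definitions are cut down to the key columns, the row IDs (1, 10, 100, 1000) are made up, `resolve_file_context` is a hypothetical helper name, and storing the source code in `procedure_source_concept_id` is my assumption about where the local representation would live. The two concept values are the ones from the example above.

```python
import sqlite3

# Minimal, illustrative subset of the OMOP tables involved in the link.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id));
CREATE TABLE visit_detail (
    visit_detail_id INTEGER PRIMARY KEY,
    visit_occurrence_id INTEGER REFERENCES visit_occurrence(visit_occurrence_id),
    person_id INTEGER REFERENCES person(person_id));
CREATE TABLE procedure_occurrence (
    procedure_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    visit_detail_id INTEGER REFERENCES visit_detail(visit_detail_id),
    procedure_concept_id INTEGER,
    procedure_source_concept_id INTEGER);
-- Made-up IDs for one patient, one visit, one ICU stay, one procedure.
INSERT INTO person VALUES (1);
INSERT INTO visit_occurrence VALUES (10, 1);
INSERT INTO visit_detail VALUES (100, 10, 1);
INSERT INTO procedure_occurrence VALUES (1000, 1, 100, 4295432, 1864009);
""")


def resolve_file_context(conn, procedure_occurrence_id):
    """Follow the keys from a procedure record to its visit detail and visit.

    Returns (person_id, visit_detail_id, visit_occurrence_id), or None when
    the procedure record does not exist or is not linked to a visit detail.
    """
    return conn.execute(
        """
        SELECT po.person_id, vd.visit_detail_id, vo.visit_occurrence_id
        FROM procedure_occurrence AS po
        JOIN visit_detail AS vd
          ON vd.visit_detail_id = po.visit_detail_id
        JOIN visit_occurrence AS vo
          ON vo.visit_occurrence_id = vd.visit_occurrence_id
        WHERE po.procedure_occurrence_id = ?
        """,
        (procedure_occurrence_id,),
    ).fetchone()
```

Given only the Procedure_Occurrence_ID attached to a file, a lookup like `resolve_file_context(conn, 1000)` recovers the patient, the ICU stay, and the enclosing visit.
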
To link that Procedure_Occurrence record to the EEG waveform data file acquired during the ICU part of that patient's visit, we could:
- Add a new column to the Procedure_Occurrence table and populate it with a unique ID for each linked file generated by the associated procedure.
- Embed the Procedure_Occurrence_ID in the name of the non-EHR file
- Encode the Procedure_Occurrence_ID in the header of the non-EHR file or an associated metadata file
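
As a rough sketch of the second option (embedding the Procedure_Occurrence_ID in the file name), the Python below appends the ID to the file stem and recovers it again. The `__po<ID>` tag and both function names are hypothetical conventions chosen for illustration, not something this repository currently implements.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical convention: "<stem>__po<procedure_occurrence_id><suffix>"
_PO_TAG = re.compile(r"__po(\d+)$")


def embed_procedure_id(path: str, procedure_occurrence_id: int) -> str:
    """Return the path with the procedure_occurrence_id appended to the file stem."""
    p = Path(path)
    return str(p.with_name(f"{p.stem}__po{procedure_occurrence_id}{p.suffix}"))


def extract_procedure_id(path: str) -> Optional[int]:
    """Recover the procedure_occurrence_id from a tagged file name, or None if untagged."""
    match = _PO_TAG.search(Path(path).stem)
    return int(match.group(1)) if match else None
```

For example, `embed_procedure_id("eeg_session01.edf", 2000000042)` gives `eeg_session01__po2000000042.edf`, and `extract_procedure_id` on that name returns `2000000042`. Keeping the tag at the end of the stem leaves the original name and extension readable.
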
Based on the above concept, the provisional structure of the MIMIC OMOP + Multimodal Extract is as follows:
- When parsing all imaging and waveform files, I created a ‘registry’ table for each modality (e.g. image_registry, waveform_registry) that contains the following columns:
  - Autogenerated File Identifier (integer)
  - SUBJECT_ID (person source value)
  - PERSON_ID (OMOP person identifier)
  - SESSION_ID (MIMIC imaging/waveform session grouping)
  - DATE
  - TIME
  - SRCFILE (full name of source file)
  - TRGFILE (full name of target file in DataAcquisition format)
- These registry tables are meant to serve as a precursor to the Imaging (and likely, Waveform) extension tables in OMOP, and I expect to expand them with other metadata fields that are relevant for those extensions
- For now, I have used the registry tables to create per-file entries in the procedure_occurrence table with the following attributes:
  - PROCEDURE_OCCURRENCE_ID: Autogenerated File Identifier + Constant Buffer
    - I used 2000000000 for waveform files (procedures from 2000000000 to roughly 2000000300)
    - I used 2001000000 for image files (procedures from 2001000000 to roughly 2001000500)
  - PERSON_ID: PERSON_ID from the registry tables
  - PROCEDURE_CONCEPT_ID: Generic concept_id values for imaging and monitoring
    - 4141651 (Measuring and Monitoring Procedure) for waveforms
    - 4180938 (Imaging Procedure) for images
  - PROCEDURE_DATE(TIME): DATE (+TIME) fields from the registries
  - VISIT_OCCURRENCE_ID: Not yet linked for MIMIC. I assume this will need a datetime-based match, but that can be done during cohort creation without an explicit VISIT_OCCURRENCE_ID on the record, assuming the patient has an associated visit in the EHR data
    - TBD with Brian
  - VISIT_DETAIL_ID: Not yet linked for MIMIC. I assume this will need a datetime-based match, but that can be done during cohort creation without an explicit VISIT_DETAIL_ID on the record, assuming the patient has an associated visit in the EHR data
    - TBD with Brian
  - PROCEDURE_SOURCE_VALUE: Full name of the file associated with the procedure record (TRGFILE in the registry)
- Thoughts about next steps in developing this further:
  - Add tables for the Imaging extension (and an analogous waveform extension), so as not to swamp the MEASUREMENT table with features
  - Map the different imaging and waveform modalities to more specific concepts than the two generic procedures listed above
  - Shift the registry ‘PROCEDURE’ entries to their associated extension tables (e.g. image_occurrence and wf_occurrence), with additional metadata captured in those registries
  - Add feature extraction to the iterative file parsing to populate the feature-based tables (analogous to NOTE_NLP)
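
The registry-to-procedure_occurrence mapping described above could look roughly like the Python sketch below. The `RegistryRow` and `Visit` classes, the function names, and the combined `acquired_at` field (DATE plus TIME) are illustrative assumptions, not the helpers used in the actual ETL. The ID buffers and concept IDs are the values listed above. The visit match is a plain interval containment check, which is only one possible reading of the datetime-based match that is still TBD.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

# Constant buffers and generic concept IDs, as listed in the notes above.
WAVEFORM_ID_BUFFER = 2_000_000_000
IMAGE_ID_BUFFER = 2_001_000_000
WAVEFORM_CONCEPT_ID = 4141651  # Measuring and Monitoring Procedure
IMAGE_CONCEPT_ID = 4180938     # Imaging Procedure


@dataclass
class RegistryRow:
    """One row of image_registry or waveform_registry (illustrative shape)."""
    file_id: int            # autogenerated file identifier
    subject_id: str         # person source value
    person_id: int          # OMOP person identifier
    session_id: str
    acquired_at: datetime   # DATE and TIME combined (assumption)
    srcfile: str
    trgfile: str


@dataclass
class Visit:
    """Hypothetical minimal view of a visit_occurrence record."""
    visit_occurrence_id: int
    person_id: int
    start: datetime
    end: datetime


def match_visit(row: RegistryRow, visits: Iterable[Visit]) -> Optional[int]:
    """Return the first visit of the same person whose span contains the file time."""
    for visit in visits:
        if visit.person_id == row.person_id and visit.start <= row.acquired_at <= visit.end:
            return visit.visit_occurrence_id
    return None


def to_procedure_occurrence(row: RegistryRow, modality: str,
                            visits: Iterable[Visit] = ()) -> dict:
    """Build a procedure_occurrence record (as a dict) from one registry row."""
    if modality == "waveform":
        buffer, concept_id = WAVEFORM_ID_BUFFER, WAVEFORM_CONCEPT_ID
    elif modality == "image":
        buffer, concept_id = IMAGE_ID_BUFFER, IMAGE_CONCEPT_ID
    else:
        raise ValueError(f"unknown modality: {modality!r}")
    return {
        "procedure_occurrence_id": buffer + row.file_id,
        "person_id": row.person_id,
        "procedure_concept_id": concept_id,
        "procedure_date": row.acquired_at.date(),
        "procedure_datetime": row.acquired_at,
        "visit_occurrence_id": match_visit(row, visits),  # None when unmatched
        "procedure_source_value": row.trgfile,
    }
```

One consequence of these two buffers is that waveform file identifiers have to stay below 1,000,000, since 2000000000 + 1000000 is where the image range begins.
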