This is an open-source Python repository based on the HAIM GitHub package study. We replicate the embedding generation from the HAIM multimodal dataset, which contains data from 4 modalities (tabular, time-series, text, and images) and 11 unique sources. Note: the notes data are not publicly available, so embeddings were not generated for that data type. Also, the vision probabilities calculation was not used. To access those embeddings, please check the csv file generated by the study. Below is an overview of the different data sources used and the transformation type applied to generate the embeddings:
The datasets used to replicate the embeddings generation are publicly available on PhysioNet (https://physionet.org/content/haim-multimodal/1.0.1/).
Follow the instructions below to download and copy them.
Download:
- MIMIC-CXR-JPG - chest radiographs with structured labels v2.0.0 (https://physionet.org/content/mimic-cxr-jpg/2.0.0/)
- MIMIC-IV v1.0 (https://physionet.org/content/mimiciv/1.0/)
Copy the unzipped folders to the csvs folder.
Install the requirements under Python 3.9.13 as follows:
$ pip install -r requirements.txt
In this repository, we intend to gradually provide five Jupyter notebooks: one for each of the four data modalities, and a final one covering all modalities.
To generate embeddings, we based our code on subject_id. The user can also opt for stay_id-level embedding generation. However, this can produce multiple rows for the same patient in the time-series analysis: data tied to event times will be spread over multiple rows, and machine learning algorithms might produce erroneous predictions.
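The difference between the two grouping levels can be illustrated with a toy example (the data and column values below are hypothetical, not real MIMIC records): a patient with two ICU stays yields two rows when aggregating per stay_id, but a single row when aggregating per subject_id.

```python
import pandas as pd

# Hypothetical illustration: one patient (subject_id 10) with two ICU stays.
events = pd.DataFrame({
    "subject_id": [10, 10, 10, 10],
    "stay_id":    [100, 100, 200, 200],
    "heart_rate": [80, 90, 70, 75],
})

# Aggregating per stay_id produces one row per stay (two rows here),
# while aggregating per subject_id produces a single row per patient.
by_stay = events.groupby("stay_id")["heart_rate"].mean()
by_subject = events.groupby("subject_id")["heart_rate"].mean()

print(len(by_stay))     # 2 - two rows for the same patient
print(len(by_subject))  # 1 - a single row per patient
```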
For more details about the different tables and column names, please refer to the MIMIC video tutorials at: MIMIC Tutorial
Below is an overview of the different MIMIC modules and how they link to patients' movements through the hospital:
(Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet. https://doi.org/10.13026/s6n6-xd98.)
To get an overview of the links between tables, check the folder Table diagrams.
Below is a table summarizing important information about the most important tables used in the embeddings generation.
To produce these numbers, we used the notebook general dataset exploration.ipynb in the folder notebooks.
The summary table shows, for example, that there are 382278 unique subject_id values in the patients table (created from the csv file patients.csv). However, the icustays table contains only 53150 unique subject_id values, meaning that not all patients in the database have ICU stays. Similarly, not all patients have chest radiographs: there are only 65379 unique subject_id values in the mimic_cxr_chexpert table.
To find the patients who have both ICU stays and chest radiographs, we ran the notebook icu_cxr_patients.ipynb and found that 20245 patients have both a chest radiograph and an ICU stay.
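The intersection computed in icu_cxr_patients.ipynb amounts to comparing the unique subject_id sets of the two tables. A minimal sketch with toy stand-ins (real runs would read the tables from the csvs folder instead):

```python
import pandas as pd

# Toy stand-ins for icustays and the chest X-ray label table
# (hypothetical subject_id values, not real MIMIC data).
icustays = pd.DataFrame({"subject_id": [1, 2, 3, 3]})
cxr = pd.DataFrame({"subject_id": [2, 3, 4]})

# Patients appearing in both tables = intersection of unique subject_ids.
icu_ids = set(icustays["subject_id"])
cxr_ids = set(cxr["subject_id"])
both = icu_ids & cxr_ids

print(len(both))  # 2 - patients with both an ICU stay and a chest radiograph
```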
We recommend that the user start by running the notebook general tutorial notebook.ipynb to become familiar with the different tables and data in the MIMIC database. At the end of that notebook, the user will have generated a sample of 10 patients that will be used for the remainder of the work.
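Drawing such a cohort can be sketched as a reproducible random sample; the code below is a hedged illustration in the spirit of the tutorial notebook, with an assumed table layout rather than the notebook's actual code:

```python
import pandas as pd

# Hypothetical patients table; a real run would load patients.csv from csvs.
patients = pd.DataFrame({"subject_id": range(1, 101)})

# Fixed random_state makes the 10-patient sample reproducible across runs.
cohort = patients["subject_id"].sample(n=10, random_state=42).tolist()

print(len(cohort))  # 10
```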
The second step is to generate features from demographic and time-series data. To do so, the user should use the notebook Demographics_TimeSeries_features_Tutorial.ipynb.
Example of generating chart events features:
import os
# Move to the repository root so the src imports and csv paths resolve
os.chdir('../..')
from src.data import constants
from src.utils import extraction_classes
import pandas as pd

# Extract chart-event features for every patient in the sample cohort
chart_fusion = []
for patient in constants.cohort:
    chart_fusion.append(extraction_classes.Event_extraction(patient).extract_chart_events(patient))

# Stack the per-patient feature rows into a single dataframe
chart_fusion = pd.concat(chart_fusion, axis=0)
At the end of that notebook, the user will have generated a csv file fusion_ts_dem_dataframe.csv that contains features from demographics, chart events, lab events, and procedure events. That csv file will be used with the vision features file to create the final features csv file.
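Conceptually, combining the two csv outputs is a join on subject_id. The sketch below illustrates that idea with toy data and assumed column names; the actual concatenation is handled by Generate_Final_Features, described further down:

```python
import pandas as pd

# Toy stand-ins for fusion_ts_dem_dataframe.csv and fusion_vision.csv
# (column names are assumptions for illustration only).
ts_dem = pd.DataFrame({"subject_id": [1, 2], "hr_mean": [80.0, 72.5]})
vision = pd.DataFrame({"subject_id": [1, 2], "img_emb_0": [0.1, 0.3]})

# Inner join on subject_id keeps one row per patient with columns
# from both feature files.
final = ts_dem.merge(vision, on="subject_id", how="inner")

print(final.shape)  # (2, 3)
```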
The third step is to generate features from image data. To do so, the user should use the notebook Extract_vision_features_Tutorial.ipynb. At the end of that notebook, the csv file fusion_vision.csv will be generated.
Then we can use the function Generate_Final_Features to concatenate all feature types into the final embeddings file:
# Import the function from the module
from src.utils.Generate_Final_Features import Generate_Final_Features

# Call the extraction function once and keep the resulting dataframe
final_features = Generate_Final_Features()

# Display the results (in a notebook cell)
final_features

# Export the results to a csv file
final_features.to_csv('csvs/Final_Features.csv', index=False)