Feature/extract #44

Draft
wants to merge 101 commits into base: main
101 commits
adf5e76
update gitignore
mathias-samuelides Nov 20, 2023
25a5ca5
add script to test extract
mathias-samuelides Nov 21, 2023
d5a961a
in progress
mathias-samuelides Nov 23, 2023
48ead4d
make extract work
mathias-samuelides Nov 24, 2023
74c0354
refact in progress
mathias-samuelides Nov 24, 2023
1798af9
change use_icu type
mathias-samuelides Nov 24, 2023
f50efc7
raw path
mathias-samuelides Nov 24, 2023
c27dbb4
rename file
mathias-samuelides Nov 24, 2023
e9a4e75
icd conversion
mathias-samuelides Nov 25, 2023
94258cc
refact extract
mathias-samuelides Nov 25, 2023
0222f06
minor
mathias-samuelides Nov 25, 2023
db9bf68
fix path in existing code to make it work
mathias-samuelides Nov 25, 2023
abfdb23
remove useless files
mathias-samuelides Nov 25, 2023
5251a46
start feature selection
mathias-samuelides Nov 27, 2023
6bb85a2
first test_feature_selection
mathias-samuelides Nov 27, 2023
0fda4fc
update todo
mathias-samuelides Nov 27, 2023
0357091
minor
mathias-samuelides Nov 27, 2023
29c872e
remove bp
mathias-samuelides Nov 27, 2023
3f6deed
in progress
mathias-samuelides Nov 27, 2023
f5bd0f3
icu feature preprocessing
mathias-samuelides Nov 28, 2023
49a088c
update raw files
mathias-samuelides Nov 28, 2023
b6401a1
non icu features preprocessing
mathias-samuelides Nov 28, 2023
2dcbe51
clean
mathias-samuelides Nov 28, 2023
98199fa
split raw_files
mathias-samuelides Nov 28, 2023
6e79d45
admission inputer file
mathias-samuelides Nov 28, 2023
ec72790
ndc file
mathias-samuelides Nov 28, 2023
f00b154
refact wip
mathias-samuelides Nov 28, 2023
544dafc
rename file
mathias-samuelides Nov 28, 2023
fb65ebc
fix existing model generation icu
mathias-samuelides Nov 29, 2023
68140ad
prediction task
mathias-samuelides Nov 29, 2023
9cb601e
rename rawdataloader
mathias-samuelides Nov 29, 2023
1e1771d
minor
mathias-samuelides Nov 29, 2023
dfaf5a6
preproc files
mathias-samuelides Nov 30, 2023
9ae5d58
minor
mathias-samuelides Nov 30, 2023
6f1be09
minor
mathias-samuelides Nov 30, 2023
3fb5f13
minor
mathias-samuelides Nov 30, 2023
5e28465
refact extractor preprocessing
mathias-samuelides Nov 30, 2023
5cca228
replace path arguments with dataframe
mathias-samuelides Nov 30, 2023
0f06c22
rename ... extractor
mathias-samuelides Nov 30, 2023
409d8a0
review
mathias-samuelides Nov 30, 2023
ecaada1
refact cohort extractor
mathias-samuelides Nov 30, 2023
dcfc65c
refact feature extraction
mathias-samuelides Dec 1, 2023
2f16214
minor
mathias-samuelides Dec 1, 2023
0b39f80
wip
mathias-samuelides Dec 1, 2023
e0a5c98
wip
mathias-samuelides Dec 1, 2023
a69f70d
wip
mathias-samuelides Dec 1, 2023
3dac5da
minor
mathias-samuelides Dec 1, 2023
ef9eb6c
new readme
mathias-samuelides Dec 2, 2023
a9058ec
wip
mathias-samuelides Dec 3, 2023
137934f
wip
mathias-samuelides Dec 4, 2023
ecbe337
feature
mathias-samuelides Dec 4, 2023
1532de5
refact features wip
mathias-samuelides Dec 5, 2023
ac574d9
summary refact
mathias-samuelides Dec 5, 2023
e50a031
remove commented code
mathias-samuelides Dec 5, 2023
ea045de
typo
mathias-samuelides Dec 5, 2023
049a7bd
renaming
mathias-samuelides Dec 5, 2023
7383995
wip
mathias-samuelides Dec 7, 2023
f829005
wip
mathias-samuelides Dec 7, 2023
cdb977e
wip
mathias-samuelides Dec 7, 2023
fb7a740
wip
mathias-samuelides Dec 8, 2023
2e2d8ed
class cohort
mathias-samuelides Dec 8, 2023
5ed6a09
...
mathias-samuelides Dec 8, 2023
6555435
...
mathias-samuelides Dec 8, 2023
56348ba
docstring
mathias-samuelides Dec 8, 2023
9afa74f
.
mathias-samuelides Dec 8, 2023
3b050c5
.
mathias-samuelides Dec 8, 2023
3efe389
minor
mathias-samuelides Dec 8, 2023
0aed766
remove useless code
mathias-samuelides Dec 8, 2023
c5ddc8f
remove hardcoded string
mathias-samuelides Dec 8, 2023
71137b8
.
mathias-samuelides Dec 8, 2023
ad3a114
.
mathias-samuelides Dec 8, 2023
a839fca
test icd converter
mathias-samuelides Dec 8, 2023
7e3c93c
fix
mathias-samuelides Dec 8, 2023
abb194c
.
mathias-samuelides Dec 8, 2023
1c8c680
feature extract_from
mathias-samuelides Dec 8, 2023
1fac5ec
.
mathias-samuelides Dec 8, 2023
c838a90
.
mathias-samuelides Dec 8, 2023
b60b6d6
.
mathias-samuelides Dec 9, 2023
e5b939d
.
mathias-samuelides Dec 9, 2023
08f4973
.
mathias-samuelides Dec 9, 2023
a3ff7bf
.
mathias-samuelides Dec 9, 2023
e39ac67
summarizer
mathias-samuelides Dec 9, 2023
39c70be
.
mathias-samuelides Dec 9, 2023
508ea8c
feature preprocessor
mathias-samuelides Dec 9, 2023
dcf22a7
.
mathias-samuelides Dec 9, 2023
295c41a
generator
mathias-samuelides Dec 10, 2023
275a584
remove cohort fields from feature classes
mathias-samuelides Dec 10, 2023
ce90791
.
mathias-samuelides Dec 10, 2023
7add049
.
mathias-samuelides Dec 10, 2023
c38de8a
gen med
mathias-samuelides Dec 10, 2023
055577a
.
mathias-samuelides Dec 10, 2023
8c89746
.
mathias-samuelides Dec 10, 2023
08a5c32
.
mathias-samuelides Dec 10, 2023
8e045d8
.
mathias-samuelides Dec 10, 2023
01da5f4
empty dict maker
mathias-samuelides Dec 10, 2023
4610bb3
.
mathias-samuelides Dec 10, 2023
17d6b40
temp file
mathias-samuelides Dec 11, 2023
a01dc09
add preproc data to gitignore
mathias-samuelides Dec 11, 2023
2cb620b
debug files and feature name
mathias-samuelides Dec 11, 2023
3c7af19
test with feature name
mathias-samuelides Dec 11, 2023
ca60898
.
mathias-samuelides Dec 11, 2023
8 changes: 7 additions & 1 deletion .gitignore
@@ -2,4 +2,10 @@ mimic-iv-1.0
scrap
*.gzip
*.csv.gz
*summary*.txt
*summary*.txt
venv
__pycache__
raw_data
preproc_data
data
*.csv
95 changes: 95 additions & 0 deletions README copy.md
@@ -0,0 +1,95 @@
# MIMIC-IV
**MIMIC-IV data pipeline** is an end-to-end pipeline that offers a configurable framework to prepare MIMIC-IV data for the downstream tasks.
The pipeline cleans the raw data by removing outliers and allowing users to impute missing entries.
It also provides options for the clinical grouping of medical features using standard coding systems for dimensionality reduction.
All of these options are customizable for the users, allowing them to generate a personalized patient cohort.
The customization steps can be recorded for the reproducibility of the overall framework.
The pipeline produces a smooth time-series dataset by binning the sequential data into equal-length time intervals and allowing for filtering of the time-series length according to the user's preferences.
Besides the data processing modules, our pipeline also includes two additional modules for modeling and evaluation.
For modeling, the pipeline includes several commonly used sequential models for performing prediction tasks.
The evaluation module offers a series of standard methods for evaluating the performance of the created models.
This module also includes options for reporting individual and group fairness measures.

##### Citing MIMIC-IV Data Pipeline:
MIMIC-IV Data Pipeline is available on [ML4H](https://proceedings.mlr.press/v193/gupta22a/gupta22a.pdf).
If you use MIMIC-IV Data Pipeline, we would appreciate citations to the following paper.

```
@InProceedings{gupta2022extensive,
title = {{An Extensive Data Processing Pipeline for MIMIC-IV}},
author = {Gupta, Mehak and Gallamoza, Brennan and Cutrona, Nicolas and Dhakal, Pranjal and Poulain, Raphael and Beheshti, Rahmatollah},
booktitle = {Proceedings of the 2nd Machine Learning for Health symposium},
pages = {311--325},
year = {2022},
volume = {193},
series = {Proceedings of Machine Learning Research},
month = {28 Nov},
publisher = {PMLR},
url = {https://proceedings.mlr.press/v193/gupta22a.html}
}
```

## Table of Contents:
- [Steps to download MIMIC-IV dataset for the pipeline](#Steps-to-download-MIMIC-IV-dataset-for-the-pipeline)
- [Repository Structure](#Repository-Structure)
- [How to use the pipeline?](#How-to-use-the-pipeline)

### Steps to download MIMIC-IV dataset for the pipeline

Go to https://physionet.org/content/mimiciv/1.0/

Follow instructions to get access to MIMIC-IV dataset.

Download the files using your terminal: wget -r -N -c -np --user mehakg --ask-password https://physionet.org/files/mimiciv/1.0/

### Repository Structure

- **mainPipeline.ipynb**
is the main file to interact with the pipeline. It provides step-step by options to extract and pre-process cohorts.
- **./data**
consists of all data files stored during pre-processing
- **./cohort**
consists of files saved during cohort extraction
- **./features**
consist of files containing features data for all selected features.
- **./summary**
consists of summary files for all features.
It also consists of file with list of variables in all features and can be used for feature selection.
- **./dict**
consists of dictionary structured files for all features obtained after time-series representation
- **./output**
consists output files saved after training and testing of model. These files are used during evaluation.
- **./mimic-iv-1.0**
consist of files downloaded from MIMIC-IV website.
- **./saved_models**
consists of models saved during training.
- **./preprocessing**
- **./day_intervals_preproc**
- **day_intervals_cohort.py** file is used to extract samples, labels and demographic data for cohorts.
- **disease_cohort.py** is used to filter samples based on diagnoses codes at time of admission
- **./hosp_module_preproc**
- **feature_selection_hosp.py** is used to extract, clean and summarize selected features for non-ICU data.
- **feature_selection_icu.py** is used to extract, clean and summarize selected features for ICU data.
- **./model**
- **train.py**
consists of code to create batches of data according to batch_size and create, train and test different models.
- **Mimic_model.py**
consist of different model architectures.
- **evaluation.py**
consists of class to perform evaluation of results obtained from models.
This class can be instantiated separated for use as standalone module.
- **fairness.py**
consists of code to perform fairness evaluation.
It can also be used as standalone module.
- **parameters.py**
consists of list of hyperparameters to be defined for model training.
- **callibrate_output**
consists of code to calibrate model output.
It can also be used as standalone module.

### How to use the pipeline?
- After downloading the repo, open **mainPipeline.ipynb**.
- **mainPipeline.ipynb**, contains sequential code blocks to extract, preprocess, model and train MIMIC-IV EHR data.
- Follow each code bloack and read intructions given just before each code block to run code block.
- Follow the exact file paths and filenames given in instructions for each code block to run the pipeline.
- For evaluation module, clear instructions are provided on how to use it as a standalone module.
21 changes: 20 additions & 1 deletion README.md
@@ -90,6 +90,25 @@ Download the files using your terminal: wget -r -N -c -np --user mehakg --ask-pa
### How to use the pipeline?
- After downloading the repo, open **mainPipeline.ipynb**.
- **mainPipeline.ipynb**, contains sequential code blocks to extract, preprocess, model and train MIMIC-IV EHR data.
- Follow each code bloack and read intructions given just before each code block to run code block.
- Follow each code block and read intructions given just before each code block to run code block.
- Follow the exact file paths and filenames given in instructions for each code block to run the pipeline.
- For evaluation module, clear instructions are provided on how to use it as a standalone module.

### Pipeline details

#### Cohort extraction
Options:
- use icu data


#### Feature extraction

#### Feature preprocessing

##### Preprocessing

##### Summary

##### Selection

##### Event Cleaning
9 changes: 9 additions & 0 deletions _old_requirements.txt
@@ -0,0 +1,9 @@
import_ipynb==0.1.3
ipywidgets==7.5.1
Jinja2==2.11.2
matplotlib==3.2.2
numpy==1.18.5
pandas==1.0.5
scikit_learn==1.0.2
torch==1.6.0
tqdm==4.47.0