Project Description

This project applies machine learning techniques to data provided by the Parkinson's Progression Markers Initiative (PPMI), for example *omics data, motor scores, and brain imaging. The primary objectives are to build a predictive model for improved diagnosis and to gain a deeper understanding of the heterogeneity of individuals affected by Parkinson's disease at progressive stages of their disease trajectory, using a variety of machine learning models, including supervised, unsupervised, and neural network models.

Author: Zainab Nazari

EBRI – European Brain Research Institute Rita Levi-Montalcini | MHPC - Master in High Performance Computing

- Excluding Patients

We keep only individuals with a diagnosis of Healthy Control or Parkinson's Disease.
We remove patients who carry mutations in the SNCA, GBA, or LRRK2 genes, and patients taking dopaminergic drugs.

- RNA Sequencing /

Preprocessing Part I

We remove duplicated gene IDs, namely the Ensembl genes carrying the suffix _PAR_Y and their X transcripts.
We keep only genes that appear either in the list of 19393 protein-coding genes or in the list of 5874 long intergenic non-coding RNAs (lincRNAs), both obtained from the official HGNC repository (date: 31-Jan-2024).
We filter out genes with low expression, retaining only genes with more than five counts in at least 10% of the individuals.
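
The low-expression filter can be illustrated with a minimal sketch. It is not the code used in this repository: the function name `filter_genes`, the dictionary layout, and the toy gene IDs are assumptions made for illustration only.

```python
def filter_genes(counts, min_count=5, min_fraction=0.10):
    """Keep genes with more than `min_count` reads in at least
    `min_fraction` of the individuals.

    `counts` maps a gene ID to a list of raw counts, one per individual
    (a hypothetical layout chosen for this sketch).
    """
    kept = {}
    for gene_id, values in counts.items():
        # Number of individuals in which the gene exceeds the count threshold.
        n_expressed = sum(1 for v in values if v > min_count)
        if n_expressed >= min_fraction * len(values):
            kept[gene_id] = values
    return kept


# Toy data: 10 individuals, made-up gene IDs.
toy_counts = {
    "GENE_A": [0, 0, 0, 0, 0, 0, 0, 0, 0, 12],  # above 5 in 1 of 10 individuals
    "GENE_B": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],   # never strictly above 5
    "GENE_C": [30, 40, 0, 0, 0, 0, 0, 0, 0, 0], # above 5 in 2 of 10 individuals
}
filtered = filter_genes(toy_counts)
print(sorted(filtered))
```

Note that "more than five counts" is read here as a strict inequality, so a gene with exactly five counts everywhere is dropped.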

Preprocessing Part II

We create factors for diagnosis, sex, clinical center, and RIN from batch factor information.
We perform differential gene expression analysis using the limma package.
We normalize factors, compute log2 counts per million, and create a design matrix with sex correction.
We filter and normalize gene expression data.
We remove batch effects using clinical center, sex, and RIN as covariates.

The preprocessing code can be found in preprocessing_part2.R. I am grateful to Ivan Arisi for sharing valuable information with me regarding this step.
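
The "log2 counts per million" step can be shown as plain arithmetic. The sketch below is in Python rather than R and is not taken from preprocessing_part2.R. The function name `log2_cpm` and the offsets (0.5 added to each count, 1 added to the library size) are assumptions used here only to avoid taking the logarithm of zero.

```python
import math


def log2_cpm(counts, prior_count=0.5):
    """Return log2 counts-per-million for one sample.

    `counts` is a list of raw read counts, one per gene.
    `prior_count` is a small offset (an assumption of this sketch)
    so that genes with zero reads still get a finite value.
    """
    library_size = sum(counts)
    return [
        math.log2((c + prior_count) / (library_size + 1.0) * 1e6)
        for c in counts
    ]


sample = [10, 90, 0, 900]   # toy counts for four genes
values = log2_cpm(sample)
for count, value in zip(sample, values):
    print(f"count={count:4d}  log2CPM={value:.2f}")
```

Scaling by library size makes samples sequenced at different depths comparable before the design matrix and the batch covariates are applied.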

ML AdaBoost

The code performs machine learning analysis using the AdaBoost algorithm on the RNA-Seq dataset and evaluates the performance of multiple models across multiple trials.
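
As a rough illustration of evaluating AdaBoost across repeated trials, the sketch below uses scikit-learn on synthetic data. It is not the notebook's code: the synthetic matrix stands in for the expression table, and the helper `run_trials`, the number of trials, and the hyperparameters are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an expression matrix (samples x genes).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=8, random_state=0
)


def run_trials(X, y, n_trials=5, test_size=0.25):
    """Train and score AdaBoost on `n_trials` different random splits."""
    aucs = []
    for seed in range(n_trials):
        # A new stratified train/test split for each trial.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = AdaBoostClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        # Probability of the positive class, used for the ROC AUC.
        scores = model.predict_proba(X_te)[:, 1]
        aucs.append(roc_auc_score(y_te, scores))
    return aucs


aucs = run_trials(X, y)
print(f"mean AUC over {len(aucs)} trials: {sum(aucs) / len(aucs):.3f}")
```

Repeating the split with different seeds gives a spread of scores rather than a single number, which matters when the cohort is small.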

ML with best 148 genes and using XGBoost and CatBoost:

CatBoost clearly outperforms XGBoost in terms of AUC computed with cross-validation.
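
For readers unfamiliar with the metric, the sketch below shows how an AUC can be computed for each fold and then averaged. It uses the pairwise-ranking definition of AUC in plain Python. The function names and the toy fold data are hypothetical and are not results from this project.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_cv_auc(fold_results):
    """Average the AUC over folds; each fold is a (labels, scores) pair."""
    return sum(roc_auc(l, s) for l, s in fold_results) / len(fold_results)


# Two made-up folds of held-out predictions (1 = PD, 0 = control).
folds = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    ([1, 0, 1, 0], [0.8, 0.2, 0.7, 0.3]),
]
print(mean_cv_auc(folds))
```

Comparing two models this way is only fair if both are scored on the same folds.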

RNA-Seq data from PDBP with diagnosis

We add the table in which we extract the RNA-Seq data of the PDBP cohort from the AMP-PD cohort. We keep only individuals whose parents do not have PD, so as to retain more data for analysis.

- Proteomics/

In the file "proteomic-table.ipynb" you can find the code that builds the table of CSF proteomic genes together with the patients and their diagnoses.

In the file "proteomic-ML.ipynb" you can find the code for a predictive model using XGBoost for the diagnosis of PD vs. Control.

- Motor Score

In the file "motor_score.ipynb" you can find an ML test for the UPDRS total score.

- UPSIT

University of Pennsylvania Smell Identification Test (UPSIT): the analysis for the PPMI cohort is in ppmi_UPSIT.ipynb and the analysis for the PDBP cohort is in pdbp_UPSIT.ipynb.

- Plots

Distributions of participants' diagnoses and ages across different visits can be found in the file plots.ipynb.

- External_Data/

Some external data that is needed for this study.

- Draft/

Please ignore it.

Installation

In the file conda_list.txt you can find all the packages installed using conda.

Contact

If you have any questions or suggestions, or want to contribute, feel free to contact me: [email protected]

Acknowledgement

I am grateful to Ivan Arisi for sharing valuable information with me regarding this project, particularly for Preprocessing Part II and the machine learning algorithm with AdaBoost.

Last update : 2024-05-22
