This project focuses on developing a predictive model to diagnose diabetes in Pima Indians using machine learning techniques. The dataset includes various medical predictor variables and a target variable indicating diabetes presence.
- Features: Includes medical metrics such as the number of pregnancies, BMI, insulin levels, age, and more.
- Target: Diabetes outcome (0 = No, 1 = Yes).
Exploratory Data Analysis (EDA)
- Conducted comprehensive data cleansing.
- Performed detailed analysis of each feature to ensure data quality and integrity.
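The cleansing steps are not spelled out here. As a minimal sketch of one common step for this kind of data, the snippet below treats zeros in selected measurement columns as missing values and fills them with the column median. The column names follow the public Pima Indians Diabetes CSV and are assumptions, as are the choice of columns and the median imputation; the project's actual cleansing may differ.

```python
import numpy as np
import pandas as pd

# Assumed column names (public Pima Indians Diabetes CSV); adjust to your data.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def clean_zeros(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.DataFrame:
    """Replace zeros with NaN in the given columns, then impute the column median."""
    cleaned = df.copy()
    # Only touch columns that are actually present in the frame.
    cols = [c for c in columns if c in cleaned.columns]
    cleaned[cols] = cleaned[cols].replace(0, np.nan)
    # median() skips NaN, so the fill value comes from observed values only.
    cleaned[cols] = cleaned[cols].fillna(cleaned[cols].median())
    return cleaned


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame(
    {
        "Glucose": [148.0, 0.0, 183.0, 89.0],
        "Insulin": [0.0, 94.0, 0.0, 168.0],
        "BMI": [33.6, 26.6, 0.0, 28.1],
        "Outcome": [1, 0, 1, 0],
    }
)
cleaned_sample = clean_zeros(sample)
print(cleaned_sample)
```

The `Outcome` column is left untouched, since a zero there is a valid label rather than a missing measurement.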
Feature Engineering
- Created new features and assessed their impact using SHAP (Shapley Additive Explanations).
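The engineered features are not listed in this README. As a hypothetical illustration, the sketch below builds an age and insulin interaction term and a low-insulin flag. The feature names, the product form of the interaction, and the threshold value are all assumptions made for the example, not the project's definitions.

```python
import pandas as pd

# Hypothetical cut-off for "low" insulin; not taken from the project.
LOW_INSULIN_THRESHOLD = 50.0


def add_engineered_features(
    df: pd.DataFrame, low_insulin_threshold: float = LOW_INSULIN_THRESHOLD
) -> pd.DataFrame:
    """Return a copy of df with two illustrative engineered columns added."""
    out = df.copy()
    # Interaction term: a simple product of the two raw features (assumed form).
    out["Age_x_Insulin"] = out["Age"] * out["Insulin"]
    # Binary indicator for insulin values below the assumed threshold.
    out["LowInsulin"] = (out["Insulin"] < low_insulin_threshold).astype(int)
    return out


# Made-up rows, for illustration only.
raw = pd.DataFrame({"Age": [50, 31, 22], "Insulin": [30.0, 94.0, 168.0]})
engineered = add_engineered_features(raw)
print(engineered)
```

Columns built this way can then be passed to the model together with the raw features, so that an attribution method such as SHAP can score them alongside the originals.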
Synthetic Minority Oversampling Technique (SMOTE)
- Applied SMOTE to address class imbalance by generating synthetic samples for the minority class, with the aim of improving model performance on that class.
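To show what "generating synthetic samples" means, here is a small NumPy sketch of the core SMOTE idea: pick a minority sample, pick one of its nearest minority neighbours, and interpolate a new point between them. This is a simplified illustration, not the implementation used in this project (which is not named here); in practice a library implementation such as the `SMOTE` class in imbalanced-learn is the usual choice. The function name and parameters are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # For each synthetic point: a random base sample and one of its k neighbours.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment between the two points.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Made-up minority-class points, for illustration only.
X_min_demo = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_sketch(X_min_demo, n_new=6, k=2)
print(synthetic.shape)
```

Oversampling is normally applied to the training split only, so that the test set still reflects the original class distribution.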
Model Implementation
- Random Forest: Achieved 93% accuracy. Feature importance was analyzed using SHAP, revealing that low insulin levels and the interaction between age and insulin are highly influential.
- Deep Learning Model: Evaluated as an alternative; its performance was comparable to that of the Random Forest model.
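As a rough sketch of the Random Forest step, the snippet below trains a scikit-learn `RandomForestClassifier` and reports test accuracy. It uses synthetic data from `make_classification` as a stand-in for the real dataset, and the hyperparameters are assumptions, so the numbers it prints are unrelated to the 93% reported above. The impurity-based `feature_importances_` attribute is shown as a simple substitute for the SHAP analysis, which is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 8 features and a moderately imbalanced binary target.
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=5,
    weights=[0.65, 0.35],
    random_state=42,
)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumed hyperparameters; the project's settings are not documented here.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
importances = model.feature_importances_

print(f"Test accuracy: {test_accuracy:.3f}")
for idx, score in enumerate(importances):
    print(f"feature_{idx}: {score:.3f}")
```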
Deployment
- Model Selection: Random Forest was chosen for deployment due to its interpretability and efficiency.
- Streamlit Application: Developed for easy prediction, with integrated preprocessing and feature engineering.
- Influential Features: Low insulin values and the interaction between age and insulin significantly affect predictions.
- Less Impactful Features: Features such as blood pressure and number of pregnancies have minimal effect on predictions and could be omitted to reduce computational cost.
To install the necessary dependencies, run:

    pip install -r requirements.txt
Run the Streamlit Application

    streamlit run app.py
Input Data
- Prepare a pandas DataFrame with the same structure as the training dataset.
- The Streamlit app will handle preprocessing and feature engineering, then provide predictions.
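For illustration, a single-row input frame might be built as follows. The column names are those of the public Pima Indians Diabetes CSV and are an assumption; check them against the training data used in this repository. The values are made up.

```python
import pandas as pd

# Assumed feature columns (public Pima Indians Diabetes CSV, without the target).
FEATURE_COLUMNS = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
]

# One made-up patient record, keyed by column name.
record = {
    "Pregnancies": 2,
    "Glucose": 120,
    "BloodPressure": 70,
    "SkinThickness": 25,
    "Insulin": 80,
    "BMI": 28.5,
    "DiabetesPedigreeFunction": 0.45,
    "Age": 35,
}

# Passing columns= fixes the column order regardless of the dict's key order.
input_df = pd.DataFrame([record], columns=FEATURE_COLUMNS)
print(input_df)
```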
Contributions are welcome! If you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.
This project is licensed under the MIT License; see the LICENSE file for details.