- Overview
- Dataset
- Data Preprocessing
- Exploratory Data Analysis (EDA)
- Statistical Analysis
- Machine Learning Models
- Confusion Matrix Visualization
- Requirements
- How to Run the Code
- Conclusion
This project analyzes sleep patterns and their correlation with various health metrics using machine learning techniques. By examining the relationship between sleep duration, stress levels, and other health indicators, the project aims to predict the presence of sleep disorders among individuals. This analysis can aid in identifying high-risk populations and informing interventions.
The dataset contains various features related to sleep health and lifestyle choices. Key columns include:
- Gender: Categorical variable indicating gender.
- Occupation: Categorical variable indicating occupation.
- BMI Category: Categorical variable indicating Body Mass Index classification.
- Blood Pressure: Numerical value indicating blood pressure.
- Sleep Disorder: Target variable indicating the presence of a sleep disorder.
- Sleep Duration: Numerical value representing average sleep duration.
- Stress Level: Numerical value indicating daily stress level.
Gender | Occupation | BMI Category | Blood Pressure | Sleep Disorder | Sleep Duration | Stress Level |
---|---|---|---|---|---|---|
Male | Engineer | Normal | 120 | No | 7 | 3 |
Female | Teacher | Overweight | 135 | Yes | 5 | 7 |
- Loading the Data: The dataset is loaded using Pandas, providing a DataFrame structure for easier manipulation.
- Handling Missing Values: A preliminary check is conducted to identify and quantify missing values across all columns, with potential strategies for imputation discussed.
- Encoding Categorical Variables: Categorical variables are converted to numerical codes using `astype('category').cat.codes`, facilitating machine learning model training.
- Feature Scaling: Although not explicitly included, future iterations could benefit from scaling numerical features to improve model convergence.
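The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the in-memory CSV simply mirrors the two sample rows shown earlier, and the list of categorical columns is an assumption based on the column descriptions.

```python
import io

import pandas as pd

# Assumption: a tiny in-memory CSV that mirrors the sample rows above.
# The real project would pass its own dataset file to pd.read_csv.
csv_text = """Gender,Occupation,BMI Category,Blood Pressure,Sleep Disorder,Sleep Duration,Stress Level
Male,Engineer,Normal,120,No,7,3
Female,Teacher,Overweight,135,Yes,5,7
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values in each column.
missing_counts = df.isnull().sum()

# Convert each categorical column to integer codes.
# Codes follow the sorted category order, e.g. Female -> 0, Male -> 1.
categorical_cols = ["Gender", "Occupation", "BMI Category", "Sleep Disorder"]
encoded = df.copy()
for col in categorical_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

print(missing_counts)
print(encoded)
```

Note that `cat.codes` assigns integers by sorted category order, which imposes an arbitrary ordering on nominal columns such as Occupation. Tree-based models are usually tolerant of this, while linear models may benefit from one-hot encoding instead.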
EDA is conducted to gain insights into the dataset:
- Histograms: Visualize the distribution of sleep duration with a Kernel Density Estimate (KDE) overlay.
- Box Plots: Highlight potential outliers in sleep duration.
- Scatter Plots: Examine relationships between sleep duration and stress levels.
- Correlation Heatmap: Analyze correlations among features.
- Pearson Correlation: Measures linear relationships between sleep duration and blood pressure.
- Spearman Correlation: Assesses monotonic relationships.
- A regression plot visualizes the relationship between sleep duration and blood pressure.
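As a rough sketch of the two correlation tests, `scipy.stats` provides `pearsonr` and `spearmanr`. The arrays below are synthetic (an assumption made so the example is self-contained), constructed so that blood pressure decreases as sleep duration increases.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): a noisy negative linear relationship.
rng = np.random.default_rng(42)
sleep_duration = rng.uniform(4, 9, size=200)
blood_pressure = 150 - 4 * sleep_duration + rng.normal(0, 3, size=200)

# Pearson measures the strength of a linear relationship.
pearson_r, pearson_p = stats.pearsonr(sleep_duration, blood_pressure)

# Spearman works on ranks, so it captures any monotonic relationship.
spearman_rho, spearman_p = stats.spearmanr(sleep_duration, blood_pressure)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
```

When the two coefficients diverge noticeably, that often points to a relationship that is monotonic but not linear, or to outliers that influence Pearson more than Spearman.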
Multiple models are employed to predict sleep disorders:
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- AdaBoost Classifier
- Recursive Feature Elimination (RFE): a feature-selection method that wraps an estimator, rather than a standalone classifier.
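A minimal sketch of fitting the listed classifiers with scikit-learn is shown below. The data comes from `make_classification` as a stand-in for the encoded dataset, and the hyperparameters, the split ratio, and the choice of logistic regression as the RFE base estimator are all assumptions rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption) with six features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

# RFE repeatedly fits the wrapped estimator and drops the weakest feature
# until only the requested number of features remains.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_train, y_train)
selected_mask = selector.support_

print(accuracies)
print("Features kept by RFE:", selected_mask)
```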
Models are evaluated using:
- Classification Reports: Detailing precision, recall, F1-score, and support for each class.
- ROC AUC Scores: Provide a single-number summary of how well each model separates the classes.
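Both evaluation outputs are available in `sklearn.metrics`. The sketch below assumes a binary target (as in the Yes/No sample rows) and uses synthetic data with a logistic regression model purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption) standing in for the encoded dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The classification report is computed from hard class predictions.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)

# ROC AUC is computed from the predicted probability of the positive class.
y_score = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score)

print(report)
print(f"ROC AUC: {auc:.3f}")
```

Passing probabilities rather than hard predictions to `roc_auc_score` matters: with 0/1 predictions the ROC curve collapses to a single operating point and the score loses most of its ranking information.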
Grid Search is applied to optimize hyperparameters for the Random Forest Classifier.
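A grid search over a Random Forest might look like the sketch below. The parameter grid, the 3-fold cross-validation, and the `roc_auc` scoring choice are illustrative assumptions; the project's actual grid is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data (assumption) standing in for the encoded dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=2)

# Hypothetical grid: every combination is cross-validated.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

best_params = search.best_params_
print("Best parameters:", best_params)
print(f"Best cross-validated ROC AUC: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all the supplied data with the winning parameters, so it can be used directly for prediction.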
Confusion matrices are generated for each model to visualize the counts of true positives, false positives, true negatives, and false negatives.
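To make the four cells concrete, here is a small worked example with hand-written labels (hypothetical values, not model output from this project). For binary labels, scikit-learn orders the matrix as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = sleep disorder present, 0 = absent.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Flatten the 2x2 matrix into its four counts.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=1, FN=1, TP=4
```

For plotting, the matrix can be passed to `sklearn.metrics.ConfusionMatrixDisplay` or to a seaborn heatmap with `annot=True`.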
To run this project, you will need the following Python libraries:
pandas
numpy
matplotlib
seaborn
scipy
scikit-learn
Install the required packages using:
pip install -r requirements.txt
- Clone the repository to your local machine:
git clone https://github.com/aboodcs/SleepHealthAnalysis
cd SleepHealthAnalysis
This project demonstrates the application of data analysis and machine learning techniques to investigate sleep health patterns. It highlights the importance of sleep duration in relation to health indicators and the effectiveness of various models in predicting sleep disorders. Insights derived from this analysis could inform public health strategies aimed at improving sleep health.