This project predicts medical expenses from insurance data using multiple linear regression (MLR), random forest, K-nearest neighbors (KNN), and an ensemble of the three, and it seeks to identify the key predictors of medical costs. The final ensemble model combines the predictions of the individual models and is validated with Leave One Out Cross Validation (LOOCV) and K-Fold Cross Validation to estimate how well it generalizes to unseen data.
The dataset contains 1,338 records with the following features:
- Age: Continuous, patient age in years.
- BMI: Continuous, body mass index (kg/m²).
- Children: Numeric count (treated as continuous), number of children covered by insurance.
- Charges: Continuous, yearly medical expenses (target variable).
- Gender (Sex): Categorical, `male` or `female`.
- Smoker Status: Categorical, `yes` or `no`.
- Region: Categorical, geographic region (`northeast`, `northwest`, `southeast`, `southwest`).
Data Source:
- Missing values were assessed using `naniar::vis_miss()`; no missing values were found.
- Dummy variables were created for categorical data to facilitate regression and machine learning models.
- Feature scaling was performed for KNN models.
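
The preprocessing steps above can be sketched roughly as follows. This is an illustration, not the project's actual script: the data frame is a small synthetic stand-in, the column names (`sex`, `children`, `charges`, and so on) are assumed, and scaling only the numeric predictors is one possible choice. `naniar::vis_miss(insurance)` would plot the same missingness information that `colSums(is.na())` counts here.

```r
library(fastDummies)

# Synthetic stand-in for the insurance data (assumed column names, made-up values).
set.seed(42)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE),
                    levels = c("female", "male")),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)),
                    levels = c("no", "yes")),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 2000 + 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# 1. Missing values: count NAs per column.
na_counts <- colSums(is.na(insurance))

# 2. Dummy variables: one 0/1 column per non-reference level,
#    dropping the original categorical columns.
encoded <- dummy_cols(
  insurance,
  select_columns          = c("sex", "smoker", "region"),
  remove_first_dummy      = TRUE,
  remove_selected_columns = TRUE
)

# 3. Feature scaling for KNN: centre and scale the numeric predictors
#    so that no single variable dominates the distance computation.
num_cols <- c("age", "bmi", "children")
encoded[num_cols] <- scale(encoded[num_cols])
```

Dropping the first dummy of each variable avoids perfect collinearity in the regression models; tree-based models do not need this, so it is a design choice rather than a requirement.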
- Visualized relationships between variables using histograms, scatter plots, and boxplots.
- Explored the effects of `smoker` and `region` on medical expenses using facet plots.
- Conducted statistical tests to confirm relationships between predictors and charges.
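
A facet plot and two simple tests along these lines might look like the sketch below. The text does not say which statistical tests were used, so the Welch t-test and one-way ANOVA here are assumptions, and the data frame is a synthetic stand-in with assumed column names.

```r
library(ggplot2)

# Synthetic stand-in for the insurance data (assumed column names, made-up values).
set.seed(42)
n <- 200
insurance <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = round(rnorm(n, mean = 30, sd = 6), 1),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
insurance$charges <- 2000 + 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# Facet plot: charges against BMI, one panel per smoker/region combination.
p <- ggplot(insurance, aes(x = bmi, y = charges, colour = smoker)) +
  geom_point(alpha = 0.6) +
  facet_grid(smoker ~ region) +
  labs(x = "BMI", y = "Yearly charges", colour = "Smoker")
print(p)

# Assumed tests: Welch t-test for smoker (two groups),
# one-way ANOVA for region (four groups).
smoker_test <- t.test(charges ~ smoker, data = insurance)
region_test <- summary(aov(charges ~ region, data = insurance))

smoker_test$p.value
region_test
```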
- Constructed both basic and complex MLR models using forward selection.
- Validated using 10-fold cross-validation to evaluate performance.
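
One way to combine forward selection with 10-fold cross-validation is sketched below, using `step()` from base R for the selection and `caret::train()` for the validation. Whether the project used these exact calls is not stated, and the data is again a synthetic stand-in with assumed column names. Note that selecting variables on the full data before cross-validating gives a slightly optimistic error estimate.

```r
library(caret)

# Synthetic stand-in for the insurance data (assumed column names, made-up values).
set.seed(42)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 2000 + 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# Forward selection: start from the intercept-only model and add the term
# that lowers AIC the most until no addition helps.
null_model  <- lm(charges ~ 1, data = insurance)
forward_fit <- step(
  null_model,
  scope     = list(lower = ~ 1,
                   upper = ~ age + sex + bmi + children + smoker + region),
  direction = "forward",
  trace     = 0
)

# 10-fold cross-validation of the selected model.
set.seed(123)
cv_fit <- train(
  formula(forward_fit),
  data      = insurance,
  method    = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

formula(forward_fit)   # terms kept by forward selection
cv_fit$results$RMSE    # cross-validated RMSE
```

A more complex model could be explored by adding interaction terms (for example `bmi:smoker`) to the `upper` scope.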
- Built a random forest model to assess feature importance and predict medical expenses.
- Plotted variable contributions using a bar chart.
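
A minimal sketch of the random forest step, assuming default hyperparameters and permutation importance (`%IncMSE`) as the importance measure; the project's actual settings are not given, and the data is a synthetic stand-in with assumed column names.

```r
library(randomForest)
library(ggplot2)

# Synthetic stand-in for the insurance data (assumed column names, made-up values).
set.seed(42)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 2000 + 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# Fit the forest; importance = TRUE stores permutation importance.
set.seed(123)
rf_fit <- randomForest(charges ~ ., data = insurance,
                       ntree = 500, importance = TRUE)

# type = 1 selects %IncMSE: the increase in out-of-bag error
# when a variable's values are permuted.
imp    <- importance(rf_fit, type = 1)
imp_df <- data.frame(variable = rownames(imp), inc_mse = imp[, 1])

# Bar chart of variable contributions, most important at the top.
p <- ggplot(imp_df, aes(x = reorder(variable, inc_mse), y = inc_mse)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "%IncMSE (permutation importance)")
print(p)

# Out-of-bag RMSE after the final tree.
oob_rmse <- sqrt(tail(rf_fit$mse, 1))
oob_rmse
```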
- Developed both full and reduced KNN models:
  - Full Model: Used all features.
  - Reduced Model: Focused on top predictors (`bmi`, `age`, `smoker status`) identified by random forest.
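
The full and reduced KNN models could be compared as in the sketch below, which uses `FNN::knn.reg()` and a single 80/20 train/test split. The split, the range of `k`, and the synthetic data with assumed column names are illustrative choices, not the project's actual setup.

```r
library(FNN)

# Synthetic stand-in for the insurance data (assumed column names, made-up values).
set.seed(42)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 2000 + 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# Numeric design matrix with dummy-coded factors (intercept column dropped).
x_full <- model.matrix(charges ~ ., data = insurance)[, -1]
y      <- insurance$charges

# 80/20 split; scaling uses training-set statistics only.
set.seed(1)
train_idx <- sample(seq_len(n), size = round(0.8 * n))
centre <- colMeans(x_full[train_idx, ])
spread <- apply(x_full[train_idx, ], 2, sd)
x_full <- scale(x_full, center = centre, scale = spread)

# Reduced model: only the top predictors named in the text.
x_reduced <- x_full[, c("age", "bmi", "smokeryes")]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Test-set RMSE of a KNN regression for a given predictor matrix and k.
knn_rmse <- function(x, k) {
  fit <- knn.reg(train = x[train_idx, , drop = FALSE],
                 test  = x[-train_idx, , drop = FALSE],
                 y     = y[train_idx],
                 k     = k)
  rmse(y[-train_idx], fit$pred)
}

ks           <- 1:15
rmse_full    <- sapply(ks, function(k) knn_rmse(x_full, k))
rmse_reduced <- sapply(ks, function(k) knn_rmse(x_reduced, k))

best_k_full    <- ks[which.min(rmse_full)]
best_k_reduced <- ks[which.min(rmse_reduced)]
```

Because KNN relies on distances, irrelevant predictors add noise to the neighbour search, which is one reason a reduced model can match or beat the full one.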
- Combined predictions from Random Forest, KNN, and MLR models.
- Validated ensemble performance using LOOCV and K-Fold Cross Validation.
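
The text does not say how the three sets of predictions were combined, so the sketch below assumes a simple unweighted average and computes a pooled RMSE over all held-out predictions. The helper `cv_ensemble_rmse()` is hypothetical, `k = 5` and `ntree = 100` are arbitrary, and the data is a synthetic stand-in with assumed column names. LOOCV is handled as the special case where every row is its own fold.

```r
library(randomForest)
library(FNN)

# Synthetic stand-in for the insurance data (assumed column names, made-up values).
set.seed(42)
n <- 120
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 2000 + 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Reduced KNN predictor matrix: age, bmi, and a 0/1 smoker indicator.
knn_matrix <- function(d) {
  cbind(age = d$age, bmi = d$bmi, smoker = as.numeric(d$smoker == "yes"))
}

# Hypothetical helper: cross-validated RMSE of the averaged ensemble.
# `folds` assigns each row of `data` to a fold.
cv_ensemble_rmse <- function(data, folds, k = 5) {
  preds <- numeric(nrow(data))
  for (f in unique(folds)) {
    test_rows <- which(folds == f)
    train <- data[-test_rows, ]
    test  <- data[test_rows, , drop = FALSE]

    lm_fit <- lm(charges ~ ., data = train)
    rf_fit <- randomForest(charges ~ ., data = train, ntree = 100)

    # Scale KNN inputs with training-fold statistics only.
    x_train <- knn_matrix(train)
    centre  <- colMeans(x_train)
    spread  <- apply(x_train, 2, sd)
    knn_fit <- knn.reg(train = scale(x_train, centre, spread),
                       test  = scale(knn_matrix(test), centre, spread),
                       y     = train$charges,
                       k     = k)

    # Assumed combination rule: equal-weight average of the three models.
    preds[test_rows] <- (predict(lm_fit, test) +
                         predict(rf_fit, test) +
                         knn_fit$pred) / 3
  }
  rmse(data$charges, preds)
}

set.seed(123)
kfold_rmse <- cv_ensemble_rmse(insurance, sample(rep(1:10, length.out = n)))
loocv_rmse <- cv_ensemble_rmse(insurance, seq_len(n))  # one row per fold
```

With an equal-weight average, a weak member pulls the ensemble toward its own error, so the combined RMSE can end up worse than that of the best single model.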
- Random Forest:
  - Best model with an RMSE of 4,594.33.
  - Most influential predictors: `age`, `bmi`, and `smoker status`.
- KNN:
  - Full Model: RMSE = 15,294.78 at k=5.
  - Reduced Model: RMSE = 15,241.1 at k=4.
  - Reduced model offers similar performance with fewer predictors, enhancing interpretability.
- Ensemble Model:
  - LOOCV RMSE = 7,736.63.
  - K-Fold Cross Validation RMSE = 6,071.79.
  - Although the ensemble's RMSE is higher than that of Random Forest alone, it balances predictions across the individual models.
The project demonstrates that:
- Random Forest outperforms other individual models in terms of predictive accuracy.
- Ensemble models can achieve robust predictions by combining strengths of multiple algorithms.
- Additional data and stratified sampling could further enhance model reliability and generalizability.
- Incorporate additional variables (e.g., smoking frequency, family medical history).
- Perform stratified sampling to better represent minority groups (e.g., smokers).
- Explore deep learning models to capture nonlinear relationships in the data.
- Enhance ensemble methodologies with weighted averages or stacking.
- Programming Language: R
- Key Libraries: `ggplot2`, `dplyr`, `caret`, `randomForest`, `glmnet`, `naniar`, `FNN` (for KNN), `fastDummies` (for dummy variable creation)
To reproduce the results, follow these steps:
- Install the required libraries:
install.packages(c("ggplot2", "dplyr", "caret", "randomForest", "glmnet", "naniar", "FNN", "fastDummies"))