Ensembling Methodologies to Predict Medical Expenses Among Smokers and Non-Smokers

Authors: Joel Laskow, Oluwadamilola Owolabi, and Simi Augustine

Date: February 2, 2024

Overview

This project predicts yearly medical expenses from insurance data and seeks to identify the key predictors of medical costs. It compares multiple linear regression (MLR), random forest, and K-nearest neighbors (KNN) models, then combines them into an ensemble. The ensemble is validated with leave-one-out cross-validation (LOOCV) and K-fold cross-validation to estimate how well its predictions generalize.

Dataset

The dataset contains 1,338 records with the following features:

Age: Continuous, patient age in years.
BMI: Continuous, body mass index (kg/m²).
Children: Discrete count, number of children covered by insurance.
Charges: Continuous, yearly medical expenses (target variable).
Gender (Sex): Categorical, male or female.
Smoker Status: Categorical, yes or no.
Region: Categorical, geographic region (northeast, northwest, southeast, southwest).

Data Source:

Key Features

1. Data Preprocessing

Missing values were assessed using naniar::vis_miss(); no missing values were found.
Dummy variables were created for categorical data to facilitate regression and machine learning models.
Feature scaling was performed for KNN models.
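
As a rough illustration of the preprocessing steps above, the sketch below checks for missing values, expands categorical columns into dummy variables, and standardizes the result. It is not the project's code. The data frame `toy` is simulated, the column names (`age`, `bmi`, `children`, `sex`, `smoker`, `region`, `charges`) are assumed, and base R's `is.na()`, `model.matrix()`, and `scale()` stand in for the naniar and fastDummies calls so the block runs without extra packages.

```r
# Simulated stand-in for the insurance data (assumed column names, made-up values).
set.seed(1)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 3000)

# Count missing values per column (a numeric counterpart to a visual check).
n_missing <- colSums(is.na(toy))

# Expand factors into 0/1 dummy columns. model.matrix() drops one reference
# level per factor; the intercept column is removed with [, -1].
X <- model.matrix(charges ~ ., data = toy)[, -1]

# Standardize every predictor to mean 0 and standard deviation 1, so that
# distance-based models such as KNN are not dominated by large-scale columns.
X_scaled <- scale(X)

print(n_missing)
print(colnames(X))
```

Dropping a reference level per factor avoids perfectly collinear columns in regression; for KNN alone, keeping all levels would also be a defensible choice.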

2. Exploratory Data Analysis (EDA)

Visualized relationships between variables using histograms, scatter plots, and boxplots.
Explored the effects of smoker and region on medical expenses using facet plots.
Conducted statistical tests to confirm relationships between predictors and charges.

3. Modeling Methodologies

a. Multiple Linear Regression (MLR)

Constructed both basic and complex MLR models using forward selection.
Validated using 10-fold cross-validation to evaluate performance.
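
The forward selection and 10-fold cross-validation described above can be sketched in base R as follows. This is an illustration under assumptions, not the project's script: `toy` is simulated data with assumed column names, and AIC (the default criterion of `step()`) is assumed as the selection rule because the README does not name one.

```r
# Simulated stand-in for the insurance data (assumed column names, made-up values).
set.seed(2)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 3000)

# Forward selection: start from the intercept-only model and add one term at
# a time while AIC improves.
null_fit <- lm(charges ~ 1, data = toy)
fwd_fit <- step(
  null_fit,
  scope = list(lower = ~ 1,
               upper = ~ age + bmi + children + sex + smoker + region),
  direction = "forward",
  trace = 0
)

# 10-fold cross-validation of the selected formula.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row
sq_err <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fit <- lm(formula(fwd_fit), data = toy[!test, ])
  sq_err[test] <- (toy$charges[test] - predict(fit, newdata = toy[test, ]))^2
}
cv_rmse <- sqrt(mean(sq_err))

print(formula(fwd_fit))
print(cv_rmse)
```

Because the terms are selected once on the full data, this cross-validated RMSE is somewhat optimistic; repeating the selection inside each fold would give a stricter estimate.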

b. Random Forest

Built a random forest model to assess feature importance and predict medical expenses.
Plotted variable contributions using a bar chart.
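
A minimal sketch of the random forest step, using the randomForest package that the README lists. The simulated `toy` data, its column names, the choice of `ntree = 200`, and the use of permutation importance (`type = 1`) are all assumptions for illustration; the README does not state which settings or importance measure the project used.

```r
library(randomForest)

# Simulated stand-in for the insurance data (assumed column names, made-up values).
set.seed(3)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 3000)

# Fit a regression forest; importance = TRUE stores permutation importance.
rf_fit <- randomForest(charges ~ ., data = toy, ntree = 200, importance = TRUE)

# Permutation importance (%IncMSE): how much out-of-bag error grows when a
# predictor's values are shuffled.
imp <- importance(rf_fit, type = 1)
imp_sorted <- sort(imp[, 1], decreasing = TRUE)

# Bar chart of variable contributions.
barplot(imp_sorted, las = 2, ylab = "% increase in MSE when permuted")

# Out-of-bag RMSE: each row is predicted only by trees that did not see it.
oob_rmse <- sqrt(mean((toy$charges - rf_fit$predicted)^2))
print(imp_sorted)
print(oob_rmse)
```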

c. K-Nearest Neighbors (KNN)

Developed both full and reduced KNN models:
- Full Model: Used all features.
- Reduced Model: Focused on top predictors (bmi, age, smoker status) identified by random forest.
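
The comparison of full and reduced KNN models could look roughly like the sketch below. The README lists FNN for KNN; here a small base R helper, `knn_predict()`, is a hypothetical stand-in so the block runs without extra packages. The simulated `toy` data, the single train/test split, and the search over k = 1 to 10 are likewise assumptions, not the project's setup.

```r
# Simulated stand-in for the insurance data (assumed column names, made-up values).
set.seed(4)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 3000)

# Hypothetical helper: KNN regression with Euclidean distance. Each test row
# is predicted by the mean response of its k nearest training rows.
knn_predict <- function(train_x, train_y, test_x, k) {
  tx <- t(train_x)  # predictors in rows, training observations in columns
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((tx - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

X <- model.matrix(charges ~ ., data = toy)[, -1]  # dummy-coded predictors
y <- toy$charges
train_idx <- sample(n, 150)

# Hold-out RMSE for a given set of columns and a given k. Scaling uses
# training-set means and standard deviations only.
eval_knn <- function(cols, k) {
  tr <- X[train_idx, cols, drop = FALSE]
  te <- X[-train_idx, cols, drop = FALSE]
  ctr <- colMeans(tr)
  sds <- apply(tr, 2, sd)
  pred <- knn_predict(scale(tr, center = ctr, scale = sds), y[train_idx],
                      scale(te, center = ctr, scale = sds), k)
  sqrt(mean((y[-train_idx] - pred)^2))
}

ks <- 1:10
full_rmse    <- sapply(ks, function(k) eval_knn(colnames(X), k))
reduced_rmse <- sapply(ks, function(k) eval_knn(c("age", "bmi", "smokeryes"), k))

print(c(full_k = ks[which.min(full_rmse)], full_rmse = min(full_rmse)))
print(c(reduced_k = ks[which.min(reduced_rmse)], reduced_rmse = min(reduced_rmse)))
```

Scaling with training-set statistics only keeps information from the held-out rows out of the distance computation.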

d. Ensemble Modeling

Combined predictions from Random Forest, KNN, and MLR models.
Validated ensemble performance using LOOCV and K-Fold Cross Validation.
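
One simple way to combine models and validate the result is sketched below. The README does not say how the predictions were combined, so an unweighted average is assumed. To keep the block free of package dependencies, only MLR and a hypothetical base R `knn_predict()` helper are averaged; random forest predictions would enter as a third term in the same way. The `toy` data is simulated and `k = 5` is an arbitrary choice.

```r
# Simulated stand-in for the insurance data (assumed column names, made-up values).
set.seed(5)
n <- 120
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 3000)
X <- model.matrix(charges ~ ., data = toy)[, -1]  # dummy-coded predictors

# Hypothetical helper: KNN regression with Euclidean distance.
knn_predict <- function(train_x, train_y, test_x, k) {
  tx <- t(train_x)
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((tx - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Cross-validated RMSE of the averaged prediction for any fold assignment.
# Both models are refit on the training part of every fold.
cv_rmse <- function(fold) {
  sq_err <- numeric(n)
  for (f in unique(fold)) {
    test <- fold == f
    p_lm <- predict(lm(charges ~ ., data = toy[!test, ]), newdata = toy[test, ])
    tr <- X[!test, , drop = FALSE]
    te <- X[test, , drop = FALSE]
    ctr <- colMeans(tr)
    sds <- apply(tr, 2, sd)
    p_knn <- knn_predict(scale(tr, center = ctr, scale = sds), toy$charges[!test],
                         scale(te, center = ctr, scale = sds), k = 5)
    ensemble <- (p_lm + p_knn) / 2  # unweighted average (assumed combination rule)
    sq_err[test] <- (toy$charges[test] - ensemble)^2
  }
  sqrt(mean(sq_err))
}

loocv_rmse <- cv_rmse(seq_len(n))                        # one row per fold
kfold_rmse <- cv_rmse(sample(rep(1:10, length.out = n))) # 10 random folds
print(c(LOOCV = loocv_rmse, KFold = kfold_rmse))
```

LOOCV is the special case of K-fold where every row forms its own fold, which is why one function can serve both schemes.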

Results

Random Forest:
- Best model with an RMSE of 4,594.33.
- Most influential predictors: age, bmi, and smoker status.
KNN:
- Full Model: RMSE = 15,294.78 at k=5.
- Reduced Model: RMSE = 15,241.1 at k=4.
- Reduced model offers similar performance with fewer predictors, enhancing interpretability.
Ensemble Model:
- LOOCV RMSE = 7,736.63.
- K-Fold Cross Validation RMSE = 6,071.79.
- The ensemble's RMSE is higher than that of Random Forest alone but lower than that of either KNN model, so its predictions fall between those of its strongest and weakest component models.

Conclusion

The project demonstrates that:

Random Forest outperforms other individual models in terms of predictive accuracy.
Ensemble models can achieve robust predictions by combining strengths of multiple algorithms.
Additional data and stratified sampling could further enhance model reliability and generalizability.

Future Work

Incorporate additional variables (e.g., smoking frequency, family medical history).
Perform stratified sampling to better represent minority groups (e.g., smokers).
Explore deep learning models to capture nonlinear relationships in the data.
Enhance ensemble methodologies with weighted averages or stacking.

Tools & Libraries

Programming Language: R
Key Libraries:
- ggplot2, dplyr, caret, randomForest, glmnet, naniar
- FNN (for KNN), fastDummies (for dummy variable creation)

Code Execution

To reproduce the results, follow these steps:

Install the required libraries:

install.packages(c("ggplot2", "dplyr", "caret", "randomForest", "glmnet", "naniar", "FNN", "fastDummies"))

Repository: DamilolaOwolabi/DS-6372-PROJECT-1