This project predicts the number of dislikes on YouTube videos using three regression techniques: Linear Regression, Lasso, and Ridge Regression. The workflow covers Exploratory Data Analysis (EDA), feature engineering, encoding, cross-validation, model evaluation, and residual analysis.
- Overview
- Kaggle Notebook
- Data Description
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Encoding
- Modeling
- Cross-Validation
- Residual Analysis
- Conclusion
You can find the complete notebook for this project on Kaggle here.
The dataset includes various features of YouTube videos, such as upload date, uploader information, view count, like count, and more. The target variable is `dislike_count`.
During the EDA phase, we checked for missing values and explored relationships between features and the target variable. Key insights from EDA helped guide feature engineering and model selection.
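As a rough illustration of the checks described above, the sketch below counts missing values per column and ranks features by their correlation with the target. The data frame is a tiny synthetic stand-in, and the column names `view_count` and `like_count` are assumptions (the source only mentions "view count" and "like count"); only `dislike_count` is named in this project.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the dataset; all values are made up.
# Column names other than dislike_count are assumed, not taken from the notebook.
df = pd.DataFrame({
    "view_count": [1000, 25000, 300, 120000, 5400],
    "like_count": [50, 900, np.nan, 4100, 210],
    "dislike_count": [2, 40, 1, 260, 9],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Pearson correlation of each feature with the target, strongest first.
target_corr = df.corr()["dislike_count"].drop("dislike_count")
print(target_corr.sort_values(ascending=False))
```

On the real data, the same two calls would show which columns need imputation or removal and which features are most linearly related to the target.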
Feature engineering involved several steps:
- Datetime Conversion: Converted `upload_date` to datetime format and calculated the age of the video in days.
- Log Transformation: Applied log transformation to `uploader_sub_count` to handle skewness.
- Dropping Irrelevant Features: Removed features such as `upload_date`, `title`, and `description`.
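The feature engineering steps above might look roughly like the following sketch. The sample rows are synthetic, and several details are assumptions rather than the notebook's actual choices: the new column name `video_age_days`, the fixed reference date used to compute age, and the use of `np.log1p` (which keeps a zero subscriber count well defined) for the log transformation.

```python
import numpy as np
import pandas as pd

# Synthetic rows; values are made up for illustration.
df = pd.DataFrame({
    "upload_date": ["2021-01-01", "2021-06-15", "2021-11-30"],
    "uploader_sub_count": [0, 1500, 2_000_000],
    "title": ["a", "b", "c"],
    "description": ["x", "y", "z"],
    "dislike_count": [3, 12, 480],
})

# Assumption: video age is measured against a fixed reference date.
reference_date = pd.Timestamp("2021-12-31")

# Datetime conversion, then age of the video in days.
df["upload_date"] = pd.to_datetime(df["upload_date"])
df["video_age_days"] = (reference_date - df["upload_date"]).dt.days

# Log transformation to reduce right skew; log1p(x) = log(1 + x) is an assumed variant.
df["uploader_sub_count"] = np.log1p(df["uploader_sub_count"])

# Drop columns that are no longer needed as model inputs.
df = df.drop(columns=["upload_date", "title", "description"])
print(df)
```

Dropping `upload_date` only after deriving the age column keeps the time information in a numeric form that a regression model can use.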
Categorical variables were encoded using one-hot encoding to convert them into a format suitable for regression models. The variables encoded include:
- `has_subtitles`
- `is_comments_enabled`
- `is_ads_enabled`
- `is_live_content`
- `is_age_limit`
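One way to one-hot encode these flags is `pandas.get_dummies`, sketched below on synthetic rows. Whether the notebook used this function is not stated, so treat it as an assumption. The `drop_first=True` option is also an assumption: it keeps one indicator per flag, which avoids perfectly collinear columns in a linear model with an intercept. The `view_count` column is included only to show that other columns pass through unchanged.

```python
import pandas as pd

# Synthetic rows; the flag values are made up.
df = pd.DataFrame({
    "has_subtitles": [True, False, True],
    "is_comments_enabled": [True, True, False],
    "is_ads_enabled": [False, True, True],
    "is_live_content": [False, False, True],
    "is_age_limit": [False, True, False],
    "view_count": [1000, 25000, 300],  # assumed column name
})

flag_columns = [
    "has_subtitles",
    "is_comments_enabled",
    "is_ads_enabled",
    "is_live_content",
    "is_age_limit",
]

# One-hot encode the flags; drop_first keeps a single 0/1 column per flag.
encoded = pd.get_dummies(df, columns=flag_columns, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Each flag becomes a column such as `has_subtitles_True`, holding 1 where the flag is set and 0 otherwise.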
A plain linear regression model (ordinary least squares, no regularization) was trained on the processed features. Cross-validation was performed to evaluate the model's performance.
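A minimal sketch of cross-validating a linear regression with scikit-learn follows. The feature matrix and target are randomly generated stand-ins, and the choice of 5 shuffled folds is an assumption; the source does not state the fold count. RMSE is used for scoring because it is the metric this project reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the processed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=200)

# Assumption: 5-fold cross-validation with shuffling.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

Reporting the per-fold values alongside the mean gives a sense of how stable the error estimate is across splits.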
Lasso regression was used to introduce regularization to the model, helping to prevent overfitting. Grid search was used to find the best alpha value for the Lasso model.
Ridge regression, another regularization technique, was also employed. Grid search was used to find the optimal alpha value for the Ridge model.
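The alpha search for both regularized models can be sketched with scikit-learn's `GridSearchCV`, as below. The data is synthetic, and the alpha grid, the 5-fold setting, and the `max_iter` value are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the processed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(scale=1.0, size=200)

# Assumed candidate values for the regularization strength.
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
param_grid = {"alpha": alphas}

best = {}
for name, model in [("lasso", Lasso(max_iter=10_000)), ("ridge", Ridge())]:
    search = GridSearchCV(
        model, param_grid, cv=5, scoring="neg_root_mean_squared_error"
    )
    search.fit(X, y)
    # Store the selected alpha and its cross-validated RMSE (sign flipped back).
    best[name] = (search.best_params_["alpha"], -search.best_score_)

for name, (alpha, rmse) in best.items():
    print(f"{name}: best alpha={alpha}, CV RMSE={rmse:.3f}")
```

Because Lasso and Ridge penalize coefficient size, their results depend on feature scale; standardizing features before the search is a common precaution, though the source does not say whether it was done here.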
Cross-validation was used to evaluate the performance of each model. The Root Mean Squared Error (RMSE) was the primary evaluation metric.
Residual analysis was conducted to assess the fit of the linear regression model. A scatter plot of residuals versus fitted values indicated potential issues with model fit.
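To illustrate what a residuals-versus-fitted check can reveal, the sketch below fits a straight line to deliberately curved synthetic data and compares the mean residual across the low, middle, and high thirds of the fitted values. The data and the three-way split are assumptions made for the example; they do not reproduce the project's actual residuals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a curved relationship, so a straight line misfits.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = 0.5 * x**2 + rng.normal(scale=1.0, size=300)

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# For a well-specified model, residuals stay centered on zero across the
# whole range of fitted values. Compare the mean residual in three bands.
order = np.argsort(fitted)
low, mid, high = np.array_split(residuals[order], 3)
print("mean residual (low, mid, high):", low.mean(), mid.mean(), high.mean())
```

Here the residuals are positive at both ends and negative in the middle, the curved pattern that signals a linear model is missing structure. The same `fitted` and `residuals` arrays can be passed to a scatter plot, for example matplotlib's `plt.scatter(fitted, residuals)`, to view the pattern directly.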
The project explored multiple regression techniques to predict YouTube dislikes. Linear regression, Lasso, and Ridge regression models were evaluated, with each model showing similar performance in terms of RMSE. Residual analysis suggested that linear regression might not be the best model for this task, indicating the need for further exploration of more advanced models or feature engineering techniques.