Added XGBoost tutorial #7

Merged: 2 commits, Aug 14, 2024
@@ -0,0 +1,349 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# system packages\n",
"from datetime import datetime, date, timedelta\n",
"import pickle\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"import platform\n",
"import time\n",
"from tqdm import tqdm\n",
"import os\n",
"import boto3\n",
"from botocore.client import Config\n",
"from botocore import UNSIGNED\n",
"\n",
"# basic packages\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.dates as mdates\n",
"from matplotlib.patches import Patch\n",
"import math\n",
"from evaluation_table import EvalTable\n",
"\n",
"# model packages\n",
"import xgboost as xgb\n",
"from sklearn.model_selection import GridSearchCV, train_test_split, RepeatedKFold, cross_val_score\n",
"from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"import joblib\n",
"from shapely.geometry import Point\n",
"import geopandas as gpd\n",
"import pyproj\n",
"\n",
"# Identify the path\n",
"home = os.getcwd()\n",
"parent_path = os.path.dirname(home)\n",
"input_path = f'{parent_path}/02.input/'\n",
"output_path = f'{parent_path}/03.output/'\n",
"main_path = home"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the train and test dataset\n",
"data_train = pd.read_pickle(f\"{output_path}train_dataset.pkl\")\n",
"data_test = pd.read_pickle(f\"{output_path}test_dataset.pkl\")\n",
"station_list = data_train.station_id.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.2. Scaling the Data\n",
"Generally, scaling the inputs is not required in decision-tree ensemble models. However, some studies suggest scaling the inputs since XGBoost uses the Gradient Decent algorithm in its core optimization. So here we will try both \n",
"scaled and unscaled inputs to see the difference.\n",
"We will scale the data by using the *MinMaxScaler()* function from the Scikit-Learn library. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Function for scaling the data. \n",
"def input_scale(x_train, y_train):\n",
"\n",
" scaler_x = MinMaxScaler()\n",
" scaler_y = MinMaxScaler()\n",
" x_train_scaled, y_train_scaled = \\\n",
" scaler_x.fit_transform(x_train), scaler_y.fit_transform(y_train.values.reshape(-1, 1)).reshape(-1)\n",
" joblib.dump(scaler_x, f'{output_path}scaler_x.joblib')\n",
" joblib.dump(scaler_y, f'{output_path}scaler_y.joblib')\n",
" return x_train_scaled, y_train_scaled, scaler_x, scaler_y\n"
]
},
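{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the target is scaled as well, anything the model predicts will be in the scaled space. Below is a minimal sketch (not part of the original pipeline) of undoing that scaling with the fitted *scaler_y*; the names *inverse_scale_predictions* and *y_pred_scaled* are illustrative placeholders."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: convert scaled predictions back to the original units.\n",
"# Assumes `scaler_y` is the MinMaxScaler fitted inside input_scale() and\n",
"# `y_pred_scaled` is an array of model predictions in the scaled space.\n",
"def inverse_scale_predictions(y_pred_scaled, scaler_y):\n",
"    y_pred_scaled = np.asarray(y_pred_scaled).reshape(-1, 1)\n",
"    return scaler_y.inverse_transform(y_pred_scaled).reshape(-1)\n"
]
},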
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.3. Automatic Hypermeter Tuning\n",
"We investigated different values for each parameter, trying to tune it manually, which shows how difficult and time-consuming this process is. So, the next step is to use a simple automatic tuning method, Grird Search, to find the optimal hyperparameter values. To do so, we will use the *GirdSearchCV()* function of the Scikit-Learn library. \n",
"This method gets possible values for each parameter and then tests all possible combinations one by one. It finds the optimal value, but it is very slow. It also uses cross-validation to evaluate each combination. \n",
"\n",
"First, we divide and scale our whole dataset for the automatic tuning. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize dictionaries for storing scaled test datasets.\n",
"x_test_scaled = {}\n",
"y_test_scaled = {}\n",
"x_test = {} # Missing declaration in your provided code.\n",
"y_test = {} # Missing declaration in your provided code.\n",
"\n",
"\n",
"# Assigning features by selecting all but the last column from the data_train DataFrame and resetting the index.\n",
"x_train = data_train.iloc[:, 2:-1].reset_index(drop=True)\n",
"# Assigning the target by selecting the last column from the data_train DataFrame and resetting the index.\n",
"y_train = data_train.iloc[:, -1].reset_index(drop=True)\n",
"\n",
"# Scale the training data and retrieve the scalers for later use on the test data.\n",
"x_train_scaled, y_train_scaled, scaler_x, scaler_y = input_scale(x_train, y_train)\n",
"\n",
"# Loop over each station name from the list of station IDs.\n",
"for station_name in station_list:\n",
" # Extract and store the features for the test data for each station.\n",
" x_test[station_name] = data_test[data_test.station_id == station_name].iloc[:, 2:-1]\n",
" # Extract and store the target variable for the test data for each station.\n",
" y_test[station_name] = data_test[data_test.station_id == station_name].iloc[:, -1]\n",
" # Scale the extracted test features and targets using the previously fitted scalers.\n",
" x_test_scaled[station_name] = scaler_x.transform(x_test[station_name])\n",
" y_test_scaled[station_name] = scaler_y.transform(y_test[station_name].values.reshape(-1, 1)).reshape(-1)\n",
"\n",
"\n",
"with open(f\"{output_path}x_tes.pkl\", 'wb') as file:\n",
" pickle.dump(x_test_scaled, file)\n",
"with open(f\"{output_path}y_test.pkl\", 'wb') as file:\n",
" pickle.dump(y_test_scaled, file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we select the possible values or range of values. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define the range of hyperparameters for XGBoost tuning.\n",
"# Note that 'range(100, 300, 200)' implies a single value because the step size leads directly to the limit.\n",
"# If you intend multiple steps, adjust the range appropriately.\n",
"hyperparameters_xgboost = {\n",
" 'max_depth': range(2, 4), # Generates [2, 3] because 'range' is exclusive of the stop value.\n",
" 'n_estimators': range(100, 1000, 200), \n",
" 'eta': [0.1, 0.01, 0.05] \n",
"}\n",
"\n",
"# Paths for saving the tuned hyperparameters and the trained model.\n",
"path_model_save_hyperparameters = f\"{output_path}best_model_hyperparameters_xgboost.pkl\"\n",
"path_model_save_model = f\"{output_path}best_model_xgboost.pkl\"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we create a new XGBoost model and the grid search function. Then, we will run the function and compare the results with those in the previous section. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize an XGBoost regressor model.\n",
"xgboost_model_automatic = xgb.XGBRegressor()\n",
"\n",
"# Setup GridSearchCV with the XGBoost model and hyperparameter grid.\n",
"grid_search_3 = GridSearchCV(estimator=xgboost_model_automatic, # Corrected to use the initialized model\n",
" param_grid=hyperparameters_xgboost, # Dictionary of parameters to try\n",
" scoring='neg_mean_absolute_error', # Scoring method MAE, reported as negative for maximization\n",
" cv=3, # Number of cross-validation folds\n",
" n_jobs=-1, # Use all available CPU cores\n",
" verbose=0) # Show detailed progress (level 3)\n",
"\n",
"# Fit the GridSearchCV to the scaled training data.\n",
"grid_search_3.fit(x_train_scaled, y_train_scaled)\n",
"\n",
"# Retrieve the best estimator from the grid search.\n",
"optimized_xgboost_model = grid_search_3.best_estimator_\n",
"\n",
"# Output the best parameters and the corresponding score for those parameters.\n",
"print(f\"Best parameters found: {grid_search_3.best_params_}\")\n",
"print(f\"Best RMSE: {abs(grid_search_3.best_score_)}\") # Print the absolute value of the RMSE\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Remember to save the best parameters after finding them!!!!!!**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"joblib.dump(grid_search_3, path_model_save_hyperparameters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- <div style=\"display: flex; justify-content: center; align-items: center; margin: auto;\">\n",
" <div style=\"margin: 10px;\">\n",
" <img src=\"../04.pic/q2.png\" alt=\"Image 1\" width=\"300\">\n",
"</div>\n",
" -->\n",
"\n",
"<center>\n",
"<strong style=\"font-size: 24px;\">What other methods can we use to tune the parameters????</strong>\n",
"</center>\n",
"\n"
]
},
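{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common alternative is randomized search, which evaluates only a fixed number of randomly sampled combinations instead of the full grid; Bayesian optimization libraries such as Optuna are another option. The cell below is an illustrative sketch using Scikit-Learn's *RandomizedSearchCV()* on the grid defined above; the variable name *random_search* and the choice of *n_iter=10* are arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: randomized search over the same hyperparameter space.\n",
"# Only `n_iter` randomly sampled combinations are evaluated, so it is much faster\n",
"# than an exhaustive grid search at the cost of possibly missing the best setting.\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"random_search = RandomizedSearchCV(estimator=xgb.XGBRegressor(),\n",
"                                   param_distributions=hyperparameters_xgboost,\n",
"                                   n_iter=10,\n",
"                                   scoring='neg_mean_absolute_error',\n",
"                                   cv=3,\n",
"                                   n_jobs=-1,\n",
"                                   random_state=42)\n",
"# random_search.fit(x_train_scaled, y_train_scaled)\n",
"# print(random_search.best_params_)\n"
]
},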
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the training dataset is too big (which is the case in real-world examples), we only use a small part to tune the parameters. Then, we have to train the model on the full dataset and save the model using the code below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fit the optimized XGBoost model to the scaled training data.\n",
"optimized_xgboost_model = optimized_xgboost_model.fit(x_train_scaled, y_train_scaled)\n",
"\n",
"# Save the trained model to a file using pickle. This serialized file can be loaded later to make predictions.\n",
"pickle.dump(optimized_xgboost_model, open(path_model_save_model, \"wb\"))\n"
]
},
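{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, when the full dataset is too large, we can tune the hyperparameters on a random subset and then refit the chosen model on all the data. The cell below is an illustrative sketch of drawing such a subset; the names *x_tune* and *y_tune* and the 10% fraction are arbitrary choices, not part of the original workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: tune on a 10% random subset, then refit on the full data.\n",
"x_tune, _, y_tune, _ = train_test_split(x_train_scaled, y_train_scaled,\n",
"                                        train_size=0.1, random_state=42)\n",
"# grid_search_3.fit(x_tune, y_tune)               # tune on the small subset\n",
"# best_model = grid_search_3.best_estimator_\n",
"# best_model.fit(x_train_scaled, y_train_scaled)  # refit on the full dataset\n"
]
},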
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5.4. Feature Selection\n",
"Feature selection is an important part of preprocessing the data, which we skipped since we first had to learn the model structure. After training the model, decision-tree ensembles can show us the importance of each feature in the prediction process. Then, based on the importance, we can remove less important features to make the model more complex.\n",
"\n",
"First we will try it for one station and the model that we trained with one station data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(f'{output_path}best_manuall_model.pkl', 'rb') as file:\n",
" xgboost_model = pickle.load(file)\n",
" \n",
"# Extract the feature names from the training dataset.\n",
"cols = x_train.columns\n",
"# Create a DataFrame containing the feature importances extracted from the optimized XGBoost model.\n",
"# Transpose the DataFrame for easier plotting (columns become rows and vice versa).\n",
"FI = pd.DataFrame(xgboost_model.feature_importances_, index=cols, columns=['Importance'])\n",
"\n",
"# Plotting the feature importances as a horizontal bar chart.\n",
"ax = FI.sort_values('Importance', ascending=True).plot.barh() # Sorting helps in better visualization.\n",
"ax.get_legend().remove() # Remove the legend since it's typically not needed for a single-variable plot.\n",
"plt.title('Feature Importance') # Setting the title of the plot.\n",
"plt.xlabel('Importance') # Adding an x-label for clarity.\n",
"plt.ylabel('Features') # Adding a y-label for clarity.\n",
"plt.show() # Ensure the plot is displayed.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will try it for all the stations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the feature names from the training dataset.\n",
"cols = x_train.columns\n",
"\n",
"# Create a DataFrame containing the feature importances extracted from the optimized XGBoost model.\n",
"# Transpose the DataFrame for easier plotting (columns become rows and vice versa).\n",
"FI = pd.DataFrame(optimized_xgboost_model.feature_importances_, index=cols, columns=['Importance'])\n",
"\n",
"# Plotting the feature importances as a horizontal bar chart.\n",
"ax = FI.sort_values('Importance', ascending=True).plot.barh() # Sorting helps in better visualization.\n",
"ax.get_legend().remove() # Remove the legend since it's typically not needed for a single-variable plot.\n",
"plt.title('Feature Importance') # Setting the title of the plot.\n",
"plt.xlabel('Importance') # Adding an x-label for clarity.\n",
"plt.ylabel('Features') # Adding a y-label for clarity.\n",
"plt.show() # Ensure the plot is displayed.\n"
]
},
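{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we know the importances, a simple (illustrative) way to act on them is to drop the least important features and retrain. The sketch below keeps features whose importance exceeds an arbitrary 1% threshold; the names *important_cols*, *x_train_reduced*, and *reduced_model* are placeholders, and any reduced model should be compared against the original on the test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: keep only features whose importance exceeds a threshold,\n",
"# then retrain a model on the reduced feature set.\n",
"important_cols = FI[FI['Importance'] > 0.01].index\n",
"x_train_reduced = x_train[important_cols]\n",
"\n",
"# reduced_scaler = MinMaxScaler()\n",
"# reduced_model = xgb.XGBRegressor(**grid_search_3.best_params_)\n",
"# reduced_model.fit(reduced_scaler.fit_transform(x_train_reduced), y_train_scaled)\n"
]
},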
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**LETS GO TO THE NEXT PART**](./03.tutorial_post_processing_xgboost_evaluation.ipynb)"
]
}
],
"metadata": {
"interpreter": {
"hash": "7d941b521942abff02888ea7873cca51c2aac5fb2f3b440dbf15a61d263ddb0d"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}