
Built a supervised model that achieved a 0.6969 ROC AUC score for predicting the winning team of an NBA game

yhchan0918/NBA_Data_Analysis

What is NBA?

  • The National Basketball Association (NBA) is a professional basketball league in North America. It is the premier men's professional basketball league in the world.

NBA Data Analysis & NBA Game Winner Team Estimator: Project Overview

  • Created an NBA Game Winner Team Estimator that predicts the winning team from prior data for the home and away teams, covering seasons 2003 to 2020
  • Achieved a 0.6969 ROC AUC score with the Logistic Regression model
  • Tuned KNN, Logistic Regression, Random Forest, and XGBoost classifiers with GridSearchCV to find the best model
  • Built a backend API with FastAPI and a frontend app with Streamlit
  • Live Project Link: https://nba-game-predict-app.herokuapp.com/

Data Source

https://www.kaggle.com/nathanlauga/nba-games

Data Cleaning

  • Every row has 43 columns. Note: a record is calculated as total wins divided by the sum of total wins and total losses
  • Columns and their meanings:
    GAME_ID: ID of the match
    G_home: Number of games played on the current season by Home Team
    W_PCT_home: Win % on current season of Home Team
    HOME_RECORD_home: Home record on the current season of Home Team
    ROAD_RECORD_home: Road record on the current season of Home Team
    W_PCT_prev_home: Win % on previous season of Home Team
    HOME_RECORD_prev_home: Home record on the previous season of Home Team
    ROAD_RECORD_prev_home: Road record on the previous season of Home Team
    G_away: Number of games played on the current season by Away Team
    W_PCT_away: Win % on current season of Away Team
    HOME_RECORD_away: Home record on the current season of Away Team
    ROAD_RECORD_away: Road record on the current season of Away Team
    W_PCT_prev_away: Win % on previous season of Away Team
    HOME_RECORD_prev_away: Home record on the previous season of Away Team
    ROAD_RECORD_prev_away: Road record on the previous season of Away Team
    WIN_PRCT_home_3g: Mean Win % on previous 3 games of Home Team
    PTS_home_3g: Mean Number of points scored by Home Team on previous 3 games
    FG_PCT_home_3g: Mean Field Goal Percentage by Home Team on previous 3 games
    FT_PCT_home_3g: Mean Free Throw Percentage by Home Team on previous 3 games
    FG3_PCT_home_3g: Mean Three Point Percentage by Home Team on previous 3 games
    AST_home_3g: Mean Assists by Home Team on previous 3 games
    REB_home_3g: Mean Rebounds by Home Team on previous 3 games
    WIN_PRCT_away_3g: Mean Win % by Away Team on previous 3 games
    PTS_away_3g: Mean Number of points scored by Away Team on previous 3 games
    FG_PCT_away_3g: Mean Field Goal Percentage by Away Team on previous 3 games
    FT_PCT_away_3g: Mean Free Throw Percentage by Away Team on previous 3 games
    FG3_PCT_away_3g: Mean Three Point Percentage by Away Team on previous 3 games
    AST_away_3g: Mean Assists by Away Team on previous 3 games
    REB_away_3g: Mean Rebounds by Away Team on previous 3 games
    WIN_PRCT_home_10g: Mean Win % on previous 10 games of Home Team
    PTS_home_10g: Mean Number of points scored by Home Team on previous 10 games
    FG_PCT_home_10g: Mean Field Goal Percentage by Home Team on previous 10 games
    FT_PCT_home_10g: Mean Free Throw Percentage by Home Team on previous 10 games
    FG3_PCT_home_10g: Mean Three Point Percentage by Home Team on previous 10 games
    AST_home_10g: Mean Assists by Home Team on previous 10 games
    REB_home_10g: Mean Rebounds by Home Team on previous 10 games
    WIN_PRCT_away_10g: Mean Win % by Away Team on previous 10 games
    PTS_away_10g: Mean Number of points scored by Away Team on previous 10 games
    FG_PCT_away_10g: Mean Field Goal Percentage by Away Team on previous 10 games
    FT_PCT_away_10g: Mean Free Throw Percentage by Away Team on previous 10 games
    FG3_PCT_away_10g: Mean Three Point Percentage by Away Team on previous 10 games
    AST_away_10g: Mean Assists by Away Team on previous 10 games
    REB_away_10g: Mean Rebounds by Away Team on previous 10 games
    GAME_DATE_EST: Game's date
    SEASON: Season when the game occurred
    HOME_TEAM_WINS: Whether the Home Team won (target variable)
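
The record formula and the "previous 3 games" / "previous 10 games" features above can be sketched in plain Python. This is only an illustration, not the project's actual cleaning code: the function names are hypothetical, and it assumes that the current game is excluded from its own average and that shorter windows are averaged over whatever previous games exist.

```python
from collections import deque


def win_record(wins: int, losses: int) -> float:
    """Record = wins / (wins + losses); 0.0 when no games have been played."""
    total = wins + losses
    return wins / total if total else 0.0


def rolling_mean_previous(values, window):
    """Mean of up to `window` previous values, excluding the current one.

    Returns None at positions where no previous game exists (assumption).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` values
    out = []
    for v in values:
        out.append(sum(recent) / len(recent) if recent else None)
        recent.append(v)  # the current game becomes "previous" for the next one
    return out


# Hypothetical points scored by one team, in chronological order.
pts = [100, 110, 90, 120, 105]
pts_3g = rolling_mean_previous(pts, window=3)
```

Excluding the current game matters here: averaging it in would leak the outcome being predicted into the features.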

EDA

I did some EDA on the final games data. Out of curiosity, I also did some EDA on LeBron's stats. Below are some highlights.

Overall Games From All Seasons

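[Figure: two stacked bar charts of game outcomes by mean Win % of the Home Team on the previous 3 games (left) and previous 10 games (right)]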

  • In the left stacked bar chart, reading from bottom to top, the light blue bar grows as the mean Win % on previous 3 games of Home Team increases
  • In the right stacked bar chart, reading from bottom to top, the light blue bar grows as the mean Win % on previous 10 games of Home Team increases

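[Figure: two stacked bar charts of game outcomes by mean Win % of the Away Team on the previous 3 games (left) and previous 10 games (right)]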

  • In the left stacked bar chart, reading from bottom to top, the light blue bar shrinks as the mean Win % on previous 3 games of Away Team increases

  • In the right stacked bar chart, reading from bottom to top, the light blue bar shrinks as the mean Win % on previous 10 games of Away Team increases

  • We can conclude that:

    • The higher the home team's win % over its previous games, the higher the chance that the home team will win
    • The higher the away team's win % over its previous games, the lower the chance that the home team will win

LeBron's stats

[Two figures of LeBron's stats]

Model Building

First, I used seasons 2004 to 2018 as the training set and season 2019 as the test set. I excluded season 2020 because COVID-19 was an unexpected variable that made the games in that season less balanced. I then prepared standard-scaled data for the Logistic Regression model and min-max-scaled data for the K-Nearest Neighbors model.
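
As a rough illustration of the split and the two scalings described above, here is a minimal standard-library sketch. The sample rows and helper names are hypothetical, and it assumes the scaling parameters are fitted on the training seasons only and then reused on the test season; the README does not state how the scalers were fitted.

```python
from statistics import mean, pstdev


def split_by_season(rows, train_seasons, test_season):
    """Season-based split; any season outside both sets (e.g. 2020) is dropped."""
    train = [r for r in rows if r["SEASON"] in train_seasons]
    test = [r for r in rows if r["SEASON"] == test_season]
    return train, test


def fit_standard(values):
    """Return a function mapping x to (x - mean) / population std of `values`."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against constant columns
    return lambda x: (x - mu) / sigma


def fit_minmax(values):
    """Return a function mapping x to (x - min) / (max - min) of `values`."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against constant columns
    return lambda x: (x - lo) / span


# Hypothetical rows; the real data has many more columns.
rows = [
    {"SEASON": 2004, "W_PCT_home": 0.40},
    {"SEASON": 2010, "W_PCT_home": 0.60},
    {"SEASON": 2018, "W_PCT_home": 0.80},
    {"SEASON": 2019, "W_PCT_home": 0.70},
    {"SEASON": 2020, "W_PCT_home": 0.50},
]
train, test = split_by_season(rows, range(2004, 2019), 2019)

# Fit on the training column only, then apply to the test column.
train_col = [r["W_PCT_home"] for r in train]
standardize = fit_standard(train_col)
minmax = fit_minmax(train_col)
test_std = [standardize(r["W_PCT_home"]) for r in test]
test_mm = [minmax(r["W_PCT_home"]) for r in test]
```

Splitting by season rather than at random keeps every test game later in time than every training game.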

I evaluated the models using the ROC AUC score. I chose it because this is an imbalanced dataset, and because it measures how well a model separates the positive class from the negative class.
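
For intuition, ROC AUC can be read as the probability that a randomly chosen positive example (here, a home win) receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are made up, and the project itself presumably relied on a library implementation rather than this function.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as P(score of a random positive > score of a random negative).

    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = home team won) and predicted scores.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.5]
auc = roc_auc(y_true, y_score)
```

The pairwise loop is quadratic in the number of games, so it is only suitable for small illustrations. A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking.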

I tried four different models:

  • Logistic Regression
  • K-Nearest Neighbors
  • Random Forest
  • XGBoost

Model performance

The Logistic Regression model slightly outperformed the other approaches under cross-validation:

  • Logistic Regression : ROC AUC score = 0.6969
  • K-Nearest Neighbors : ROC AUC score = 0.6519
  • Random Forest : ROC AUC score = 0.6966
  • XGBoost : ROC AUC score = 0.6961

Productionization

In this step, I built a backend API endpoint with FastAPI and a frontend app with Streamlit. Both are packaged with Docker and deployed live on Heroku. The API endpoint accepts a request containing a list of values from the home team's stats and the away team's stats, and returns an estimated outcome for the current game.
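
The request and response shape described above can be sketched independently of the web framework. Everything here is hypothetical: the field names (`home_stats`, `away_stats`), the two-feature input, the weights, and the function name are placeholders for illustration, and the deployed service presumably uses the trained model rather than hard-coded coefficients. In a FastAPI app, a function like this would be called from a route handler.

```python
import math

# Hypothetical weights for illustration only; not the project's trained model.
WEIGHTS_HOME = [1.5, 0.02]
WEIGHTS_AWAY = [-1.5, -0.02]
BIAS = 0.2


def predict_home_win(payload: dict) -> dict:
    """Validate the payload, then score it with a logistic function."""
    home = payload.get("home_stats")
    away = payload.get("away_stats")
    if not isinstance(home, list) or not isinstance(away, list):
        raise ValueError("home_stats and away_stats must be lists of numbers")
    if len(home) != len(WEIGHTS_HOME) or len(away) != len(WEIGHTS_AWAY):
        raise ValueError("unexpected number of feature values")
    # Linear score over home and away features, squashed to a probability.
    z = BIAS
    z += sum(w * x for w, x in zip(WEIGHTS_HOME, home))
    z += sum(w * x for w, x in zip(WEIGHTS_AWAY, away))
    prob = 1.0 / (1.0 + math.exp(-z))
    return {"home_team_wins": prob >= 0.5, "home_win_probability": prob}


# Hypothetical request body: [recent win %, recent mean points] per team.
response = predict_home_win(
    {"home_stats": [0.7, 110.0], "away_stats": [0.4, 104.0]}
)
```

Rejecting payloads with the wrong number of values up front keeps a malformed request from silently producing a prediction.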
