This was a research project conducted by my group member (Shadi Chamseddine) and me for our STAT 5703 W Data Mining course.
We use a Random Forest model to predict an NBA player's position based solely on their in-game statistics. Our data consists of player-level game statistics from Basketball-Reference.com for the 1982 through 2019 seasons inclusive. The 1982 season was chosen as the starting point because it is the earliest season in which all of the variables relevant to our analysis were tracked and collected by the NBA. We first fit a Random Forest model with default parameters to identify the most important variables, where importance is measured by each variable's mean decrease in Gini impurity. Because the number of variables is very large, we then run the Random Forest again using only the most important variables from the first run, identified with the help of a boosting pass over the variable set. This reduces the dimensionality to the top 50% most important variables, which later serve as the predictors in our four refined models.
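As a rough illustration of this first pass, here is a minimal sketch in Python with scikit-learn (the file name, column names, and cut-off are assumptions for illustration, not our actual pipeline; scikit-learn's `feature_importances_` corresponds to the mean decrease in Gini impurity we use for ranking):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical player-level game statistics (1982-2019) from Basketball-Reference
stats = pd.read_csv("nba_player_stats_1982_2019.csv")

X = stats.drop(columns=["Pos"])   # in-game statistics only
y = stats["Pos"]                  # player position (PG, SG, SF, PF, C)

# First pass: Random Forest with default parameters, used only to rank variables
rf_full = RandomForestClassifier(random_state=42)
rf_full.fit(X, y)

# feature_importances_ gives the mean decrease in Gini impurity per variable
importance = pd.Series(rf_full.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

# Keep only the top 50% most important variables for the refined models
top_half = importance.index[: len(importance) // 2]
X_reduced = X[top_half]
```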
Following this, we split the data into a training set (70%) and a test set (30%). With the training and test sets prepared, the next step is to tune the parameters of the Random Forest model in order to improve the accuracy of our results. Although the Random Forest builds each decision tree on a bootstrap sample of the training data, roughly one-third of the observations are left out of each sample; these form the out-of-bag (OOB) set. A separate test set is still required, however, because we use the OOB error to tune the model's parameters. We begin by fitting a Random Forest on the training set to find an optimal number of decision trees. Once that parameter is fixed, we fit another Random Forest on the training set to find an optimal number of variables to consider at each split in each tree. With both parameters optimized, we fit a final Random Forest on the training set and use it to predict on the test set.
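A sketch of the split and the OOB-based tuning, again in scikit-learn terms (the parameter grids and variable names are illustrative assumptions, not the exact values we searched; scikit-learn reports an OOB accuracy, so maximizing it is equivalent to minimizing OOB error):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# 70% training / 30% test split of the reduced data
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.30, stratify=y, random_state=42
)

# Tune the number of trees using the out-of-bag (OOB) score on the training set
best_ntree, best_oob = None, -1.0
for ntree in [100, 250, 500, 1000]:
    rf = RandomForestClassifier(n_estimators=ntree, oob_score=True,
                                random_state=42)
    rf.fit(X_train, y_train)
    if rf.oob_score_ > best_oob:
        best_ntree, best_oob = ntree, rf.oob_score_

# Tune the number of variables considered at each split (mtry / max_features)
best_mtry, best_oob = None, -1.0
for mtry in range(1, X_train.shape[1] + 1):
    rf = RandomForestClassifier(n_estimators=best_ntree, max_features=mtry,
                                oob_score=True, random_state=42)
    rf.fit(X_train, y_train)
    if rf.oob_score_ > best_oob:
        best_mtry, best_oob = mtry, rf.oob_score_

# Final model with the tuned parameters, then predict on the held-out test set
rf_final = RandomForestClassifier(n_estimators=best_ntree, max_features=best_mtry,
                                  random_state=42)
rf_final.fit(X_train, y_train)
y_pred = rf_final.predict(X_test)
```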
Our final Random Forest model does a good job of identifying a player's position based solely on their in-game statistics: its accuracy on the held-out test set is 71.7%, meaning the model correctly classifies a player's position 71.7% of the time. By boosting and tuning our model, we are able to achieve greater classification accuracy than with the default settings.
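For completeness, the test-set accuracy quoted above would be computed along these lines (a sketch only; the 71.7% figure comes from our actual run, not from this snippet):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Overall classification accuracy on the held-out 30% test set
acc = accuracy_score(y_test, y_pred)
print(f"Test-set accuracy: {acc:.3f}")

# The confusion matrix shows which positions are most often mistaken for one another
print(confusion_matrix(y_test, y_pred))
```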