
Using exploratory data analysis and k-means clustering to analyze competitive balance in soccer/football


tara-nguyen/soccer-competitiveness-k-means-clustering


Analyzing the Competitive Balance of Different Soccer Leagues

  • AUTHOR: TARA NGUYEN
  • Project for the course Exploratory Data Analysis and Visualization at UCLA Extension
  • Completed in December 2020

Abstract

Background: Competitive balance, which refers to the degree of uncertainty regarding the outcome of a competition, is frequently debated among soccer fans and has received considerable attention both in and outside academia.

Data and research question: In this project I analyzed team performances (points per game, win proportions, etc.) in four soccer leagues from the 2015/2016 season through the 2019/2020 season. The main research question was: Which soccer league is the most competitive?
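As a rough illustration of the performance measures named above, the following R sketch derives points per game and win proportion from a season-end league table. The data frame, its column names, and its values are hypothetical and are not taken from the files in this repo; the 3-1-0 points system is assumed.

```r
# Hypothetical season-end league table (made-up values, for illustration only).
league_table <- data.frame(
  team   = c("X", "Y", "Z"),
  played = c(34, 34, 34),
  wins   = c(24, 12, 6),
  draws  = c(6, 10, 8),
  losses = c(4, 12, 20)
)

# Assumed scoring: 3 points for a win, 1 for a draw, 0 for a loss.
league_table$points <- 3 * league_table$wins + league_table$draws

# Points per game and win proportion for each team.
league_table$ppg      <- league_table$points / league_table$played
league_table$win_prop <- league_table$wins / league_table$played

print(league_table)
```

Dividing by games played keeps the measures comparable across leagues whose seasons differ in length.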

Method and findings: The project was completed in R, except for part of the data cleaning, which was done in Excel. Through exploratory data analysis and k-means clustering, I found that, in general, Major League Soccer was more competitive than the Bundesliga, La Liga, and the Premier League.
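For readers unfamiliar with k-means clustering in R, here is a minimal sketch that uses the base `kmeans()` function on made-up team data. It is not the code from `leaguescompetitiveness_Analysis.R`: the league labels, column names, values, and the choice of three clusters are all assumptions made for illustration.

```r
# Hypothetical team-season data for two invented leagues (values are made up).
set.seed(42)
teams <- data.frame(
  league   = rep(c("A", "B"), each = 6),
  ppg      = c(2.4, 2.1, 1.5, 1.3, 1.0, 0.7, 1.8, 1.6, 1.5, 1.4, 1.3, 1.2),
  win_prop = c(0.74, 0.63, 0.42, 0.34, 0.24, 0.16,
               0.53, 0.45, 0.42, 0.39, 0.37, 0.32)
)

# Standardize the features so neither dominates the Euclidean distance.
features <- scale(teams[, c("ppg", "win_prop")])

# k = 3 is an arbitrary choice here; nstart reruns the algorithm from
# several random starting points and keeps the best solution.
fit <- kmeans(features, centers = 3, nstart = 25)
teams$cluster <- fit$cluster

# Cross-tabulate leagues against clusters to see how teams are distributed.
cluster_by_league <- table(teams$league, teams$cluster)
print(cluster_by_league)
```

One possible reading of such a table is that a league whose teams are spread across strong and weak clusters is less balanced than one whose teams fall mostly in the same cluster.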

For a complete report, see the wiki page

List of files and directories in the repo

Plots - directory for plots created during data visualization

README.md - the document you are currently reading

References - directory for academic articles on competitive balance

all-form-leaguetables.csv - final data set

all-leaguetables.xlsx - season-end league tables for all 5 seasons of all 4 leagues

form-bundesliga.csv, form-epl.csv, form-laliga.csv, form-mls.csv - form tables for all 5 seasons of the Bundesliga, the EPL, La Liga, and MLS, respectively

leaguescompetitiveness_Analysis.R - main R script for data wrangling, visualization, and statistical analyses

leaguesfinaldat_DataWrangling.R - R script for creating the final data set

Usage Note

The dataset and R scripts are free for download and use, provided that proper credit is given.

If you mention or use any part of my research report, please provide a link to this repo.